2023-06-17 16:35:02,095 INFO [train.py:1064] (1/4) Training started
2023-06-17 16:35:02,095 INFO [train.py:1074] (1/4) Device: cuda:1
2023-06-17 16:35:03,783 INFO [lexicon.py:168] (1/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-17 16:35:03,981 INFO [train.py:1085] (1/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0423201227-84b4557756-8lx4n', 'IP address': '10.177.6.147'}, 'world_size': 4, 'master_port': 12537, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small_causal'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-17 16:35:03,981 INFO [train.py:1087] (1/4) About to create model
2023-06-17 16:35:04,460 INFO [train.py:1091] (1/4) Number of model parameters: 32669302
2023-06-17 16:35:08,621 INFO [train.py:1106] (1/4) Using DDP
2023-06-17 16:35:10,581 INFO [asr_datamodule.py:390] (1/4) About to get train cuts
2023-06-17 16:35:10,588 INFO [asr_datamodule.py:398] (1/4) About to get dev cuts
2023-06-17 16:35:10,589 INFO [asr_datamodule.py:211] (1/4) About to get Musan cuts
2023-06-17 16:35:12,837 INFO [asr_datamodule.py:216] (1/4) Enable MUSAN
2023-06-17 16:35:12,838 INFO [asr_datamodule.py:239] (1/4) Enable SpecAugment
2023-06-17 16:35:12,838 INFO [asr_datamodule.py:240] (1/4) Time warp factor: 80
2023-06-17 16:35:12,838 INFO [asr_datamodule.py:250] (1/4) Num frame mask: 10
2023-06-17 16:35:12,838 INFO [asr_datamodule.py:263] (1/4) About to create train dataset
2023-06-17 16:35:12,838 INFO [asr_datamodule.py:289] (1/4) Using DynamicBucketingSampler.
2023-06-17 16:35:16,217 INFO [asr_datamodule.py:305] (1/4) About to create train dataloader
2023-06-17 16:35:16,218 INFO [asr_datamodule.py:336] (1/4) About to create dev dataset
2023-06-17 16:35:16,737 INFO [asr_datamodule.py:354] (1/4) About to create dev dataloader
2023-06-17 16:37:08,029 INFO [train.py:996] (1/4) Epoch 1, batch 0, loss[loss=10.53, simple_loss=9.56, pruned_loss=9.665, over 21621.00 frames. ], tot_loss[loss=10.53, simple_loss=9.56, pruned_loss=9.665, over 21621.00 frames. ], batch size: 298, lr: 2.25e-02, grad_scale: 1.0
2023-06-17 16:37:08,030 INFO [train.py:1019] (1/4) Computing validation loss
2023-06-17 16:37:25,619 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=10.9, simple_loss=9.897, pruned_loss=10.04, over 1796401.00 frames.
2023-06-17 16:37:25,620 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 22602MB
2023-06-17 16:37:34,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=0.0, ans=0.5
2023-06-17 16:37:36,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=0.0, ans=0.9
2023-06-17 16:37:45,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=60.0, ans=0.5
2023-06-17 16:38:06,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=82.95 vs. limit=7.545
2023-06-17 16:38:09,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=255.68 vs. limit=7.59
2023-06-17 16:38:09,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=92.01 vs. limit=7.545
2023-06-17 16:38:15,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=175.31 vs. limit=7.59
2023-06-17 16:38:18,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=255.79 vs. limit=7.635
2023-06-17 16:38:50,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=173.91 vs. limit=7.5675
2023-06-17 16:39:11,362 INFO [train.py:996] (1/4) Epoch 1, batch 50, loss[loss=1.383, simple_loss=1.23, pruned_loss=1.379, over 21433.00 frames. ], tot_loss[loss=4.132, simple_loss=3.816, pruned_loss=3.111, over 960804.21 frames. ], batch size: 194, lr: 2.48e-02, grad_scale: 0.5
2023-06-17 16:39:14,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=52.83 vs. limit=7.6125
2023-06-17 16:39:20,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=231.32 vs.
limit=7.6125 2023-06-17 16:39:22,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=300.0, ans=0.4625 2023-06-17 16:39:31,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=360.0, ans=0.2964 2023-06-17 16:39:38,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=163.82 vs. limit=5.18 2023-06-17 16:39:43,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=360.0, ans=0.2054 2023-06-17 16:39:51,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=10.29 vs. limit=3.063 2023-06-17 16:39:53,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=168.89 vs. limit=7.6575 2023-06-17 16:39:53,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=106.88 vs. limit=5.21 2023-06-17 16:40:00,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=176.58 vs. limit=7.6575 2023-06-17 16:40:01,614 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 16:40:06,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=248.79 vs. limit=7.86 2023-06-17 16:40:07,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=480.0, ans=0.20720000000000002 2023-06-17 16:40:39,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=156.17 vs. limit=5.27 2023-06-17 16:40:45,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=540.0, ans=0.4746875 2023-06-17 16:40:51,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=78.74 vs. limit=7.725 2023-06-17 16:40:52,294 INFO [train.py:996] (1/4) Epoch 1, batch 100, loss[loss=1.358, simple_loss=1.173, pruned_loss=1.476, over 21200.00 frames. ], tot_loss[loss=2.603, simple_loss=2.371, pruned_loss=2.156, over 1684610.86 frames. ], batch size: 159, lr: 2.70e-02, grad_scale: 1.0 2023-06-17 16:40:56,007 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 2.341e+02 3.851e+02 6.975e+03 2.847e+04, threshold=7.702e+02, percent-clipped=0.0 2023-06-17 16:40:59,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=135.74 vs. limit=7.95 2023-06-17 16:41:06,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=122.92 vs. limit=7.725 2023-06-17 16:41:12,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. 
limit=4.264 2023-06-17 16:41:25,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=80.19 vs. limit=7.995 2023-06-17 16:41:28,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=720.0, ans=0.46625 2023-06-17 16:41:33,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=720.0, ans=0.46625 2023-06-17 16:41:41,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=209.80 vs. limit=5.36 2023-06-17 16:41:43,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=37.33 vs. limit=8.04 2023-06-17 16:41:51,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=780.0, ans=0.17075 2023-06-17 16:42:05,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=780.0, ans=0.4634375 2023-06-17 16:42:21,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=102.42 vs. limit=7.815 2023-06-17 16:42:24,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=21.45 vs. limit=7.815 2023-06-17 16:42:25,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=42.48 vs. limit=7.815 2023-06-17 16:42:37,161 INFO [train.py:996] (1/4) Epoch 1, batch 150, loss[loss=1.122, simple_loss=0.9609, pruned_loss=1.174, over 21784.00 frames. ], tot_loss[loss=2.004, simple_loss=1.801, pruned_loss=1.775, over 2264507.79 frames. ], batch size: 282, lr: 2.93e-02, grad_scale: 1.0 2023-06-17 16:42:37,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=900.0, ans=0.4578125 2023-06-17 16:42:38,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=8.175 2023-06-17 16:42:53,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=49.48 vs. limit=7.86 2023-06-17 16:42:53,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=41.16 vs. limit=7.86 2023-06-17 16:43:00,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=86.00 vs. limit=7.86 2023-06-17 16:43:02,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=66.94 vs. limit=7.86 2023-06-17 16:43:05,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=960.0, ans=0.8664000000000001 2023-06-17 16:43:14,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.28 vs. 
limit=4.408 2023-06-17 16:43:15,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=88.38 vs. limit=7.8825 2023-06-17 16:43:20,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=83.56 vs. limit=7.8825 2023-06-17 16:43:26,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=41.57 vs. limit=5.51 2023-06-17 16:43:49,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=22.92 vs. limit=7.905 2023-06-17 16:44:01,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080.0, ans=0.2892 2023-06-17 16:44:04,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1080.0, ans=0.8622000000000001 2023-06-17 16:44:05,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=8.31 2023-06-17 16:44:08,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1140.0, ans=0.5 2023-06-17 16:44:12,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.41 vs. limit=8.355 2023-06-17 16:44:26,081 INFO [train.py:996] (1/4) Epoch 1, batch 200, loss[loss=1.091, simple_loss=0.9352, pruned_loss=1.062, over 21632.00 frames. ], tot_loss[loss=1.669, simple_loss=1.485, pruned_loss=1.519, over 2702821.26 frames. ], batch size: 263, lr: 3.15e-02, grad_scale: 2.0 2023-06-17 16:44:29,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 7.013e+01 1.220e+02 1.520e+02 2.087e+02 3.052e+02, threshold=3.040e+02, percent-clipped=0.0 2023-06-17 16:44:34,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=113.08 vs. limit=7.95 2023-06-17 16:44:38,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=54.15 vs. limit=7.95 2023-06-17 16:44:49,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=4.504 2023-06-17 16:45:07,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.06 vs. limit=7.995 2023-06-17 16:45:38,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.80 vs. limit=5.345 2023-06-17 16:45:53,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.20 vs. limit=8.535 2023-06-17 16:46:05,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=21.69 vs. 
limit=5.72 2023-06-17 16:46:13,945 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.19 vs. limit=8.58 2023-06-17 16:46:16,403 INFO [train.py:996] (1/4) Epoch 1, batch 250, loss[loss=0.9439, simple_loss=0.8113, pruned_loss=0.8619, over 21636.00 frames. ], tot_loss[loss=1.456, simple_loss=1.286, pruned_loss=1.334, over 3064810.05 frames. ], batch size: 471, lr: 3.38e-02, grad_scale: 2.0 2023-06-17 16:46:16,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1500.0, ans=0.8475 2023-06-17 16:46:21,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=18.69 vs. limit=8.0625 2023-06-17 16:46:21,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.40 vs. limit=5.75 2023-06-17 16:46:31,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500.0, ans=0.285 2023-06-17 16:46:42,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=4.624 2023-06-17 16:46:51,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1620.0, ans=0.4240625 2023-06-17 16:46:51,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=4.648 2023-06-17 16:46:52,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1620.0, ans=0.2975 2023-06-17 16:47:02,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=8.1075 2023-06-17 16:47:22,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1680.0, ans=0.42125 2023-06-17 16:47:35,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=8.13 2023-06-17 16:47:40,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1680.0, ans=0.42125 2023-06-17 16:47:44,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=183.09 vs. limit=8.1525 2023-06-17 16:47:58,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=8.1525 2023-06-17 16:48:00,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=8.1525 2023-06-17 16:48:03,327 INFO [train.py:996] (1/4) Epoch 1, batch 300, loss[loss=0.8225, simple_loss=0.7007, pruned_loss=0.7413, over 21576.00 frames. ], tot_loss[loss=1.317, simple_loss=1.155, pruned_loss=1.208, over 3335708.28 frames. 
], batch size: 247, lr: 3.60e-02, grad_scale: 4.0 2023-06-17 16:48:07,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.171e+01 1.173e+02 1.354e+02 1.820e+02 4.361e+02, threshold=2.708e+02, percent-clipped=2.0 2023-06-17 16:48:07,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1800.0, ans=0.0595 2023-06-17 16:48:35,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=8.895 2023-06-17 16:48:39,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1920.0, ans=0.128 2023-06-17 16:48:41,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.31 vs. limit=5.96 2023-06-17 16:49:05,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.19 vs. limit=8.22 2023-06-17 16:49:17,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980.0, ans=0.2802 2023-06-17 16:49:24,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1980.0, ans=0.4071875 2023-06-17 16:49:38,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.97 vs. limit=8.265 2023-06-17 16:49:43,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040.0, ans=0.2796 2023-06-17 16:49:46,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=8.265 2023-06-17 16:49:48,627 INFO [train.py:996] (1/4) Epoch 1, batch 350, loss[loss=0.8657, simple_loss=0.7302, pruned_loss=0.7735, over 21452.00 frames. ], tot_loss[loss=1.203, simple_loss=1.048, pruned_loss=1.099, over 3545985.93 frames. ], batch size: 211, lr: 3.83e-02, grad_scale: 4.0 2023-06-17 16:49:50,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2100.0, ans=0.05275 2023-06-17 16:49:50,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2100.0, ans=0.4015625 2023-06-17 16:50:19,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.76 vs. limit=5.54 2023-06-17 16:50:30,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=9.165 2023-06-17 16:51:11,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.48 vs. 
limit=9.21 2023-06-17 16:51:16,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2280.0, ans=0.393125 2023-06-17 16:51:19,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2340.0, ans=0.8181 2023-06-17 16:51:26,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2340.0, ans=0.3903125 2023-06-17 16:51:36,704 INFO [train.py:996] (1/4) Epoch 1, batch 400, loss[loss=0.8082, simple_loss=0.6811, pruned_loss=0.6968, over 21423.00 frames. ], tot_loss[loss=1.119, simple_loss=0.9679, pruned_loss=1.014, over 3708248.48 frames. ], batch size: 476, lr: 4.05e-02, grad_scale: 8.0 2023-06-17 16:51:39,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.22 vs. limit=5.6 2023-06-17 16:51:40,349 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 8.615e+01 1.452e+02 1.814e+02 2.451e+02 4.544e+02, threshold=3.628e+02, percent-clipped=11.0 2023-06-17 16:51:41,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.38 vs. limit=8.4 2023-06-17 16:51:48,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=8.4 2023-06-17 16:51:49,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2400.0, ans=0.3875 2023-06-17 16:51:58,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=22.89 vs. limit=8.4225 2023-06-17 16:51:59,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=24.02 vs. limit=8.4225 2023-06-17 16:52:17,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2520.0, ans=0.381875 2023-06-17 16:52:28,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=8.445 2023-06-17 16:52:29,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=9.39 2023-06-17 16:52:39,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=5.645 2023-06-17 16:53:13,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2640.0, ans=0.37625 2023-06-17 16:53:26,150 INFO [train.py:996] (1/4) Epoch 1, batch 450, loss[loss=1.052, simple_loss=0.8811, pruned_loss=0.8907, over 21742.00 frames. ], tot_loss[loss=1.074, simple_loss=0.9224, pruned_loss=0.9626, over 3841784.94 frames. ], batch size: 332, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 16:53:28,667 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. 
limit=8.5125 2023-06-17 16:53:46,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=8.535 2023-06-17 16:53:53,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=3.414 2023-06-17 16:53:54,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2760.0, ans=0.0965 2023-06-17 16:53:59,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=48.36 vs. limit=8.535 2023-06-17 16:54:21,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2820.0, ans=0.3678125 2023-06-17 16:54:21,694 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.87 vs. limit=8.557500000000001 2023-06-17 16:54:21,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=9.615 2023-06-17 16:54:39,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=5.152 2023-06-17 16:54:53,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.90 vs. limit=8.58 2023-06-17 16:55:06,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2940.0, ans=0.3621875 2023-06-17 16:55:10,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.99 vs. limit=8.6025 2023-06-17 16:55:14,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=8.625 2023-06-17 16:55:15,280 INFO [train.py:996] (1/4) Epoch 1, batch 500, loss[loss=1.038, simple_loss=0.8695, pruned_loss=0.8511, over 21721.00 frames. ], tot_loss[loss=1.041, simple_loss=0.8895, pruned_loss=0.9197, over 3944180.63 frames. ], batch size: 332, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:55:18,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.10 vs. limit=8.625 2023-06-17 16:55:19,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 9.969e+01 1.768e+02 2.484e+02 3.323e+02 7.392e+02, threshold=4.968e+02, percent-clipped=16.0 2023-06-17 16:55:37,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=5.224 2023-06-17 16:55:47,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=3060.0, ans=0.2694 2023-06-17 16:55:48,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=21.62 vs. limit=8.6475 2023-06-17 16:56:46,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=8.715 2023-06-17 16:57:02,946 INFO [train.py:996] (1/4) Epoch 1, batch 550, loss[loss=0.7591, simple_loss=0.6411, pruned_loss=0.5929, over 21705.00 frames. ], tot_loss[loss=1.011, simple_loss=0.861, pruned_loss=0.8752, over 4006407.04 frames. ], batch size: 124, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:57:13,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=3300.0, ans=0.267 2023-06-17 16:57:21,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.62 vs. limit=6.68 2023-06-17 16:57:38,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=3.504 2023-06-17 16:57:49,135 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.282e+01 2023-06-17 16:57:59,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=3420.0, ans=7.1375 2023-06-17 16:57:59,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=3420.0, ans=7.1375 2023-06-17 16:58:09,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=3420.0, ans=0.7803 2023-06-17 16:58:12,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=8.7825 2023-06-17 16:58:32,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=8.8275 2023-06-17 16:58:50,973 INFO [train.py:996] (1/4) Epoch 1, batch 600, loss[loss=1.016, simple_loss=0.8546, pruned_loss=0.7814, over 21600.00 frames. ], tot_loss[loss=0.983, simple_loss=0.8361, pruned_loss=0.8302, over 4073408.18 frames. ], batch size: 471, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:58:54,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.961e+02 3.893e+02 6.488e+02 1.570e+03, threshold=7.787e+02, percent-clipped=36.0 2023-06-17 16:59:12,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.90 vs. limit=5.915 2023-06-17 16:59:56,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=8.895 2023-06-17 17:00:04,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=3780.0, ans=0.3228125 2023-06-17 17:00:12,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=8.9175 2023-06-17 17:00:36,744 INFO [train.py:996] (1/4) Epoch 1, batch 650, loss[loss=0.7137, simple_loss=0.6113, pruned_loss=0.5139, over 21746.00 frames. ], tot_loss[loss=0.9506, simple_loss=0.809, pruned_loss=0.7816, over 4105419.50 frames. ], batch size: 282, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:01:08,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. 
limit=8.985 2023-06-17 17:01:11,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=3960.0, ans=0.11256782822743017 2023-06-17 17:01:33,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=4020.0, ans=0.009995652173913044 2023-06-17 17:01:57,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.52 vs. limit=6.02 2023-06-17 17:02:00,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=4080.0, ans=0.04966666666666667 2023-06-17 17:02:09,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=4140.0, ans=0.3059375 2023-06-17 17:02:18,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=4140.0, ans=0.7551 2023-06-17 17:02:21,239 INFO [train.py:996] (1/4) Epoch 1, batch 700, loss[loss=0.8174, simple_loss=0.6865, pruned_loss=0.6047, over 21387.00 frames. ], tot_loss[loss=0.9115, simple_loss=0.777, pruned_loss=0.7292, over 4151179.32 frames. ], batch size: 508, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:02:24,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.078e+02 5.855e+02 9.456e+02 2.667e+03, threshold=1.171e+03, percent-clipped=39.0 2023-06-17 17:02:27,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=9.075 2023-06-17 17:02:27,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=9.075 2023-06-17 17:02:28,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=4200.0, ans=0.303125 2023-06-17 17:03:37,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=4380.0, ans=0.2562 2023-06-17 17:03:58,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=6.11 2023-06-17 17:04:05,983 INFO [train.py:996] (1/4) Epoch 1, batch 750, loss[loss=0.7321, simple_loss=0.6212, pruned_loss=0.5196, over 21940.00 frames. ], tot_loss[loss=0.871, simple_loss=0.7443, pruned_loss=0.6781, over 4180683.47 frames. ], batch size: 333, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:04:11,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=4500.0, ans=0.2890625 2023-06-17 17:04:51,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=4620.0, ans=6.155 2023-06-17 17:05:49,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=4800.0, ans=0.275 2023-06-17 17:05:51,448 INFO [train.py:996] (1/4) Epoch 1, batch 800, loss[loss=0.7137, simple_loss=0.6147, pruned_loss=0.4822, over 21609.00 frames. ], tot_loss[loss=0.8329, simple_loss=0.7133, pruned_loss=0.632, over 4199909.88 frames. 
], batch size: 414, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:05:54,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 4.402e+02 7.390e+02 1.255e+03 3.583e+03, threshold=1.478e+03, percent-clipped=27.0 2023-06-17 17:06:17,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=11.145 2023-06-17 17:06:32,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=4860.0, ans=0.04641666666666667 2023-06-17 17:06:32,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=4860.0, ans=0.04641666666666667 2023-06-17 17:06:59,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=4920.0, ans=0.7278 2023-06-17 17:07:20,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5040.0, ans=0.2496 2023-06-17 17:07:33,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=5040.0, ans=0.26375000000000004 2023-06-17 17:07:33,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=11.28 2023-06-17 17:07:35,938 INFO [train.py:996] (1/4) Epoch 1, batch 850, loss[loss=0.6215, simple_loss=0.5483, pruned_loss=0.3932, over 21133.00 frames. ], tot_loss[loss=0.7966, simple_loss=0.6841, pruned_loss=0.5893, over 4219117.11 frames. ], batch size: 143, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:07:51,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=5160.0, ans=0.258125 2023-06-17 17:08:04,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. limit=3.774 2023-06-17 17:08:48,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=5280.0, ans=0.2525 2023-06-17 17:08:59,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=5280.0, ans=0.2525 2023-06-17 17:09:01,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=11.46 2023-06-17 17:09:11,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=9.5025 2023-06-17 17:09:19,818 INFO [train.py:996] (1/4) Epoch 1, batch 900, loss[loss=0.5974, simple_loss=0.5282, pruned_loss=0.3724, over 21248.00 frames. ], tot_loss[loss=0.7558, simple_loss=0.6515, pruned_loss=0.5454, over 4219247.72 frames. ], batch size: 159, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:09:23,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 4.491e+02 8.246e+02 1.178e+03 2.944e+03, threshold=1.649e+03, percent-clipped=18.0 2023-06-17 17:09:30,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. 
limit=11.55 2023-06-17 17:09:33,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=5400.0, ans=0.246875 2023-06-17 17:09:40,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=5460.0, ans=0.07 2023-06-17 17:10:29,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=6.208 2023-06-17 17:10:33,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=5580.0, ans=0.23843750000000002 2023-06-17 17:11:05,164 INFO [train.py:996] (1/4) Epoch 1, batch 950, loss[loss=0.6752, simple_loss=0.5937, pruned_loss=0.4218, over 21842.00 frames. ], tot_loss[loss=0.7246, simple_loss=0.6269, pruned_loss=0.5099, over 4238255.95 frames. ], batch size: 118, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:11:58,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=5820.0, ans=0.2271875 2023-06-17 17:12:37,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=9.7275 2023-06-17 17:12:43,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=6000.0, ans=0.6900000000000001 2023-06-17 17:12:44,943 INFO [train.py:996] (1/4) Epoch 1, batch 1000, loss[loss=0.803, simple_loss=0.6772, pruned_loss=0.5365, over 21456.00 frames. ], tot_loss[loss=0.7041, simple_loss=0.6113, pruned_loss=0.4836, over 4254357.73 frames. ], batch size: 471, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:12:50,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 4.573e+02 9.444e+02 1.523e+03 4.461e+03, threshold=1.889e+03, percent-clipped=19.0 2023-06-17 17:13:13,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=9.7725 2023-06-17 17:13:43,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6120.0, ans=0.23879999999999998 2023-06-17 17:14:00,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=6.4719999999999995 2023-06-17 17:14:03,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6180.0, ans=0.2382 2023-06-17 17:14:10,224 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.520e-03 2023-06-17 17:14:22,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=6240.0, ans=0.20750000000000002 2023-06-17 17:14:22,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=6240.0, ans=0.20750000000000002 2023-06-17 17:14:30,015 INFO [train.py:996] (1/4) Epoch 1, batch 1050, loss[loss=0.5809, simple_loss=0.52, pruned_loss=0.3447, over 21813.00 frames. ], tot_loss[loss=0.6845, simple_loss=0.5968, pruned_loss=0.4592, over 4269691.74 frames. 
], batch size: 282, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:14:30,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=6300.0, ans=0.20468750000000002 2023-06-17 17:15:08,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=6360.0, ans=0.030125000000000002 2023-06-17 17:15:13,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=9.885 2023-06-17 17:15:23,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=8.18 2023-06-17 17:15:24,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=6420.0, ans=0.19906249999999998 2023-06-17 17:15:53,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6480.0, ans=0.2352 2023-06-17 17:16:08,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=6540.0, ans=0.03941666666666667 2023-06-17 17:16:09,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6540.0, ans=0.23459999999999998 2023-06-17 17:16:11,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=9.9525 2023-06-17 17:16:11,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=6.616 2023-06-17 17:16:19,596 INFO [train.py:996] (1/4) Epoch 1, batch 1100, loss[loss=0.5197, simple_loss=0.4832, pruned_loss=0.2847, over 21274.00 frames. ], tot_loss[loss=0.6645, simple_loss=0.5824, pruned_loss=0.4353, over 4272846.70 frames. ], batch size: 143, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:16:20,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=6600.0, ans=0.190625 2023-06-17 17:16:24,915 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.653e+02 4.618e+02 6.760e+02 9.652e+02 3.048e+03, threshold=1.352e+03, percent-clipped=4.0 2023-06-17 17:17:28,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=6780.0, ans=0.6627000000000001 2023-06-17 17:18:16,846 INFO [train.py:996] (1/4) Epoch 1, batch 1150, loss[loss=0.5948, simple_loss=0.553, pruned_loss=0.3252, over 21689.00 frames. ], tot_loss[loss=0.6479, simple_loss=0.5709, pruned_loss=0.4148, over 4280846.78 frames. ], batch size: 298, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:18:22,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=6900.0, ans=0.03791666666666667 2023-06-17 17:18:39,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=6.784 2023-06-17 17:18:42,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.16 vs. 
limit=10.11 2023-06-17 17:20:04,094 INFO [train.py:996] (1/4) Epoch 1, batch 1200, loss[loss=0.6423, simple_loss=0.5614, pruned_loss=0.3895, over 21842.00 frames. ], tot_loss[loss=0.6405, simple_loss=0.5663, pruned_loss=0.4025, over 4283958.66 frames. ], batch size: 351, lr: 4.47e-02, grad_scale: 16.0 2023-06-17 17:20:04,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=7200.0, ans=0.0 2023-06-17 17:20:09,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.949e+02 7.827e+02 1.470e+03 3.073e+03, threshold=1.565e+03, percent-clipped=26.0 2023-06-17 17:20:40,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=10.2225 2023-06-17 17:20:41,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=6.83 2023-06-17 17:21:12,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7380.0, ans=0.1540625 2023-06-17 17:21:44,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=7440.0, ans=0.15125 2023-06-17 17:21:50,136 INFO [train.py:996] (1/4) Epoch 1, batch 1250, loss[loss=0.5131, simple_loss=0.4654, pruned_loss=0.2915, over 21342.00 frames. ], tot_loss[loss=0.6339, simple_loss=0.5621, pruned_loss=0.3919, over 4286512.51 frames. ], batch size: 159, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:21:50,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=7500.0, ans=0.6375 2023-06-17 17:23:25,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7740.0, ans=0.22260000000000002 2023-06-17 17:23:27,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=10.4025 2023-06-17 17:23:33,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=7800.0, ans=0.009173913043478261 2023-06-17 17:23:34,680 INFO [train.py:996] (1/4) Epoch 1, batch 1300, loss[loss=0.5687, simple_loss=0.5146, pruned_loss=0.3231, over 21709.00 frames. ], tot_loss[loss=0.621, simple_loss=0.5533, pruned_loss=0.3773, over 4286766.60 frames. ], batch size: 230, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:23:46,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 6.355e+02 9.383e+02 1.437e+03 4.251e+03, threshold=1.877e+03, percent-clipped=19.0 2023-06-17 17:23:50,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=7800.0, ans=0.222 2023-06-17 17:23:52,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=71.00 vs. 
limit=10.425 2023-06-17 17:24:13,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=7920.0, ans=0.12874999999999998 2023-06-17 17:24:18,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=13.440000000000001 2023-06-17 17:24:31,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=7980.0, ans=0.12593749999999998 2023-06-17 17:24:32,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=7980.0, ans=0.12593749999999998 2023-06-17 17:25:18,167 INFO [train.py:996] (1/4) Epoch 1, batch 1350, loss[loss=0.5583, simple_loss=0.5113, pruned_loss=0.3103, over 21929.00 frames. ], tot_loss[loss=0.6094, simple_loss=0.5453, pruned_loss=0.3646, over 4285289.50 frames. ], batch size: 113, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:25:31,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=8100.0, ans=0.0 2023-06-17 17:25:42,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=8160.0, ans=0.125 2023-06-17 17:25:52,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=8160.0, ans=0.009095652173913043 2023-06-17 17:26:00,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=8220.0, ans=0.009082608695652174 2023-06-17 17:27:04,252 INFO [train.py:996] (1/4) Epoch 1, batch 1400, loss[loss=0.5381, simple_loss=0.4731, pruned_loss=0.3157, over 21363.00 frames. ], tot_loss[loss=0.5952, simple_loss=0.535, pruned_loss=0.351, over 4289000.52 frames. ], batch size: 473, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:27:16,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.966e+02 8.167e+02 1.163e+03 2.690e+03, threshold=1.633e+03, percent-clipped=5.0 2023-06-17 17:28:15,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=8580.0, ans=0.125 2023-06-17 17:28:16,025 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=8.381e-03 2023-06-17 17:28:16,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=8580.0, ans=0.125 2023-06-17 17:28:33,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=10.74 2023-06-17 17:28:38,275 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:28:47,813 INFO [train.py:996] (1/4) Epoch 1, batch 1450, loss[loss=0.5714, simple_loss=0.5361, pruned_loss=0.3052, over 21840.00 frames. ], tot_loss[loss=0.5898, simple_loss=0.5313, pruned_loss=0.3441, over 4282014.61 frames. 
], batch size: 351, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:29:13,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=8760.0, ans=0.2124 2023-06-17 17:29:16,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=8760.0, ans=0.008965217391304348 2023-06-17 17:29:47,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=8880.0, ans=0.125 2023-06-17 17:30:15,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=8940.0, ans=0.0 2023-06-17 17:30:31,052 INFO [train.py:996] (1/4) Epoch 1, batch 1500, loss[loss=0.5218, simple_loss=0.471, pruned_loss=0.2934, over 21659.00 frames. ], tot_loss[loss=0.5855, simple_loss=0.5287, pruned_loss=0.3382, over 4289137.52 frames. ], batch size: 415, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:30:42,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.683e+02 9.412e+02 1.321e+03 2.952e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-17 17:31:25,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=9120.0, ans=0.125 2023-06-17 17:31:49,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=9180.0, ans=0.125 2023-06-17 17:31:50,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=9180.0, ans=0.02841666666666667 2023-06-17 17:31:50,721 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:31:55,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=4.377 2023-06-17 17:32:07,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=9240.0, ans=0.02816666666666667 2023-06-17 17:32:17,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=9240.0, ans=0.5766 2023-06-17 17:32:22,521 INFO [train.py:996] (1/4) Epoch 1, batch 1550, loss[loss=0.5787, simple_loss=0.5466, pruned_loss=0.3059, over 21804.00 frames. ], tot_loss[loss=0.5725, simple_loss=0.5203, pruned_loss=0.3262, over 4277140.47 frames. ], batch size: 371, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:32:40,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9360.0, ans=0.125 2023-06-17 17:32:43,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=9360.0, ans=0.02766666666666667 2023-06-17 17:32:45,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=9360.0, ans=11.01 2023-06-17 17:32:47,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. 
limit=11.01 2023-06-17 17:32:50,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=9360.0, ans=0.008834782608695652 2023-06-17 17:33:22,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9420.0, ans=0.125 2023-06-17 17:33:29,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=9480.0, ans=0.125 2023-06-17 17:33:34,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=9480.0, ans=0.02716666666666667 2023-06-17 17:33:52,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9540.0, ans=0.008795652173913043 2023-06-17 17:34:09,485 INFO [train.py:996] (1/4) Epoch 1, batch 1600, loss[loss=0.4749, simple_loss=0.447, pruned_loss=0.252, over 21720.00 frames. ], tot_loss[loss=0.5653, simple_loss=0.5153, pruned_loss=0.3191, over 4272899.55 frames. ], batch size: 124, lr: 4.45e-02, grad_scale: 16.0 2023-06-17 17:34:15,868 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.713e+02 5.768e+02 7.778e+02 1.283e+03 4.290e+03, threshold=1.556e+03, percent-clipped=12.0 2023-06-17 17:34:16,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=9600.0, ans=0.344 2023-06-17 17:34:43,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=9660.0, ans=0.125 2023-06-17 17:35:20,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=9780.0, ans=0.5577000000000001 2023-06-17 17:35:45,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=9840.0, ans=0.125 2023-06-17 17:35:53,561 INFO [train.py:996] (1/4) Epoch 1, batch 1650, loss[loss=0.5133, simple_loss=0.4946, pruned_loss=0.2644, over 21780.00 frames. ], tot_loss[loss=0.5539, simple_loss=0.5088, pruned_loss=0.3085, over 4276843.49 frames. ], batch size: 391, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:36:03,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9900.0, ans=0.125 2023-06-17 17:37:14,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10080.0, ans=0.0 2023-06-17 17:37:31,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=11.3025 2023-06-17 17:37:33,249 INFO [train.py:996] (1/4) Epoch 1, batch 1700, loss[loss=0.5586, simple_loss=0.536, pruned_loss=0.2895, over 21752.00 frames. ], tot_loss[loss=0.5562, simple_loss=0.5119, pruned_loss=0.3077, over 4278274.19 frames. 
], batch size: 332, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:37:41,957 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.849e+02 8.667e+02 1.230e+03 2.717e+03, threshold=1.733e+03, percent-clipped=16.0 2023-06-17 17:38:40,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=10320.0, ans=0.05 2023-06-17 17:38:46,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=10380.0, ans=0.02341666666666667 2023-06-17 17:39:01,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=10440.0, ans=0.125 2023-06-17 17:39:01,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=10440.0, ans=0.0 2023-06-17 17:39:19,718 INFO [train.py:996] (1/4) Epoch 1, batch 1750, loss[loss=0.4145, simple_loss=0.4254, pruned_loss=0.1974, over 21789.00 frames. ], tot_loss[loss=0.5482, simple_loss=0.5092, pruned_loss=0.2992, over 4265742.73 frames. ], batch size: 316, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:39:43,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10500.0, ans=0.195 2023-06-17 17:40:06,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=10560.0, ans=0.125 2023-06-17 17:40:34,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=10680.0, ans=0.125 2023-06-17 17:40:48,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10740.0, ans=0.1926 2023-06-17 17:41:18,134 INFO [train.py:996] (1/4) Epoch 1, batch 1800, loss[loss=0.4727, simple_loss=0.4873, pruned_loss=0.225, over 21714.00 frames. ], tot_loss[loss=0.5345, simple_loss=0.5008, pruned_loss=0.2882, over 4265154.14 frames. ], batch size: 332, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:41:26,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 4.676e+02 7.112e+02 1.184e+03 2.740e+03, threshold=1.422e+03, percent-clipped=6.0 2023-06-17 17:41:51,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=10860.0, ans=0.5199 2023-06-17 17:42:00,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=10920.0, ans=0.125 2023-06-17 17:42:12,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=10980.0, ans=0.125 2023-06-17 17:42:15,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=10980.0, ans=0.19019999999999998 2023-06-17 17:43:02,028 INFO [train.py:996] (1/4) Epoch 1, batch 1850, loss[loss=0.5266, simple_loss=0.4976, pruned_loss=0.2779, over 21808.00 frames. ], tot_loss[loss=0.529, simple_loss=0.5004, pruned_loss=0.2817, over 4274072.01 frames. 
], batch size: 414, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:43:19,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=11160.0, ans=0.125 2023-06-17 17:43:36,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=11160.0, ans=0.02016666666666667 2023-06-17 17:43:40,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. limit=4.683 2023-06-17 17:43:44,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=11.7075 2023-06-17 17:43:53,650 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:44:04,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=11280.0, ans=0.008417391304347826 2023-06-17 17:44:28,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=11340.0, ans=0.01941666666666667 2023-06-17 17:44:39,730 INFO [train.py:996] (1/4) Epoch 1, batch 1900, loss[loss=0.5019, simple_loss=0.4727, pruned_loss=0.2657, over 21802.00 frames. ], tot_loss[loss=0.5222, simple_loss=0.4961, pruned_loss=0.2763, over 4275983.84 frames. ], batch size: 371, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:44:47,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.687e+02 6.940e+02 1.118e+03 3.518e+03, threshold=1.388e+03, percent-clipped=15.0 2023-06-17 17:44:48,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11400.0, ans=0.125 2023-06-17 17:45:02,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=11460.0, ans=0.125 2023-06-17 17:45:26,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=11520.0, ans=0.0 2023-06-17 17:45:37,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=11580.0, ans=0.01841666666666667 2023-06-17 17:45:37,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=11580.0, ans=0.04949747468305833 2023-06-17 17:46:22,593 INFO [train.py:996] (1/4) Epoch 1, batch 1950, loss[loss=0.4531, simple_loss=0.4341, pruned_loss=0.236, over 21644.00 frames. ], tot_loss[loss=0.5122, simple_loss=0.4862, pruned_loss=0.2708, over 4280002.93 frames. ], batch size: 298, lr: 4.43e-02, grad_scale: 4.0 2023-06-17 17:46:30,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=11700.0, ans=0.49050000000000005 2023-06-17 17:46:33,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.40 vs. 
limit=4.755 2023-06-17 17:46:45,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=11760.0, ans=0.125 2023-06-17 17:46:58,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=16.32 2023-06-17 17:47:08,873 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=1.921e-02 2023-06-17 17:47:50,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=16.455 2023-06-17 17:48:00,874 INFO [train.py:996] (1/4) Epoch 1, batch 2000, loss[loss=0.4381, simple_loss=0.4102, pruned_loss=0.233, over 20169.00 frames. ], tot_loss[loss=0.4927, simple_loss=0.4718, pruned_loss=0.258, over 4268574.60 frames. ], batch size: 702, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:48:15,558 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.181e+02 5.337e+02 7.170e+02 1.145e+03 2.393e+03, threshold=1.434e+03, percent-clipped=15.0 2023-06-17 17:48:19,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=12000.0, ans=0.00826086956521739 2023-06-17 17:48:22,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=12060.0, ans=0.47790000000000005 2023-06-17 17:49:23,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=4.836 2023-06-17 17:49:37,865 INFO [train.py:996] (1/4) Epoch 1, batch 2050, loss[loss=0.4736, simple_loss=0.4584, pruned_loss=0.2444, over 21327.00 frames. ], tot_loss[loss=0.497, simple_loss=0.4769, pruned_loss=0.2595, over 4265290.36 frames. ], batch size: 159, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:50:12,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=12360.0, ans=0.015166666666666669 2023-06-17 17:50:56,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=12480.0, ans=0.1752 2023-06-17 17:51:10,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=12540.0, ans=0.3881 2023-06-17 17:51:15,800 INFO [train.py:996] (1/4) Epoch 1, batch 2100, loss[loss=0.4755, simple_loss=0.4635, pruned_loss=0.2437, over 21771.00 frames. ], tot_loss[loss=0.4995, simple_loss=0.4805, pruned_loss=0.26, over 4276735.86 frames. 
], batch size: 316, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:51:28,312 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:51:31,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.912e+02 5.111e+02 7.540e+02 1.226e+03 2.396e+03, threshold=1.508e+03, percent-clipped=15.0 2023-06-17 17:51:38,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=12660.0, ans=0.125 2023-06-17 17:51:48,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=12660.0, ans=0.0 2023-06-17 17:52:48,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.315000000000001 2023-06-17 17:53:00,952 INFO [train.py:996] (1/4) Epoch 1, batch 2150, loss[loss=0.6286, simple_loss=0.578, pruned_loss=0.3396, over 21363.00 frames. ], tot_loss[loss=0.4998, simple_loss=0.4807, pruned_loss=0.26, over 4265807.79 frames. ], batch size: 507, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:53:01,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=12900.0, ans=0.012916666666666667 2023-06-17 17:53:10,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.72 vs. limit=4.9350000000000005 2023-06-17 17:53:19,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=12900.0, ans=0.07 2023-06-17 17:53:31,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=12960.0, ans=0.125 2023-06-17 17:54:05,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=13080.0, ans=0.1192 2023-06-17 17:54:24,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13080.0, ans=0.125 2023-06-17 17:54:44,022 INFO [train.py:996] (1/4) Epoch 1, batch 2200, loss[loss=0.4538, simple_loss=0.4334, pruned_loss=0.2371, over 21436.00 frames. ], tot_loss[loss=0.4983, simple_loss=0.4822, pruned_loss=0.2577, over 4266095.19 frames. ], batch size: 195, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:54:59,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 5.225e+02 6.882e+02 1.154e+03 2.681e+03, threshold=1.376e+03, percent-clipped=19.0 2023-06-17 17:55:01,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=13200.0, ans=0.125 2023-06-17 17:55:07,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=13260.0, ans=0.1674 2023-06-17 17:55:27,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=13320.0, ans=0.011166666666666672 2023-06-17 17:56:18,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. 
limit=12.54 2023-06-17 17:56:23,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=9.376000000000001 2023-06-17 17:56:24,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=13440.0, ans=0.125 2023-06-17 17:56:34,497 INFO [train.py:996] (1/4) Epoch 1, batch 2250, loss[loss=0.4945, simple_loss=0.5193, pruned_loss=0.2349, over 21167.00 frames. ], tot_loss[loss=0.4872, simple_loss=0.4762, pruned_loss=0.2495, over 4269878.49 frames. ], batch size: 548, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:56:50,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.19 vs. limit=17.67 2023-06-17 17:57:03,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.91 vs. limit=8.39 2023-06-17 17:57:30,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=13680.0, ans=0.125 2023-06-17 17:58:18,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=8.45 2023-06-17 17:58:19,041 INFO [train.py:996] (1/4) Epoch 1, batch 2300, loss[loss=0.5416, simple_loss=0.5001, pruned_loss=0.2915, over 21563.00 frames. ], tot_loss[loss=0.4791, simple_loss=0.4675, pruned_loss=0.2456, over 4259112.94 frames. ], batch size: 441, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:58:19,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13800.0, ans=0.0 2023-06-17 17:58:25,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.66 vs. limit=9.52 2023-06-17 17:58:29,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 5.278e+02 8.077e+02 1.161e+03 3.244e+03, threshold=1.615e+03, percent-clipped=15.0 2023-06-17 17:58:36,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=13860.0, ans=0.007856521739130436 2023-06-17 17:58:41,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=13860.0, ans=0.0 2023-06-17 17:58:46,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=13860.0, ans=0.007856521739130436 2023-06-17 17:59:11,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=13920.0, ans=0.1608 2023-06-17 18:00:03,804 INFO [train.py:996] (1/4) Epoch 1, batch 2350, loss[loss=0.4602, simple_loss=0.4457, pruned_loss=0.2373, over 21446.00 frames. ], tot_loss[loss=0.4755, simple_loss=0.4647, pruned_loss=0.2434, over 4258088.71 frames. ], batch size: 389, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:01:12,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. 
limit=5.1419999999999995 2023-06-17 18:01:20,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=14280.0, ans=0.007765217391304348 2023-06-17 18:01:49,070 INFO [train.py:996] (1/4) Epoch 1, batch 2400, loss[loss=0.4432, simple_loss=0.4686, pruned_loss=0.2088, over 21603.00 frames. ], tot_loss[loss=0.4821, simple_loss=0.4712, pruned_loss=0.2466, over 4266197.99 frames. ], batch size: 263, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:01:57,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=14400.0, ans=0.396 2023-06-17 18:01:59,363 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 4.626e+02 8.072e+02 1.275e+03 2.674e+03, threshold=1.614e+03, percent-clipped=13.0 2023-06-17 18:02:16,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=14460.0, ans=0.125 2023-06-17 18:02:27,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=14520.0, ans=0.007713043478260869 2023-06-17 18:02:29,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=14520.0, ans=0.125 2023-06-17 18:03:04,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:03:22,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=14640.0, ans=0.38760000000000006 2023-06-17 18:03:34,290 INFO [train.py:996] (1/4) Epoch 1, batch 2450, loss[loss=0.4255, simple_loss=0.4053, pruned_loss=0.2229, over 21517.00 frames. ], tot_loss[loss=0.4879, simple_loss=0.4766, pruned_loss=0.2497, over 4271335.38 frames. ], batch size: 442, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:03:36,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:03:39,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:03:41,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:05:04,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.17 vs. limit=12.469999999999999 2023-06-17 18:05:16,571 INFO [train.py:996] (1/4) Epoch 1, batch 2500, loss[loss=0.4029, simple_loss=0.4282, pruned_loss=0.1888, over 21240.00 frames. ], tot_loss[loss=0.4784, simple_loss=0.4685, pruned_loss=0.2442, over 4267198.35 frames. ], batch size: 159, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:05:24,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=15000.0, ans=0.125 2023-06-17 18:05:28,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.577e+02 4.881e+02 6.609e+02 9.679e+02 1.963e+03, threshold=1.322e+03, percent-clipped=4.0 2023-06-17 18:07:00,747 INFO [train.py:996] (1/4) Epoch 1, batch 2550, loss[loss=0.4215, simple_loss=0.4227, pruned_loss=0.2102, over 21163.00 frames. ], tot_loss[loss=0.4717, simple_loss=0.4643, pruned_loss=0.2396, over 4263646.06 frames. 
], batch size: 176, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:07:05,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=13.2375 2023-06-17 18:07:05,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15300.0, ans=0.14700000000000002 2023-06-17 18:07:31,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=15420.0, ans=0.125 2023-06-17 18:07:47,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=15420.0, ans=0.125 2023-06-17 18:07:52,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=15420.0, ans=0.125 2023-06-17 18:07:54,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=15480.0, ans=0.125 2023-06-17 18:08:29,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=15540.0, ans=0.125 2023-06-17 18:08:33,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=15540.0, ans=8.885 2023-06-17 18:08:38,594 INFO [train.py:996] (1/4) Epoch 1, batch 2600, loss[loss=0.4565, simple_loss=0.459, pruned_loss=0.227, over 21997.00 frames. ], tot_loss[loss=0.4747, simple_loss=0.467, pruned_loss=0.2412, over 4269191.40 frames. ], batch size: 317, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:08:50,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.934e+02 4.629e+02 7.030e+02 1.078e+03 2.784e+03, threshold=1.406e+03, percent-clipped=16.0 2023-06-17 18:09:03,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=15660.0, ans=0.0014166666666666702 2023-06-17 18:09:47,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=7.156000000000001 2023-06-17 18:09:59,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=15780.0, ans=0.125 2023-06-17 18:10:24,599 INFO [train.py:996] (1/4) Epoch 1, batch 2650, loss[loss=0.4444, simple_loss=0.4324, pruned_loss=0.2282, over 21260.00 frames. ], tot_loss[loss=0.4739, simple_loss=0.4657, pruned_loss=0.2411, over 4273825.83 frames. 
], batch size: 548, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:10:29,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=15900.0, ans=0.00041666666666666935 2023-06-17 18:10:35,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15900.0, ans=0.14100000000000001 2023-06-17 18:11:32,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=16080.0, ans=0.125 2023-06-17 18:11:51,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=16080.0, ans=0.125 2023-06-17 18:11:52,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=16140.0, ans=0.33510000000000006 2023-06-17 18:12:10,375 INFO [train.py:996] (1/4) Epoch 1, batch 2700, loss[loss=0.3752, simple_loss=0.3999, pruned_loss=0.1752, over 21808.00 frames. ], tot_loss[loss=0.4656, simple_loss=0.4611, pruned_loss=0.2351, over 4276131.94 frames. ], batch size: 316, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:12:18,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=16200.0, ans=0.0 2023-06-17 18:12:21,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.712e+02 4.286e+02 6.579e+02 1.091e+03 3.152e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-17 18:13:22,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=13.6425 2023-06-17 18:13:54,403 INFO [train.py:996] (1/4) Epoch 1, batch 2750, loss[loss=0.4784, simple_loss=0.4763, pruned_loss=0.2402, over 21827.00 frames. ], tot_loss[loss=0.4653, simple_loss=0.4613, pruned_loss=0.2347, over 4277887.71 frames. ], batch size: 107, lr: 4.36e-02, grad_scale: 4.0 2023-06-17 18:13:55,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.45 vs. limit=9.125 2023-06-17 18:15:34,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=9.2 2023-06-17 18:15:35,669 INFO [train.py:996] (1/4) Epoch 1, batch 2800, loss[loss=0.4387, simple_loss=0.4195, pruned_loss=0.2289, over 21304.00 frames. ], tot_loss[loss=0.4671, simple_loss=0.4643, pruned_loss=0.2349, over 4280952.38 frames. ], batch size: 551, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:15:38,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.31 vs. 
limit=20.1 2023-06-17 18:15:39,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=16800.0, ans=0.125 2023-06-17 18:15:42,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=16800.0, ans=0.125 2023-06-17 18:15:59,815 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.894e+02 6.832e+02 1.003e+03 4.773e+03, threshold=1.366e+03, percent-clipped=15.0 2023-06-17 18:16:11,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=16860.0, ans=0.125 2023-06-17 18:16:28,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.20 vs. limit=5.538 2023-06-17 18:16:35,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=16920.0, ans=0.0 2023-06-17 18:17:20,313 INFO [train.py:996] (1/4) Epoch 1, batch 2850, loss[loss=0.366, simple_loss=0.3909, pruned_loss=0.1705, over 21748.00 frames. ], tot_loss[loss=0.4628, simple_loss=0.4619, pruned_loss=0.2319, over 4277270.54 frames. ], batch size: 282, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:17:21,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=9.275 2023-06-17 18:17:25,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=17100.0, ans=0.0 2023-06-17 18:17:52,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=17160.0, ans=10.0 2023-06-17 18:17:55,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17160.0, ans=0.12840000000000001 2023-06-17 18:18:39,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=20.46 2023-06-17 18:18:59,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=17340.0, ans=0.05 2023-06-17 18:19:03,953 INFO [train.py:996] (1/4) Epoch 1, batch 2900, loss[loss=0.4236, simple_loss=0.4213, pruned_loss=0.213, over 21820.00 frames. ], tot_loss[loss=0.457, simple_loss=0.4567, pruned_loss=0.2287, over 4280145.13 frames. ], batch size: 282, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:19:15,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.36 vs. 
limit=5.609999999999999 2023-06-17 18:19:28,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 4.517e+02 6.306e+02 8.812e+02 1.788e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-17 18:20:01,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=17520.0, ans=0.125 2023-06-17 18:20:13,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17520.0, ans=0.12480000000000002 2023-06-17 18:20:54,496 INFO [train.py:996] (1/4) Epoch 1, batch 2950, loss[loss=0.4855, simple_loss=0.51, pruned_loss=0.2305, over 21715.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4559, pruned_loss=0.2263, over 4284218.33 frames. ], batch size: 414, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:21:35,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=17760.0, ans=14.16 2023-06-17 18:22:07,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17880.0, ans=0.125 2023-06-17 18:22:13,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=17880.0, ans=20.91 2023-06-17 18:22:44,537 INFO [train.py:996] (1/4) Epoch 1, batch 3000, loss[loss=0.4921, simple_loss=0.4951, pruned_loss=0.2446, over 21481.00 frames. ], tot_loss[loss=0.458, simple_loss=0.4607, pruned_loss=0.2277, over 4288500.37 frames. ], batch size: 131, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:44,537 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-17 18:23:01,479 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3658, simple_loss=0.4363, pruned_loss=0.1476, over 1796401.00 frames. 2023-06-17 18:23:01,479 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24375MB 2023-06-17 18:23:06,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=18000.0, ans=0.125 2023-06-17 18:23:20,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 5.025e+02 6.573e+02 9.808e+02 2.550e+03, threshold=1.315e+03, percent-clipped=11.0 2023-06-17 18:23:21,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=18000.0, ans=0.0 2023-06-17 18:23:22,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=18060.0, ans=0.125 2023-06-17 18:23:39,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.69 vs. limit=21.045 2023-06-17 18:24:04,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=18180.0, ans=0.125 2023-06-17 18:24:21,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=14.09 2023-06-17 18:24:23,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. 
limit=5.736000000000001 2023-06-17 18:24:45,850 INFO [train.py:996] (1/4) Epoch 1, batch 3050, loss[loss=0.4683, simple_loss=0.473, pruned_loss=0.2318, over 21692.00 frames. ], tot_loss[loss=0.453, simple_loss=0.46, pruned_loss=0.2229, over 4288040.39 frames. ], batch size: 389, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:25:15,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=18360.0, ans=0.006878260869565217 2023-06-17 18:25:15,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=18360.0, ans=0.125 2023-06-17 18:25:29,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=18420.0, ans=0.125 2023-06-17 18:25:35,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=18420.0, ans=0.11580000000000001 2023-06-17 18:25:38,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=18420.0, ans=0.25529999999999997 2023-06-17 18:26:35,502 INFO [train.py:996] (1/4) Epoch 1, batch 3100, loss[loss=0.3833, simple_loss=0.4103, pruned_loss=0.1781, over 21291.00 frames. ], tot_loss[loss=0.4471, simple_loss=0.4563, pruned_loss=0.219, over 4282437.70 frames. ], batch size: 159, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:26:53,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.845e+02 4.821e+02 6.301e+02 1.043e+03 2.318e+03, threshold=1.260e+03, percent-clipped=14.0 2023-06-17 18:27:15,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18720.0, ans=0.0 2023-06-17 18:28:20,377 INFO [train.py:996] (1/4) Epoch 1, batch 3150, loss[loss=0.4509, simple_loss=0.4474, pruned_loss=0.2272, over 21638.00 frames. ], tot_loss[loss=0.4485, simple_loss=0.4578, pruned_loss=0.2196, over 4281007.14 frames. ], batch size: 263, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 18:29:24,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=19020.0, ans=0.125 2023-06-17 18:29:42,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19080.0, ans=0.10920000000000002 2023-06-17 18:30:06,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19140.0, ans=0.1086 2023-06-17 18:30:11,511 INFO [train.py:996] (1/4) Epoch 1, batch 3200, loss[loss=0.4868, simple_loss=0.4862, pruned_loss=0.2437, over 21926.00 frames. ], tot_loss[loss=0.4499, simple_loss=0.46, pruned_loss=0.2199, over 4276965.16 frames. 
], batch size: 372, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 18:30:12,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=19200.0, ans=0.02 2023-06-17 18:30:24,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 4.999e+02 6.065e+02 1.040e+03 2.031e+03, threshold=1.213e+03, percent-clipped=14.0 2023-06-17 18:31:37,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19440.0, ans=0.125 2023-06-17 18:31:53,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=14.79 2023-06-17 18:31:55,108 INFO [train.py:996] (1/4) Epoch 1, batch 3250, loss[loss=0.5198, simple_loss=0.5124, pruned_loss=0.2636, over 21777.00 frames. ], tot_loss[loss=0.4547, simple_loss=0.462, pruned_loss=0.2237, over 4272592.18 frames. ], batch size: 441, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:32:07,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=19500.0, ans=0.10500000000000001 2023-06-17 18:32:11,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=19560.0, ans=0.21540000000000004 2023-06-17 18:33:39,997 INFO [train.py:996] (1/4) Epoch 1, batch 3300, loss[loss=0.3907, simple_loss=0.4401, pruned_loss=0.1707, over 21758.00 frames. ], tot_loss[loss=0.4486, simple_loss=0.456, pruned_loss=0.2207, over 4280077.48 frames. ], batch size: 351, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:34:06,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 4.541e+02 6.764e+02 1.015e+03 2.529e+03, threshold=1.353e+03, percent-clipped=14.0 2023-06-17 18:34:57,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=14.9925 2023-06-17 18:35:18,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=20040.0, ans=0.1 2023-06-17 18:35:24,216 INFO [train.py:996] (1/4) Epoch 1, batch 3350, loss[loss=0.4945, simple_loss=0.4843, pruned_loss=0.2523, over 21730.00 frames. ], tot_loss[loss=0.4516, simple_loss=0.4591, pruned_loss=0.2221, over 4282763.14 frames. ], batch size: 389, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 18:36:23,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=20220.0, ans=0.1 2023-06-17 18:37:12,824 INFO [train.py:996] (1/4) Epoch 1, batch 3400, loss[loss=0.45, simple_loss=0.4568, pruned_loss=0.2216, over 21545.00 frames. ], tot_loss[loss=0.4503, simple_loss=0.4581, pruned_loss=0.2213, over 4281364.89 frames. 
], batch size: 389, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:37:20,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=20400.0, ans=0.006434782608695653 2023-06-17 18:37:34,558 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.338e+02 6.007e+02 8.675e+02 3.027e+03, threshold=1.201e+03, percent-clipped=6.0 2023-06-17 18:38:17,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=20580.0, ans=0.125 2023-06-17 18:39:03,059 INFO [train.py:996] (1/4) Epoch 1, batch 3450, loss[loss=0.3811, simple_loss=0.3991, pruned_loss=0.1815, over 21130.00 frames. ], tot_loss[loss=0.4438, simple_loss=0.451, pruned_loss=0.2183, over 4280274.78 frames. ], batch size: 143, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:39:33,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=20760.0, ans=0.04949747468305833 2023-06-17 18:39:37,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-17 18:39:52,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=20820.0, ans=0.125 2023-06-17 18:40:13,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=20880.0, ans=0.006330434782608696 2023-06-17 18:40:47,103 INFO [train.py:996] (1/4) Epoch 1, batch 3500, loss[loss=0.4844, simple_loss=0.4792, pruned_loss=0.2448, over 21691.00 frames. ], tot_loss[loss=0.4529, simple_loss=0.4602, pruned_loss=0.2228, over 4283057.29 frames. ], batch size: 298, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 18:40:49,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21000.0, ans=0.125 2023-06-17 18:40:54,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=21000.0, ans=0.125 2023-06-17 18:41:09,661 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 4.958e+02 6.770e+02 9.160e+02 2.307e+03, threshold=1.354e+03, percent-clipped=16.0 2023-06-17 18:42:32,159 INFO [train.py:996] (1/4) Epoch 1, batch 3550, loss[loss=0.4041, simple_loss=0.4221, pruned_loss=0.193, over 21761.00 frames. ], tot_loss[loss=0.4571, simple_loss=0.4635, pruned_loss=0.2253, over 4288178.58 frames. ], batch size: 351, lr: 4.28e-02, grad_scale: 4.0 2023-06-17 18:42:32,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=21300.0, ans=0.0 2023-06-17 18:42:50,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=21300.0, ans=0.125 2023-06-17 18:43:12,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=21360.0, ans=0.006226086956521739 2023-06-17 18:43:20,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=21420.0, ans=0.2 2023-06-17 18:44:21,625 INFO [train.py:996] (1/4) Epoch 1, batch 3600, loss[loss=0.5388, simple_loss=0.5205, pruned_loss=0.2785, over 21784.00 frames. 
], tot_loss[loss=0.4538, simple_loss=0.4587, pruned_loss=0.2244, over 4285745.93 frames. ], batch size: 124, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:44:35,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=21600.0, ans=0.5 2023-06-17 18:44:38,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=21660.0, ans=6.0 2023-06-17 18:44:39,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 4.436e+02 5.716e+02 8.040e+02 1.927e+03, threshold=1.143e+03, percent-clipped=4.0 2023-06-17 18:44:41,509 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:44:56,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=21660.0, ans=0.0 2023-06-17 18:45:46,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=21840.0, ans=0.09899494936611666 2023-06-17 18:46:05,190 INFO [train.py:996] (1/4) Epoch 1, batch 3650, loss[loss=0.3359, simple_loss=0.3669, pruned_loss=0.1524, over 21301.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4599, pruned_loss=0.2243, over 4283630.61 frames. ], batch size: 159, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:46:19,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=21960.0, ans=0.1 2023-06-17 18:46:22,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=21960.0, ans=0.0 2023-06-17 18:46:56,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=22020.0, ans=0.0 2023-06-17 18:46:57,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=22080.0, ans=0.125 2023-06-17 18:47:01,284 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:47:13,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=22080.0, ans=0.2 2023-06-17 18:47:48,602 INFO [train.py:996] (1/4) Epoch 1, batch 3700, loss[loss=0.4142, simple_loss=0.415, pruned_loss=0.2067, over 21242.00 frames. ], tot_loss[loss=0.4496, simple_loss=0.4569, pruned_loss=0.2211, over 4286075.12 frames. 
], batch size: 608, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:48:06,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 4.996e+02 7.328e+02 1.013e+03 2.628e+03, threshold=1.466e+03, percent-clipped=16.0 2023-06-17 18:49:08,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=22440.0, ans=0.005991304347826087 2023-06-17 18:49:12,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=22440.0, ans=0.2 2023-06-17 18:49:19,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=22440.0, ans=0.015 2023-06-17 18:49:20,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=22440.0, ans=0.005991304347826087 2023-06-17 18:49:32,154 INFO [train.py:996] (1/4) Epoch 1, batch 3750, loss[loss=0.2865, simple_loss=0.2999, pruned_loss=0.1365, over 16670.00 frames. ], tot_loss[loss=0.4433, simple_loss=0.4511, pruned_loss=0.2178, over 4287574.72 frames. ], batch size: 60, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:49:40,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22500.0, ans=0.1 2023-06-17 18:50:32,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=22680.0, ans=0.2 2023-06-17 18:50:43,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=22680.0, ans=0.125 2023-06-17 18:50:59,879 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:51:16,276 INFO [train.py:996] (1/4) Epoch 1, batch 3800, loss[loss=0.4474, simple_loss=0.4554, pruned_loss=0.2196, over 21810.00 frames. ], tot_loss[loss=0.4396, simple_loss=0.4486, pruned_loss=0.2153, over 4281577.66 frames. ], batch size: 247, lr: 4.25e-02, grad_scale: 8.0 2023-06-17 18:51:23,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=22800.0, ans=0.125 2023-06-17 18:51:39,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.895e+02 5.418e+02 7.571e+02 2.562e+03, threshold=1.084e+03, percent-clipped=5.0 2023-06-17 18:51:54,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=22920.0, ans=0.005886956521739131 2023-06-17 18:52:41,706 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:52:57,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23100.0, ans=0.0 2023-06-17 18:52:58,984 INFO [train.py:996] (1/4) Epoch 1, batch 3850, loss[loss=0.3832, simple_loss=0.3948, pruned_loss=0.1858, over 21689.00 frames. ], tot_loss[loss=0.4386, simple_loss=0.4457, pruned_loss=0.2158, over 4283790.45 frames. 
], batch size: 333, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:53:28,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=23160.0, ans=0.125 2023-06-17 18:54:20,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=23280.0, ans=0.05 2023-06-17 18:54:35,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23340.0, ans=0.1 2023-06-17 18:54:40,119 INFO [train.py:996] (1/4) Epoch 1, batch 3900, loss[loss=0.4029, simple_loss=0.4139, pruned_loss=0.1959, over 21891.00 frames. ], tot_loss[loss=0.4331, simple_loss=0.4397, pruned_loss=0.2132, over 4271652.97 frames. ], batch size: 113, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:54:52,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=23400.0, ans=0.0 2023-06-17 18:54:59,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.722e+02 6.490e+02 9.055e+02 2.329e+03, threshold=1.298e+03, percent-clipped=15.0 2023-06-17 18:55:41,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-17 18:55:43,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=23580.0, ans=0.005743478260869565 2023-06-17 18:55:52,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=23580.0, ans=0.0 2023-06-17 18:56:02,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=23580.0, ans=0.125 2023-06-17 18:56:07,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=23640.0, ans=0.125 2023-06-17 18:56:07,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=23640.0, ans=0.125 2023-06-17 18:56:25,077 INFO [train.py:996] (1/4) Epoch 1, batch 3950, loss[loss=0.2531, simple_loss=0.276, pruned_loss=0.1151, over 17343.00 frames. ], tot_loss[loss=0.4313, simple_loss=0.44, pruned_loss=0.2113, over 4277204.32 frames. ], batch size: 60, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 18:56:51,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=23760.0, ans=0.1 2023-06-17 18:56:56,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=23760.0, ans=0.005704347826086957 2023-06-17 18:57:17,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=23820.0, ans=0.125 2023-06-17 18:57:20,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=12.0 2023-06-17 18:58:09,730 INFO [train.py:996] (1/4) Epoch 1, batch 4000, loss[loss=0.392, simple_loss=0.3959, pruned_loss=0.1941, over 21737.00 frames. ], tot_loss[loss=0.4196, simple_loss=0.4302, pruned_loss=0.2045, over 4274472.72 frames. 
], batch size: 300, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 18:58:18,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=24000.0, ans=0.125 2023-06-17 18:58:33,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.808e+02 4.109e+02 5.052e+02 7.332e+02 1.857e+03, threshold=1.010e+03, percent-clipped=6.0 2023-06-17 18:58:34,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.99 vs. limit=22.5 2023-06-17 18:58:36,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-17 18:59:52,451 INFO [train.py:996] (1/4) Epoch 1, batch 4050, loss[loss=0.5387, simple_loss=0.5153, pruned_loss=0.2811, over 21621.00 frames. ], tot_loss[loss=0.4147, simple_loss=0.4277, pruned_loss=0.2009, over 4282467.24 frames. ], batch size: 507, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 18:59:52,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=24300.0, ans=0.125 2023-06-17 19:00:00,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=24300.0, ans=0.2 2023-06-17 19:00:04,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=24300.0, ans=0.2 2023-06-17 19:00:12,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=24360.0, ans=0.125 2023-06-17 19:00:20,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=24360.0, ans=0.1 2023-06-17 19:01:08,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.86 vs. limit=22.5 2023-06-17 19:01:35,938 INFO [train.py:996] (1/4) Epoch 1, batch 4100, loss[loss=0.3571, simple_loss=0.3848, pruned_loss=0.1647, over 21286.00 frames. ], tot_loss[loss=0.4193, simple_loss=0.4313, pruned_loss=0.2036, over 4289032.03 frames. ], batch size: 143, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:02:00,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 4.142e+02 6.350e+02 1.020e+03 2.376e+03, threshold=1.270e+03, percent-clipped=25.0 2023-06-17 19:02:53,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-17 19:03:10,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-17 19:03:19,135 INFO [train.py:996] (1/4) Epoch 1, batch 4150, loss[loss=0.3626, simple_loss=0.4005, pruned_loss=0.1624, over 21726.00 frames. ], tot_loss[loss=0.41, simple_loss=0.4286, pruned_loss=0.1957, over 4288169.13 frames. ], batch size: 282, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:03:19,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=12.0 2023-06-17 19:04:30,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:05:05,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=25140.0, ans=0.125 2023-06-17 19:05:09,984 INFO [train.py:996] (1/4) Epoch 1, batch 4200, loss[loss=0.4037, simple_loss=0.4307, pruned_loss=0.1884, over 21674.00 frames. ], tot_loss[loss=0.4077, simple_loss=0.4275, pruned_loss=0.194, over 4272811.58 frames. ], batch size: 298, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:05:38,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=25260.0, ans=0.0 2023-06-17 19:05:46,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 4.243e+02 5.422e+02 7.726e+02 1.559e+03, threshold=1.084e+03, percent-clipped=3.0 2023-06-17 19:05:55,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-17 19:06:10,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=25320.0, ans=0.0 2023-06-17 19:06:46,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=25440.0, ans=0.125 2023-06-17 19:07:07,382 INFO [train.py:996] (1/4) Epoch 1, batch 4250, loss[loss=0.4802, simple_loss=0.4991, pruned_loss=0.2307, over 21615.00 frames. ], tot_loss[loss=0.417, simple_loss=0.4373, pruned_loss=0.1984, over 4269168.34 frames. ], batch size: 389, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:07:10,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.07 vs. limit=22.5 2023-06-17 19:07:24,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25500.0, ans=0.1 2023-06-17 19:08:13,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=25680.0, ans=0.0 2023-06-17 19:08:58,887 INFO [train.py:996] (1/4) Epoch 1, batch 4300, loss[loss=0.3774, simple_loss=0.433, pruned_loss=0.1609, over 21860.00 frames. ], tot_loss[loss=0.4262, simple_loss=0.4456, pruned_loss=0.2034, over 4269603.59 frames. ], batch size: 316, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:08:59,549 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:09:03,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-17 19:09:14,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=25860.0, ans=0.2 2023-06-17 19:09:18,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 4.313e+02 6.396e+02 8.892e+02 2.391e+03, threshold=1.279e+03, percent-clipped=16.0 2023-06-17 19:09:59,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. 
limit=15.0 2023-06-17 19:10:42,355 INFO [train.py:996] (1/4) Epoch 1, batch 4350, loss[loss=0.4403, simple_loss=0.4325, pruned_loss=0.2241, over 21625.00 frames. ], tot_loss[loss=0.429, simple_loss=0.4488, pruned_loss=0.2045, over 4262653.01 frames. ], batch size: 332, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:10:42,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=26100.0, ans=0.125 2023-06-17 19:10:52,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=26100.0, ans=0.125 2023-06-17 19:11:28,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=26220.0, ans=0.125 2023-06-17 19:11:39,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=26280.0, ans=0.125 2023-06-17 19:12:27,477 INFO [train.py:996] (1/4) Epoch 1, batch 4400, loss[loss=0.3446, simple_loss=0.374, pruned_loss=0.1576, over 21823.00 frames. ], tot_loss[loss=0.4238, simple_loss=0.4434, pruned_loss=0.202, over 4264100.92 frames. ], batch size: 118, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 19:12:28,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=15.0 2023-06-17 19:12:31,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=26400.0, ans=22.5 2023-06-17 19:12:32,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=26400.0, ans=0.2 2023-06-17 19:12:48,103 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.456e+02 3.803e+02 5.319e+02 7.173e+02 2.856e+03, threshold=1.064e+03, percent-clipped=8.0 2023-06-17 19:13:22,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=26520.0, ans=10.0 2023-06-17 19:14:05,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=26640.0, ans=0.1 2023-06-17 19:14:13,435 INFO [train.py:996] (1/4) Epoch 1, batch 4450, loss[loss=0.4023, simple_loss=0.426, pruned_loss=0.1893, over 21857.00 frames. ], tot_loss[loss=0.4275, simple_loss=0.4495, pruned_loss=0.2028, over 4270018.37 frames. ], batch size: 118, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:14:37,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=26760.0, ans=10.0 2023-06-17 19:14:43,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=26760.0, ans=0.07 2023-06-17 19:14:52,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=26820.0, ans=0.0 2023-06-17 19:15:38,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.03 vs. limit=15.0 2023-06-17 19:15:52,001 INFO [train.py:996] (1/4) Epoch 1, batch 4500, loss[loss=0.4437, simple_loss=0.4475, pruned_loss=0.2199, over 21483.00 frames. ], tot_loss[loss=0.4299, simple_loss=0.4504, pruned_loss=0.2047, over 4279320.68 frames. 
], batch size: 144, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:15:54,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=27000.0, ans=0.0 2023-06-17 19:15:57,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=27000.0, ans=0.1 2023-06-17 19:16:19,139 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.691e+02 6.117e+02 8.779e+02 1.856e+03, threshold=1.223e+03, percent-clipped=14.0 2023-06-17 19:16:47,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=27120.0, ans=0.004973913043478261 2023-06-17 19:17:36,015 INFO [train.py:996] (1/4) Epoch 1, batch 4550, loss[loss=0.5137, simple_loss=0.5128, pruned_loss=0.2573, over 21406.00 frames. ], tot_loss[loss=0.4332, simple_loss=0.4536, pruned_loss=0.2064, over 4280668.92 frames. ], batch size: 471, lr: 4.16e-02, grad_scale: 8.0 2023-06-17 19:17:43,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=27300.0, ans=0.015 2023-06-17 19:18:18,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=27420.0, ans=0.0 2023-06-17 19:18:33,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=27420.0, ans=0.125 2023-06-17 19:18:59,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27540.0, ans=0.125 2023-06-17 19:19:17,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=27600.0, ans=0.125 2023-06-17 19:19:19,233 INFO [train.py:996] (1/4) Epoch 1, batch 4600, loss[loss=0.3731, simple_loss=0.4005, pruned_loss=0.1728, over 21803.00 frames. ], tot_loss[loss=0.4359, simple_loss=0.4558, pruned_loss=0.2081, over 4288003.18 frames. ], batch size: 282, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:19:46,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.894e+02 4.493e+02 6.587e+02 9.549e+02 1.987e+03, threshold=1.317e+03, percent-clipped=15.0 2023-06-17 19:19:57,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-17 19:20:01,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=27720.0, ans=0.125 2023-06-17 19:20:10,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=27720.0, ans=0.0 2023-06-17 19:20:22,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27720.0, ans=0.125 2023-06-17 19:20:56,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.36 vs. limit=22.5 2023-06-17 19:20:58,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=27840.0, ans=0.125 2023-06-17 19:21:02,837 INFO [train.py:996] (1/4) Epoch 1, batch 4650, loss[loss=0.3446, simple_loss=0.3775, pruned_loss=0.1558, over 21524.00 frames. 
], tot_loss[loss=0.4237, simple_loss=0.4437, pruned_loss=0.2018, over 4293335.92 frames. ], batch size: 131, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:21:11,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=27900.0, ans=0.0 2023-06-17 19:21:14,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=27900.0, ans=0.1 2023-06-17 19:22:40,783 INFO [train.py:996] (1/4) Epoch 1, batch 4700, loss[loss=0.403, simple_loss=0.4042, pruned_loss=0.2009, over 21656.00 frames. ], tot_loss[loss=0.4189, simple_loss=0.4373, pruned_loss=0.2002, over 4290303.44 frames. ], batch size: 416, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:23:12,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.969e+02 4.560e+02 5.738e+02 6.731e+02 1.328e+03, threshold=1.148e+03, percent-clipped=1.0 2023-06-17 19:23:35,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=28320.0, ans=0.1 2023-06-17 19:23:56,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=28380.0, ans=0.125 2023-06-17 19:24:07,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28440.0, ans=0.125 2023-06-17 19:24:22,952 INFO [train.py:996] (1/4) Epoch 1, batch 4750, loss[loss=0.4547, simple_loss=0.4613, pruned_loss=0.224, over 21829.00 frames. ], tot_loss[loss=0.414, simple_loss=0.4301, pruned_loss=0.199, over 4294242.69 frames. ], batch size: 112, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:24:42,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28500.0, ans=0.1 2023-06-17 19:25:08,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28620.0, ans=0.1 2023-06-17 19:25:21,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.40 vs. limit=22.5 2023-06-17 19:25:29,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=28680.0, ans=0.0 2023-06-17 19:25:41,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=28680.0, ans=0.125 2023-06-17 19:25:52,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=28740.0, ans=0.125 2023-06-17 19:26:08,413 INFO [train.py:996] (1/4) Epoch 1, batch 4800, loss[loss=0.3762, simple_loss=0.3954, pruned_loss=0.1785, over 21862.00 frames. ], tot_loss[loss=0.4176, simple_loss=0.4331, pruned_loss=0.201, over 4299861.41 frames. 
], batch size: 124, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 19:26:24,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=28800.0, ans=0.125 2023-06-17 19:26:30,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28860.0, ans=0.125 2023-06-17 19:26:40,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.777e+02 4.396e+02 5.630e+02 9.544e+02 1.768e+03, threshold=1.126e+03, percent-clipped=14.0 2023-06-17 19:26:56,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=28920.0, ans=0.015 2023-06-17 19:27:04,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=28920.0, ans=0.07 2023-06-17 19:27:26,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-17 19:27:30,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=29040.0, ans=0.004556521739130435 2023-06-17 19:27:44,653 INFO [train.py:996] (1/4) Epoch 1, batch 4850, loss[loss=0.4384, simple_loss=0.4511, pruned_loss=0.2128, over 21557.00 frames. ], tot_loss[loss=0.416, simple_loss=0.4318, pruned_loss=0.2001, over 4297859.46 frames. ], batch size: 441, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:28:07,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-17 19:28:33,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=29220.0, ans=0.0 2023-06-17 19:28:49,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=29280.0, ans=0.2 2023-06-17 19:29:11,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=29340.0, ans=0.125 2023-06-17 19:29:24,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=29340.0, ans=0.125 2023-06-17 19:29:28,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=29400.0, ans=0.04949747468305833 2023-06-17 19:29:29,273 INFO [train.py:996] (1/4) Epoch 1, batch 4900, loss[loss=0.5946, simple_loss=0.5545, pruned_loss=0.3174, over 21532.00 frames. ], tot_loss[loss=0.4166, simple_loss=0.4334, pruned_loss=0.1999, over 4300132.22 frames. ], batch size: 507, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:30:02,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.351e+02 5.424e+02 7.801e+02 1.566e+03, threshold=1.085e+03, percent-clipped=9.0 2023-06-17 19:30:14,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=29460.0, ans=0.2 2023-06-17 19:30:15,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=8.0 2023-06-17 19:31:25,671 INFO [train.py:996] (1/4) Epoch 1, batch 4950, loss[loss=0.2806, simple_loss=0.3458, pruned_loss=0.1077, over 21281.00 frames. 
], tot_loss[loss=0.4127, simple_loss=0.4342, pruned_loss=0.1956, over 4282277.35 frames. ], batch size: 131, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 19:31:48,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=29760.0, ans=0.125 2023-06-17 19:32:27,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=29880.0, ans=0.0 2023-06-17 19:32:34,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=29880.0, ans=0.07 2023-06-17 19:32:37,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=29880.0, ans=0.0 2023-06-17 19:33:07,721 INFO [train.py:996] (1/4) Epoch 1, batch 5000, loss[loss=0.444, simple_loss=0.4368, pruned_loss=0.2256, over 21503.00 frames. ], tot_loss[loss=0.4064, simple_loss=0.4326, pruned_loss=0.1901, over 4282791.71 frames. ], batch size: 548, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:33:16,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=30000.0, ans=0.125 2023-06-17 19:33:32,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=30060.0, ans=0.2 2023-06-17 19:33:34,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 4.453e+02 5.189e+02 7.873e+02 1.529e+03, threshold=1.038e+03, percent-clipped=6.0 2023-06-17 19:33:36,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=30060.0, ans=0.125 2023-06-17 19:34:26,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=30240.0, ans=0.2 2023-06-17 19:34:45,159 INFO [train.py:996] (1/4) Epoch 1, batch 5050, loss[loss=0.3868, simple_loss=0.4167, pruned_loss=0.1785, over 21462.00 frames. ], tot_loss[loss=0.41, simple_loss=0.4343, pruned_loss=0.1928, over 4290789.67 frames. ], batch size: 131, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:35:02,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=30300.0, ans=0.125 2023-06-17 19:35:05,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=30360.0, ans=0.125 2023-06-17 19:35:12,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=30360.0, ans=0.125 2023-06-17 19:36:13,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=30540.0, ans=0.0 2023-06-17 19:36:27,872 INFO [train.py:996] (1/4) Epoch 1, batch 5100, loss[loss=0.4489, simple_loss=0.4542, pruned_loss=0.2218, over 21726.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4338, pruned_loss=0.1936, over 4290900.20 frames. ], batch size: 112, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:36:50,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. 
limit=12.0 2023-06-17 19:36:53,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=30660.0, ans=0.125 2023-06-17 19:36:59,579 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 4.521e+02 5.607e+02 7.657e+02 1.284e+03, threshold=1.121e+03, percent-clipped=8.0 2023-06-17 19:37:15,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=30720.0, ans=0.2 2023-06-17 19:38:01,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=30840.0, ans=0.0 2023-06-17 19:38:11,324 INFO [train.py:996] (1/4) Epoch 1, batch 5150, loss[loss=0.4649, simple_loss=0.4622, pruned_loss=0.2338, over 21784.00 frames. ], tot_loss[loss=0.4115, simple_loss=0.4332, pruned_loss=0.1949, over 4290980.39 frames. ], batch size: 508, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:38:26,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=30900.0, ans=0.05 2023-06-17 19:38:40,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=30960.0, ans=0.0 2023-06-17 19:39:00,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=31020.0, ans=0.125 2023-06-17 19:39:47,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=31140.0, ans=0.0041 2023-06-17 19:40:01,170 INFO [train.py:996] (1/4) Epoch 1, batch 5200, loss[loss=0.5958, simple_loss=0.5852, pruned_loss=0.3032, over 21505.00 frames. ], tot_loss[loss=0.4144, simple_loss=0.4363, pruned_loss=0.1963, over 4287465.01 frames. ], batch size: 507, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 19:40:22,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=31260.0, ans=0.035 2023-06-17 19:40:27,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.519e+02 4.450e+02 5.949e+02 9.427e+02 1.654e+03, threshold=1.190e+03, percent-clipped=14.0 2023-06-17 19:40:32,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=31260.0, ans=0.00407391304347826 2023-06-17 19:41:22,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31440.0, ans=0.1 2023-06-17 19:41:23,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=31440.0, ans=0.125 2023-06-17 19:41:25,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=31440.0, ans=0.0 2023-06-17 19:41:44,198 INFO [train.py:996] (1/4) Epoch 1, batch 5250, loss[loss=0.3518, simple_loss=0.3998, pruned_loss=0.1519, over 21637.00 frames. ], tot_loss[loss=0.4115, simple_loss=0.4374, pruned_loss=0.1928, over 4285064.21 frames. ], batch size: 230, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:42:09,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-17 19:42:52,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=31680.0, ans=0.09899494936611666 2023-06-17 19:43:18,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=31740.0, ans=0.2 2023-06-17 19:43:18,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=31740.0, ans=0.04949747468305833 2023-06-17 19:43:25,582 INFO [train.py:996] (1/4) Epoch 1, batch 5300, loss[loss=0.3861, simple_loss=0.4058, pruned_loss=0.1832, over 21339.00 frames. ], tot_loss[loss=0.4145, simple_loss=0.4385, pruned_loss=0.1952, over 4286874.54 frames. ], batch size: 159, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:43:53,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.200e+02 5.076e+02 7.002e+02 1.420e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-17 19:44:24,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=31980.0, ans=0.125 2023-06-17 19:44:28,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-17 19:44:49,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=32040.0, ans=0.125 2023-06-17 19:45:07,642 INFO [train.py:996] (1/4) Epoch 1, batch 5350, loss[loss=0.3974, simple_loss=0.415, pruned_loss=0.1899, over 21977.00 frames. ], tot_loss[loss=0.4164, simple_loss=0.4378, pruned_loss=0.1975, over 4291882.54 frames. ], batch size: 333, lr: 4.06e-02, grad_scale: 16.0 2023-06-17 19:45:29,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=32160.0, ans=0.0 2023-06-17 19:45:50,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=32220.0, ans=0.2 2023-06-17 19:45:50,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=32220.0, ans=0.003865217391304348 2023-06-17 19:45:52,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=32220.0, ans=0.125 2023-06-17 19:46:19,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=32280.0, ans=0.125 2023-06-17 19:46:23,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=32280.0, ans=0.2 2023-06-17 19:46:26,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2023-06-17 19:46:54,995 INFO [train.py:996] (1/4) Epoch 1, batch 5400, loss[loss=0.3176, simple_loss=0.3678, pruned_loss=0.1337, over 21710.00 frames. ], tot_loss[loss=0.4174, simple_loss=0.437, pruned_loss=0.199, over 4289656.02 frames. 
], batch size: 247, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:46:55,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=32400.0, ans=0.125 2023-06-17 19:47:05,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=32400.0, ans=0.125 2023-06-17 19:47:23,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 4.680e+02 5.760e+02 7.952e+02 1.690e+03, threshold=1.152e+03, percent-clipped=11.0 2023-06-17 19:47:48,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32580.0, ans=0.1 2023-06-17 19:47:56,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-17 19:48:03,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=32580.0, ans=0.0037869565217391304 2023-06-17 19:48:32,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-17 19:48:38,253 INFO [train.py:996] (1/4) Epoch 1, batch 5450, loss[loss=0.3747, simple_loss=0.4454, pruned_loss=0.152, over 21390.00 frames. ], tot_loss[loss=0.4123, simple_loss=0.4364, pruned_loss=0.1941, over 4286560.47 frames. ], batch size: 194, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:48:53,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=32700.0, ans=0.2 2023-06-17 19:48:55,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=32700.0, ans=0.1 2023-06-17 19:49:13,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32760.0, ans=0.1 2023-06-17 19:49:16,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=32820.0, ans=0.125 2023-06-17 19:50:09,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=32940.0, ans=0.125 2023-06-17 19:50:26,861 INFO [train.py:996] (1/4) Epoch 1, batch 5500, loss[loss=0.3343, simple_loss=0.3953, pruned_loss=0.1366, over 21669.00 frames. ], tot_loss[loss=0.4071, simple_loss=0.4376, pruned_loss=0.1883, over 4280257.58 frames. ], batch size: 247, lr: 4.04e-02, grad_scale: 16.0 2023-06-17 19:50:33,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=33000.0, ans=0.125 2023-06-17 19:50:38,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=33000.0, ans=0.0 2023-06-17 19:50:49,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.509e+02 4.085e+02 5.638e+02 7.299e+02 1.416e+03, threshold=1.128e+03, percent-clipped=6.0 2023-06-17 19:52:07,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=33240.0, ans=0.125 2023-06-17 19:52:13,362 INFO [train.py:996] (1/4) Epoch 1, batch 5550, loss[loss=0.2543, simple_loss=0.3059, pruned_loss=0.1013, over 21425.00 frames. 
], tot_loss[loss=0.3969, simple_loss=0.4314, pruned_loss=0.1812, over 4276980.52 frames. ], batch size: 131, lr: 4.03e-02, grad_scale: 16.0 2023-06-17 19:52:17,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-06-17 19:52:25,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=33300.0, ans=10.0 2023-06-17 19:52:42,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=33360.0, ans=0.125 2023-06-17 19:53:44,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=33540.0, ans=0.2 2023-06-17 19:53:56,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-17 19:53:56,953 INFO [train.py:996] (1/4) Epoch 1, batch 5600, loss[loss=0.6084, simple_loss=0.614, pruned_loss=0.3014, over 21541.00 frames. ], tot_loss[loss=0.3879, simple_loss=0.4259, pruned_loss=0.1749, over 4276982.98 frames. ], batch size: 471, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 19:54:00,604 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:54:29,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 4.088e+02 5.346e+02 7.510e+02 1.919e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-17 19:54:46,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=33720.0, ans=0.125 2023-06-17 19:54:56,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=33720.0, ans=0.2 2023-06-17 19:55:17,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=33780.0, ans=0.1 2023-06-17 19:55:19,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33780.0, ans=0.125 2023-06-17 19:55:20,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=33840.0, ans=0.125 2023-06-17 19:55:38,020 INFO [train.py:996] (1/4) Epoch 1, batch 5650, loss[loss=0.4074, simple_loss=0.4315, pruned_loss=0.1917, over 21875.00 frames. ], tot_loss[loss=0.3973, simple_loss=0.4326, pruned_loss=0.1811, over 4279108.00 frames. ], batch size: 118, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:55:47,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=33900.0, ans=0.125 2023-06-17 19:56:10,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=33960.0, ans=0.125 2023-06-17 19:56:59,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=34080.0, ans=0.125 2023-06-17 19:57:27,462 INFO [train.py:996] (1/4) Epoch 1, batch 5700, loss[loss=0.3684, simple_loss=0.3994, pruned_loss=0.1687, over 21561.00 frames. ], tot_loss[loss=0.4022, simple_loss=0.4336, pruned_loss=0.1854, over 4280865.69 frames. 
], batch size: 212, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:57:53,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=34260.0, ans=0.125 2023-06-17 19:58:00,889 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.144e+02 5.223e+02 7.602e+02 1.708e+03, threshold=1.045e+03, percent-clipped=9.0 2023-06-17 19:58:28,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=34320.0, ans=0.125 2023-06-17 19:58:46,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=34380.0, ans=0.2 2023-06-17 19:58:47,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-17 19:58:51,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=34440.0, ans=0.0033826086956521735 2023-06-17 19:59:11,780 INFO [train.py:996] (1/4) Epoch 1, batch 5750, loss[loss=0.3008, simple_loss=0.3689, pruned_loss=0.1163, over 21738.00 frames. ], tot_loss[loss=0.3937, simple_loss=0.428, pruned_loss=0.1797, over 4281385.50 frames. ], batch size: 298, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 19:59:12,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=34500.0, ans=0.125 2023-06-17 19:59:50,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=34560.0, ans=0.125 2023-06-17 20:00:59,960 INFO [train.py:996] (1/4) Epoch 1, batch 5800, loss[loss=0.3881, simple_loss=0.3912, pruned_loss=0.1925, over 20234.00 frames. ], tot_loss[loss=0.3872, simple_loss=0.4244, pruned_loss=0.175, over 4273207.83 frames. ], batch size: 703, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:01:00,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=34800.0, ans=0.125 2023-06-17 20:01:00,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=34800.0, ans=0.0033043478260869567 2023-06-17 20:01:11,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-17 20:01:28,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.873e+02 4.586e+02 6.036e+02 1.114e+03, threshold=9.172e+02, percent-clipped=1.0 2023-06-17 20:01:54,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=34980.0, ans=0.2 2023-06-17 20:02:03,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-17 20:02:31,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.65 vs. 
limit=6.0 2023-06-17 20:02:40,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=35040.0, ans=0.1 2023-06-17 20:02:43,506 INFO [train.py:996] (1/4) Epoch 1, batch 5850, loss[loss=0.2544, simple_loss=0.3265, pruned_loss=0.09112, over 21396.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4161, pruned_loss=0.1646, over 4270961.33 frames. ], batch size: 131, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:03:03,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=35160.0, ans=0.0 2023-06-17 20:03:12,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=35160.0, ans=0.003226086956521739 2023-06-17 20:03:13,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-17 20:03:49,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-17 20:04:06,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.19 vs. limit=5.0 2023-06-17 20:04:18,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=35340.0, ans=0.1 2023-06-17 20:04:20,743 INFO [train.py:996] (1/4) Epoch 1, batch 5900, loss[loss=0.3149, simple_loss=0.3561, pruned_loss=0.1368, over 21469.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.4045, pruned_loss=0.1541, over 4277283.43 frames. ], batch size: 211, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 20:04:48,502 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 3.363e+02 4.037e+02 5.226e+02 1.298e+03, threshold=8.074e+02, percent-clipped=7.0 2023-06-17 20:04:55,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=35460.0, ans=0.125 2023-06-17 20:05:10,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=35520.0, ans=0.003147826086956522 2023-06-17 20:05:13,330 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:05:45,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=35640.0, ans=0.125 2023-06-17 20:05:59,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-17 20:06:07,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=35700.0, ans=0.02 2023-06-17 20:06:08,242 INFO [train.py:996] (1/4) Epoch 1, batch 5950, loss[loss=0.349, simple_loss=0.4435, pruned_loss=0.1273, over 21202.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.408, pruned_loss=0.1634, over 4279172.47 frames. 
], batch size: 548, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:06:08,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=35700.0, ans=0.125 2023-06-17 20:06:40,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-17 20:06:52,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=35820.0, ans=0.5 2023-06-17 20:07:04,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-17 20:07:37,780 INFO [train.py:996] (1/4) Epoch 1, batch 6000, loss[loss=0.3902, simple_loss=0.4054, pruned_loss=0.1875, over 14759.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.4076, pruned_loss=0.1694, over 4269063.37 frames. ], batch size: 60, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:07:37,781 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-17 20:07:56,500 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3636, simple_loss=0.4388, pruned_loss=0.1442, over 1796401.00 frames. 2023-06-17 20:07:56,501 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24375MB 2023-06-17 20:07:58,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=36000.0, ans=0.0 2023-06-17 20:08:18,294 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:08:19,450 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.037e+02 4.782e+02 6.358e+02 7.928e+02 1.970e+03, threshold=1.272e+03, percent-clipped=23.0 2023-06-17 20:08:23,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=36060.0, ans=0.125 2023-06-17 20:08:48,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-17 20:08:53,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=36120.0, ans=0.003017391304347827 2023-06-17 20:09:04,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=36180.0, ans=0.125 2023-06-17 20:09:07,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=36180.0, ans=0.125 2023-06-17 20:09:34,139 INFO [train.py:996] (1/4) Epoch 1, batch 6050, loss[loss=0.3904, simple_loss=0.4677, pruned_loss=0.1566, over 19792.00 frames. ], tot_loss[loss=0.375, simple_loss=0.4046, pruned_loss=0.1727, over 4263520.43 frames. ], batch size: 703, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 20:09:55,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36360.0, ans=0.1 2023-06-17 20:11:11,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=36540.0, ans=0.2 2023-06-17 20:11:15,926 INFO [train.py:996] (1/4) Epoch 1, batch 6100, loss[loss=0.5048, simple_loss=0.5394, pruned_loss=0.2351, over 20743.00 frames. 
], tot_loss[loss=0.3726, simple_loss=0.4032, pruned_loss=0.171, over 4270113.13 frames. ], batch size: 607, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:11:30,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=36660.0, ans=0.125 2023-06-17 20:11:33,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=36660.0, ans=0.0029 2023-06-17 20:11:38,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 4.105e+02 5.881e+02 8.261e+02 1.678e+03, threshold=1.176e+03, percent-clipped=6.0 2023-06-17 20:11:54,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=36660.0, ans=0.1 2023-06-17 20:12:47,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=36840.0, ans=0.125 2023-06-17 20:12:57,168 INFO [train.py:996] (1/4) Epoch 1, batch 6150, loss[loss=0.358, simple_loss=0.3929, pruned_loss=0.1615, over 21561.00 frames. ], tot_loss[loss=0.3796, simple_loss=0.4067, pruned_loss=0.1762, over 4277156.55 frames. ], batch size: 195, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:13:07,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=36900.0, ans=0.1 2023-06-17 20:13:50,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-17 20:14:23,246 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:14:38,999 INFO [train.py:996] (1/4) Epoch 1, batch 6200, loss[loss=0.3674, simple_loss=0.4092, pruned_loss=0.1628, over 21771.00 frames. ], tot_loss[loss=0.3784, simple_loss=0.4066, pruned_loss=0.1751, over 4276844.62 frames. ], batch size: 247, lr: 3.95e-02, grad_scale: 32.0 2023-06-17 20:14:54,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=37260.0, ans=0.125 2023-06-17 20:15:07,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.629e+02 4.899e+02 6.626e+02 1.862e+03, threshold=9.798e+02, percent-clipped=4.0 2023-06-17 20:15:31,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=37320.0, ans=0.025 2023-06-17 20:15:35,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=37320.0, ans=0.0027565217391304353 2023-06-17 20:16:05,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=37440.0, ans=0.0 2023-06-17 20:16:06,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-17 20:16:22,537 INFO [train.py:996] (1/4) Epoch 1, batch 6250, loss[loss=0.3694, simple_loss=0.4365, pruned_loss=0.1512, over 21796.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.415, pruned_loss=0.1759, over 4277804.04 frames. 
], batch size: 282, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:17:18,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-17 20:17:32,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=37680.0, ans=0.0026782608695652176 2023-06-17 20:17:47,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=37740.0, ans=0.125 2023-06-17 20:17:54,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=37740.0, ans=0.125 2023-06-17 20:18:02,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-17 20:18:03,395 INFO [train.py:996] (1/4) Epoch 1, batch 6300, loss[loss=0.3839, simple_loss=0.4262, pruned_loss=0.1708, over 21816.00 frames. ], tot_loss[loss=0.3867, simple_loss=0.4214, pruned_loss=0.176, over 4284240.90 frames. ], batch size: 282, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:18:36,994 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:18:41,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.261e+02 6.027e+02 8.452e+02 1.541e+03, threshold=1.205e+03, percent-clipped=13.0 2023-06-17 20:18:47,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37860.0, ans=0.1 2023-06-17 20:19:10,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=37920.0, ans=0.00262608695652174 2023-06-17 20:19:38,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=38040.0, ans=0.125 2023-06-17 20:19:46,069 INFO [train.py:996] (1/4) Epoch 1, batch 6350, loss[loss=0.5295, simple_loss=0.5608, pruned_loss=0.2492, over 19803.00 frames. ], tot_loss[loss=0.3987, simple_loss=0.4289, pruned_loss=0.1842, over 4285200.51 frames. ], batch size: 703, lr: 3.93e-02, grad_scale: 32.0 2023-06-17 20:20:13,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=38100.0, ans=0.0 2023-06-17 20:20:47,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=38220.0, ans=0.0025608695652173915 2023-06-17 20:21:46,037 INFO [train.py:996] (1/4) Epoch 1, batch 6400, loss[loss=0.5091, simple_loss=0.5031, pruned_loss=0.2575, over 21773.00 frames. ], tot_loss[loss=0.4087, simple_loss=0.436, pruned_loss=0.1907, over 4284277.34 frames. 
], batch size: 441, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:21:55,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=38400.0, ans=0.125 2023-06-17 20:21:58,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=38400.0, ans=0.125 2023-06-17 20:22:15,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.706e+02 4.358e+02 5.224e+02 7.258e+02 1.926e+03, threshold=1.045e+03, percent-clipped=7.0 2023-06-17 20:23:22,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=38640.0, ans=0.0024695652173913037 2023-06-17 20:23:30,203 INFO [train.py:996] (1/4) Epoch 1, batch 6450, loss[loss=0.3936, simple_loss=0.4369, pruned_loss=0.1752, over 21566.00 frames. ], tot_loss[loss=0.4049, simple_loss=0.4346, pruned_loss=0.1876, over 4283371.47 frames. ], batch size: 389, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:24:16,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=38820.0, ans=0.125 2023-06-17 20:24:55,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=15.0 2023-06-17 20:24:56,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=38940.0, ans=0.125 2023-06-17 20:25:09,308 INFO [train.py:996] (1/4) Epoch 1, batch 6500, loss[loss=0.3427, simple_loss=0.3955, pruned_loss=0.145, over 19937.00 frames. ], tot_loss[loss=0.3981, simple_loss=0.4257, pruned_loss=0.1853, over 4268648.11 frames. ], batch size: 703, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:25:37,744 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.797e+02 4.922e+02 6.987e+02 1.536e+03, threshold=9.843e+02, percent-clipped=9.0 2023-06-17 20:25:47,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-17 20:25:56,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=39120.0, ans=0.125 2023-06-17 20:25:59,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=39120.0, ans=0.015 2023-06-17 20:26:04,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=39180.0, ans=0.2 2023-06-17 20:26:34,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=39240.0, ans=0.125 2023-06-17 20:26:43,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=39240.0, ans=0.125 2023-06-17 20:26:44,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-17 20:26:49,854 INFO [train.py:996] (1/4) Epoch 1, batch 6550, loss[loss=0.3653, simple_loss=0.3989, pruned_loss=0.1659, over 21534.00 frames. ], tot_loss[loss=0.3951, simple_loss=0.4232, pruned_loss=0.1835, over 4273638.40 frames. 
], batch size: 230, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:27:19,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-17 20:28:08,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=39480.0, ans=0.125 2023-06-17 20:28:28,280 INFO [train.py:996] (1/4) Epoch 1, batch 6600, loss[loss=0.3429, simple_loss=0.3573, pruned_loss=0.1643, over 21492.00 frames. ], tot_loss[loss=0.3931, simple_loss=0.4184, pruned_loss=0.1839, over 4270450.07 frames. ], batch size: 230, lr: 3.90e-02, grad_scale: 16.0 2023-06-17 20:28:35,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=39600.0, ans=0.125 2023-06-17 20:28:57,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 4.216e+02 5.009e+02 6.284e+02 1.954e+03, threshold=1.002e+03, percent-clipped=7.0 2023-06-17 20:29:14,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39720.0, ans=0.125 2023-06-17 20:29:31,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-17 20:29:53,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-17 20:30:17,382 INFO [train.py:996] (1/4) Epoch 1, batch 6650, loss[loss=0.3544, simple_loss=0.3743, pruned_loss=0.1673, over 21275.00 frames. ], tot_loss[loss=0.3828, simple_loss=0.4097, pruned_loss=0.1779, over 4270101.21 frames. ], batch size: 159, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:30:25,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=39900.0, ans=0.1 2023-06-17 20:31:30,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.96 vs. limit=22.5 2023-06-17 20:31:37,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=40140.0, ans=0.2 2023-06-17 20:31:54,705 INFO [train.py:996] (1/4) Epoch 1, batch 6700, loss[loss=0.3368, simple_loss=0.3614, pruned_loss=0.1561, over 21763.00 frames. ], tot_loss[loss=0.3785, simple_loss=0.4041, pruned_loss=0.1764, over 4257673.84 frames. ], batch size: 124, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:32:03,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=40200.0, ans=0.125 2023-06-17 20:32:18,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.71 vs. 
limit=22.5 2023-06-17 20:32:24,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.374e+02 3.778e+02 4.910e+02 6.670e+02 1.888e+03, threshold=9.820e+02, percent-clipped=8.0 2023-06-17 20:32:28,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=40260.0, ans=0.025 2023-06-17 20:32:33,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40320.0, ans=0.1 2023-06-17 20:33:04,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-17 20:33:20,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=40440.0, ans=0.1 2023-06-17 20:33:21,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-17 20:33:29,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.11 vs. limit=22.5 2023-06-17 20:33:38,600 INFO [train.py:996] (1/4) Epoch 1, batch 6750, loss[loss=0.3629, simple_loss=0.4398, pruned_loss=0.143, over 19742.00 frames. ], tot_loss[loss=0.3758, simple_loss=0.4015, pruned_loss=0.175, over 4264589.50 frames. ], batch size: 702, lr: 3.88e-02, grad_scale: 16.0 2023-06-17 20:34:33,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.63 vs. limit=15.0 2023-06-17 20:34:56,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0 2023-06-17 20:35:02,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=12.0 2023-06-17 20:35:05,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=40740.0, ans=0.1 2023-06-17 20:35:21,747 INFO [train.py:996] (1/4) Epoch 1, batch 6800, loss[loss=0.3213, simple_loss=0.3518, pruned_loss=0.1455, over 21877.00 frames. ], tot_loss[loss=0.382, simple_loss=0.4049, pruned_loss=0.1796, over 4273021.18 frames. ], batch size: 98, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:35:50,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.135e+02 5.210e+02 7.018e+02 1.112e+03, threshold=1.042e+03, percent-clipped=5.0 2023-06-17 20:35:53,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=40860.0, ans=0.125 2023-06-17 20:36:10,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-17 20:36:45,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41040.0, ans=0.1 2023-06-17 20:37:02,293 INFO [train.py:996] (1/4) Epoch 1, batch 6850, loss[loss=0.3679, simple_loss=0.3838, pruned_loss=0.176, over 21313.00 frames. ], tot_loss[loss=0.3829, simple_loss=0.403, pruned_loss=0.1814, over 4284336.90 frames. 
], batch size: 159, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:37:27,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-17 20:37:41,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=41220.0, ans=0.0 2023-06-17 20:37:54,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=41280.0, ans=0.2 2023-06-17 20:38:05,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=15.0 2023-06-17 20:38:21,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=41340.0, ans=0.125 2023-06-17 20:38:43,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=41400.0, ans=0.0 2023-06-17 20:38:44,850 INFO [train.py:996] (1/4) Epoch 1, batch 6900, loss[loss=0.3486, simple_loss=0.4134, pruned_loss=0.1419, over 21757.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.4059, pruned_loss=0.1818, over 4290215.56 frames. ], batch size: 332, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 20:39:15,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 4.043e+02 5.136e+02 6.723e+02 1.147e+03, threshold=1.027e+03, percent-clipped=4.0 2023-06-17 20:39:41,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=41520.0, ans=0.05 2023-06-17 20:39:51,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=41580.0, ans=0.125 2023-06-17 20:40:33,690 INFO [train.py:996] (1/4) Epoch 1, batch 6950, loss[loss=0.339, simple_loss=0.4137, pruned_loss=0.1321, over 21255.00 frames. ], tot_loss[loss=0.379, simple_loss=0.4067, pruned_loss=0.1757, over 4286226.75 frames. ], batch size: 548, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:41:05,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=12.0 2023-06-17 20:41:16,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=41820.0, ans=0.07 2023-06-17 20:41:52,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=41940.0, ans=0.125 2023-06-17 20:42:15,392 INFO [train.py:996] (1/4) Epoch 1, batch 7000, loss[loss=0.3668, simple_loss=0.401, pruned_loss=0.1663, over 15301.00 frames. ], tot_loss[loss=0.387, simple_loss=0.4109, pruned_loss=0.1816, over 4275704.16 frames. ], batch size: 61, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:42:37,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=42060.0, ans=0.0 2023-06-17 20:42:40,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.856e+02 5.678e+02 7.793e+02 1.284e+03, threshold=1.136e+03, percent-clipped=9.0 2023-06-17 20:43:23,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-06-17 20:43:58,195 INFO [train.py:996] (1/4) Epoch 1, batch 7050, loss[loss=0.3257, simple_loss=0.3848, pruned_loss=0.1333, over 21581.00 frames. ], tot_loss[loss=0.3825, simple_loss=0.4075, pruned_loss=0.1787, over 4271965.25 frames. ], batch size: 263, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 20:44:15,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=42360.0, ans=0.125 2023-06-17 20:44:20,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=42360.0, ans=0.1 2023-06-17 20:45:23,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=42540.0, ans=0.125 2023-06-17 20:45:36,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-17 20:45:41,222 INFO [train.py:996] (1/4) Epoch 1, batch 7100, loss[loss=0.36, simple_loss=0.4015, pruned_loss=0.1592, over 21471.00 frames. ], tot_loss[loss=0.3906, simple_loss=0.4161, pruned_loss=0.1826, over 4262037.68 frames. ], batch size: 131, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:45:51,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=42600.0, ans=0.035 2023-06-17 20:46:23,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.499e+02 4.765e+02 6.343e+02 1.936e+03, threshold=9.530e+02, percent-clipped=5.0 2023-06-17 20:47:12,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=42840.0, ans=0.0 2023-06-17 20:47:17,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=42840.0, ans=0.125 2023-06-17 20:47:23,717 INFO [train.py:996] (1/4) Epoch 1, batch 7150, loss[loss=0.4248, simple_loss=0.448, pruned_loss=0.2008, over 22011.00 frames. ], tot_loss[loss=0.3829, simple_loss=0.4111, pruned_loss=0.1774, over 4265710.49 frames. ], batch size: 317, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:47:58,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=42960.0, ans=0.0 2023-06-17 20:48:29,791 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:48:54,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=43140.0, ans=0.125 2023-06-17 20:49:07,063 INFO [train.py:996] (1/4) Epoch 1, batch 7200, loss[loss=0.4493, simple_loss=0.4359, pruned_loss=0.2314, over 21244.00 frames. ], tot_loss[loss=0.3895, simple_loss=0.4152, pruned_loss=0.1819, over 4269847.15 frames. 
], batch size: 471, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:49:27,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43200.0, ans=0.0 2023-06-17 20:49:27,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=43200.0, ans=0.1 2023-06-17 20:49:44,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=43260.0, ans=0.0 2023-06-17 20:49:48,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.300e+02 5.251e+02 6.410e+02 9.416e+02, threshold=1.050e+03, percent-clipped=0.0 2023-06-17 20:50:33,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=43440.0, ans=0.0014260869565217386 2023-06-17 20:50:48,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=43500.0, ans=0.1 2023-06-17 20:50:49,416 INFO [train.py:996] (1/4) Epoch 1, batch 7250, loss[loss=0.3929, simple_loss=0.4036, pruned_loss=0.1911, over 21540.00 frames. ], tot_loss[loss=0.3867, simple_loss=0.4098, pruned_loss=0.1818, over 4275880.17 frames. ], batch size: 132, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:51:15,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43560.0, ans=0.125 2023-06-17 20:51:20,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=43560.0, ans=0.07 2023-06-17 20:52:31,510 INFO [train.py:996] (1/4) Epoch 1, batch 7300, loss[loss=0.3241, simple_loss=0.3597, pruned_loss=0.1442, over 21519.00 frames. ], tot_loss[loss=0.3794, simple_loss=0.4009, pruned_loss=0.1789, over 4283346.67 frames. ], batch size: 132, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 20:52:38,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=43800.0, ans=0.125 2023-06-17 20:53:07,893 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.808e+02 5.144e+02 6.713e+02 1.157e+03, threshold=1.029e+03, percent-clipped=4.0 2023-06-17 20:54:07,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=44040.0, ans=0.125 2023-06-17 20:54:25,386 INFO [train.py:996] (1/4) Epoch 1, batch 7350, loss[loss=0.3878, simple_loss=0.4052, pruned_loss=0.1852, over 21698.00 frames. ], tot_loss[loss=0.3773, simple_loss=0.397, pruned_loss=0.1788, over 4277703.71 frames. ], batch size: 298, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:54:52,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44160.0, ans=0.1 2023-06-17 20:55:02,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=44220.0, ans=0.1 2023-06-17 20:55:18,391 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-17 20:56:02,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. 
limit=15.0 2023-06-17 20:56:05,309 INFO [train.py:996] (1/4) Epoch 1, batch 7400, loss[loss=0.3408, simple_loss=0.3604, pruned_loss=0.1606, over 21266.00 frames. ], tot_loss[loss=0.3864, simple_loss=0.4067, pruned_loss=0.1831, over 4276109.86 frames. ], batch size: 176, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:56:22,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44400.0, ans=0.1 2023-06-17 20:56:36,852 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 4.273e+02 5.813e+02 7.639e+02 1.411e+03, threshold=1.163e+03, percent-clipped=7.0 2023-06-17 20:57:03,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=44580.0, ans=0.1 2023-06-17 20:57:28,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=44640.0, ans=0.0 2023-06-17 20:57:32,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0 2023-06-17 20:57:42,557 INFO [train.py:996] (1/4) Epoch 1, batch 7450, loss[loss=0.3862, simple_loss=0.3939, pruned_loss=0.1892, over 21476.00 frames. ], tot_loss[loss=0.3847, simple_loss=0.4054, pruned_loss=0.182, over 4270195.17 frames. ], batch size: 211, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 20:58:01,587 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:58:10,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-17 20:58:36,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=44820.0, ans=0.0011260869565217404 2023-06-17 20:59:31,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=44940.0, ans=0.125 2023-06-17 20:59:33,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=45000.0, ans=0.0 2023-06-17 20:59:34,473 INFO [train.py:996] (1/4) Epoch 1, batch 7500, loss[loss=0.3665, simple_loss=0.4352, pruned_loss=0.1489, over 21444.00 frames. ], tot_loss[loss=0.3893, simple_loss=0.4113, pruned_loss=0.1837, over 4273912.05 frames. ], batch size: 211, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 20:59:59,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=45060.0, ans=0.0 2023-06-17 21:00:01,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.419e+02 5.234e+02 7.057e+02 1.215e+03, threshold=1.047e+03, percent-clipped=2.0 2023-06-17 21:01:03,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=45240.0, ans=0.04949747468305833 2023-06-17 21:01:07,059 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:01:13,066 INFO [train.py:996] (1/4) Epoch 1, batch 7550, loss[loss=0.3056, simple_loss=0.3687, pruned_loss=0.1213, over 21296.00 frames. ], tot_loss[loss=0.3894, simple_loss=0.4186, pruned_loss=0.1801, over 4274892.52 frames. 
], batch size: 194, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 21:01:15,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=45300.0, ans=0.125 2023-06-17 21:02:32,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=45480.0, ans=0.2 2023-06-17 21:02:36,977 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:02:56,467 INFO [train.py:996] (1/4) Epoch 1, batch 7600, loss[loss=0.392, simple_loss=0.4312, pruned_loss=0.1765, over 21836.00 frames. ], tot_loss[loss=0.3886, simple_loss=0.4192, pruned_loss=0.179, over 4280633.28 frames. ], batch size: 351, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:03:22,195 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.546e+02 4.998e+02 6.623e+02 1.459e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-17 21:03:29,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=45720.0, ans=0.125 2023-06-17 21:03:29,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.89 vs. limit=15.0 2023-06-17 21:04:38,379 INFO [train.py:996] (1/4) Epoch 1, batch 7650, loss[loss=0.4234, simple_loss=0.424, pruned_loss=0.2115, over 21803.00 frames. ], tot_loss[loss=0.3924, simple_loss=0.4183, pruned_loss=0.1832, over 4288119.04 frames. ], batch size: 441, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:05:20,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=46020.0, ans=0.0 2023-06-17 21:05:20,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-17 21:05:40,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=46080.0, ans=0.0 2023-06-17 21:06:09,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=46140.0, ans=0.125 2023-06-17 21:06:22,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=46200.0, ans=0.125 2023-06-17 21:06:24,124 INFO [train.py:996] (1/4) Epoch 1, batch 7700, loss[loss=0.4695, simple_loss=0.4717, pruned_loss=0.2336, over 21452.00 frames. ], tot_loss[loss=0.3989, simple_loss=0.4229, pruned_loss=0.1874, over 4292674.85 frames. 
], batch size: 471, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 21:06:36,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=46200.0, ans=0.125 2023-06-17 21:06:43,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=46260.0, ans=0.2 2023-06-17 21:06:50,966 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.909e+02 4.180e+02 5.539e+02 6.663e+02 1.200e+03, threshold=1.108e+03, percent-clipped=4.0 2023-06-17 21:06:58,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=46320.0, ans=0.2 2023-06-17 21:07:22,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=46320.0, ans=0.0 2023-06-17 21:07:41,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-17 21:07:54,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=46440.0, ans=0.125 2023-06-17 21:07:57,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=46440.0, ans=0.04949747468305833 2023-06-17 21:07:58,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-17 21:08:09,030 INFO [train.py:996] (1/4) Epoch 1, batch 7750, loss[loss=0.6107, simple_loss=0.6147, pruned_loss=0.3033, over 21470.00 frames. ], tot_loss[loss=0.3984, simple_loss=0.4266, pruned_loss=0.1851, over 4287992.64 frames. ], batch size: 507, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:08:23,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=46500.0, ans=0.125 2023-06-17 21:09:16,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.06 vs. limit=22.5 2023-06-17 21:09:53,148 INFO [train.py:996] (1/4) Epoch 1, batch 7800, loss[loss=0.3632, simple_loss=0.3767, pruned_loss=0.1749, over 20749.00 frames. ], tot_loss[loss=0.3966, simple_loss=0.4257, pruned_loss=0.1838, over 4279568.10 frames. ], batch size: 609, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:10:00,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=46800.0, ans=0.0 2023-06-17 21:10:29,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 4.420e+02 5.608e+02 7.244e+02 1.529e+03, threshold=1.122e+03, percent-clipped=4.0 2023-06-17 21:11:25,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.83 vs. limit=6.0 2023-06-17 21:11:26,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=47040.0, ans=0.0 2023-06-17 21:11:27,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=47040.0, ans=0.05 2023-06-17 21:11:34,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.92 vs. 
limit=6.0 2023-06-17 21:11:35,291 INFO [train.py:996] (1/4) Epoch 1, batch 7850, loss[loss=0.3394, simple_loss=0.3572, pruned_loss=0.1608, over 21555.00 frames. ], tot_loss[loss=0.3901, simple_loss=0.4182, pruned_loss=0.181, over 4267973.73 frames. ], batch size: 263, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 21:11:50,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=47160.0, ans=0.2 2023-06-17 21:12:39,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=47220.0, ans=0.02 2023-06-17 21:12:43,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=19.68 vs. limit=15.0 2023-06-17 21:12:45,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-17 21:13:19,397 INFO [train.py:996] (1/4) Epoch 1, batch 7900, loss[loss=0.3121, simple_loss=0.3452, pruned_loss=0.1395, over 21145.00 frames. ], tot_loss[loss=0.3855, simple_loss=0.4121, pruned_loss=0.1794, over 4263451.41 frames. ], batch size: 143, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:13:50,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=47460.0, ans=0.125 2023-06-17 21:13:56,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 5.315e+02 6.476e+02 7.892e+02 1.492e+03, threshold=1.295e+03, percent-clipped=7.0 2023-06-17 21:14:01,965 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=3.397e-03 2023-06-17 21:14:30,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=47580.0, ans=0.2 2023-06-17 21:14:38,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=47580.0, ans=0.125 2023-06-17 21:14:57,990 INFO [train.py:996] (1/4) Epoch 1, batch 7950, loss[loss=0.428, simple_loss=0.4461, pruned_loss=0.205, over 20700.00 frames. ], tot_loss[loss=0.3889, simple_loss=0.4171, pruned_loss=0.1803, over 4251889.19 frames. ], batch size: 607, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:15:39,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47760.0, ans=0.1 2023-06-17 21:15:46,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=47760.0, ans=0.125 2023-06-17 21:15:58,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=47820.0, ans=0.09899494936611666 2023-06-17 21:16:01,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=47820.0, ans=0.5 2023-06-17 21:16:10,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.14 vs. limit=15.0 2023-06-17 21:16:50,544 INFO [train.py:996] (1/4) Epoch 1, batch 8000, loss[loss=0.3671, simple_loss=0.407, pruned_loss=0.1636, over 21578.00 frames. ], tot_loss[loss=0.3956, simple_loss=0.422, pruned_loss=0.1846, over 4255279.27 frames. 
], batch size: 230, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:16:57,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=48000.0, ans=0.2 2023-06-17 21:17:28,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.757e+02 4.328e+02 5.465e+02 6.460e+02 1.072e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-17 21:17:44,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=48120.0, ans=0.125 2023-06-17 21:18:06,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=48180.0, ans=0.0003956521739130435 2023-06-17 21:18:16,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=48240.0, ans=0.125 2023-06-17 21:18:43,556 INFO [train.py:996] (1/4) Epoch 1, batch 8050, loss[loss=0.2986, simple_loss=0.3319, pruned_loss=0.1326, over 21202.00 frames. ], tot_loss[loss=0.3947, simple_loss=0.4226, pruned_loss=0.1833, over 4259558.23 frames. ], batch size: 159, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:19:01,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=48360.0, ans=0.2 2023-06-17 21:19:02,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=48360.0, ans=0.2 2023-06-17 21:20:27,368 INFO [train.py:996] (1/4) Epoch 1, batch 8100, loss[loss=0.3955, simple_loss=0.4224, pruned_loss=0.1843, over 21873.00 frames. ], tot_loss[loss=0.3949, simple_loss=0.4221, pruned_loss=0.1838, over 4266151.01 frames. ], batch size: 391, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 21:20:54,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 4.589e+02 6.477e+02 8.462e+02 1.426e+03, threshold=1.295e+03, percent-clipped=5.0 2023-06-17 21:21:44,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=48780.0, ans=0.125 2023-06-17 21:21:55,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=48840.0, ans=0.0 2023-06-17 21:22:14,023 INFO [train.py:996] (1/4) Epoch 1, batch 8150, loss[loss=0.3355, simple_loss=0.3949, pruned_loss=0.138, over 21629.00 frames. ], tot_loss[loss=0.406, simple_loss=0.4336, pruned_loss=0.1891, over 4261767.72 frames. ], batch size: 247, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:23:58,506 INFO [train.py:996] (1/4) Epoch 1, batch 8200, loss[loss=0.441, simple_loss=0.436, pruned_loss=0.2229, over 20032.00 frames. ], tot_loss[loss=0.3985, simple_loss=0.4275, pruned_loss=0.1848, over 4262169.75 frames. 
], batch size: 703, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:24:07,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=49200.0, ans=0.0 2023-06-17 21:24:36,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.823e+02 4.911e+02 6.054e+02 7.943e+02 1.649e+03, threshold=1.211e+03, percent-clipped=3.0 2023-06-17 21:25:06,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=49320.0, ans=0.00014782608695652205 2023-06-17 21:25:08,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=49380.0, ans=0.0 2023-06-17 21:25:24,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=49440.0, ans=0.2 2023-06-17 21:25:42,202 INFO [train.py:996] (1/4) Epoch 1, batch 8250, loss[loss=0.4544, simple_loss=0.4849, pruned_loss=0.212, over 21603.00 frames. ], tot_loss[loss=0.3998, simple_loss=0.4266, pruned_loss=0.1865, over 4264065.56 frames. ], batch size: 414, lr: 3.69e-02, grad_scale: 16.0 2023-06-17 21:25:47,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=49500.0, ans=0.0 2023-06-17 21:26:35,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=49620.0, ans=0.125 2023-06-17 21:27:08,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49740.0, ans=0.0 2023-06-17 21:27:25,179 INFO [train.py:996] (1/4) Epoch 1, batch 8300, loss[loss=0.3447, simple_loss=0.3886, pruned_loss=0.1504, over 21773.00 frames. ], tot_loss[loss=0.3917, simple_loss=0.4217, pruned_loss=0.1808, over 4272015.87 frames. ], batch size: 282, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:27:52,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=49860.0, ans=0.0 2023-06-17 21:28:02,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=49860.0, ans=3.043478260869592e-05 2023-06-17 21:28:03,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.951e+02 4.948e+02 6.196e+02 1.080e+03, threshold=9.896e+02, percent-clipped=0.0 2023-06-17 21:28:36,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=49980.0, ans=4.347826086955817e-06 2023-06-17 21:28:41,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=49980.0, ans=0.2 2023-06-17 21:29:01,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=50040.0, ans=0.125 2023-06-17 21:29:09,835 INFO [train.py:996] (1/4) Epoch 1, batch 8350, loss[loss=0.3352, simple_loss=0.3812, pruned_loss=0.1446, over 21348.00 frames. ], tot_loss[loss=0.3863, simple_loss=0.4187, pruned_loss=0.177, over 4269585.87 frames. 
], batch size: 176, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:29:47,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=50160.0, ans=0.2 2023-06-17 21:30:10,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=50220.0, ans=0.125 2023-06-17 21:30:23,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=50280.0, ans=0.125 2023-06-17 21:30:23,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=50280.0, ans=0.2 2023-06-17 21:30:26,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=50280.0, ans=0.0 2023-06-17 21:30:53,932 INFO [train.py:996] (1/4) Epoch 1, batch 8400, loss[loss=0.3464, simple_loss=0.4087, pruned_loss=0.142, over 21664.00 frames. ], tot_loss[loss=0.3762, simple_loss=0.4107, pruned_loss=0.1708, over 4260313.68 frames. ], batch size: 414, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 21:31:17,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=50460.0, ans=0.125 2023-06-17 21:31:32,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.584e+02 3.530e+02 4.836e+02 6.875e+02 1.901e+03, threshold=9.672e+02, percent-clipped=8.0 2023-06-17 21:32:05,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=50580.0, ans=0.0 2023-06-17 21:32:21,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=50640.0, ans=0.2 2023-06-17 21:32:41,516 INFO [train.py:996] (1/4) Epoch 1, batch 8450, loss[loss=0.3794, simple_loss=0.3968, pruned_loss=0.1811, over 21805.00 frames. ], tot_loss[loss=0.381, simple_loss=0.414, pruned_loss=0.174, over 4266597.13 frames. ], batch size: 282, lr: 3.67e-02, grad_scale: 16.0 2023-06-17 21:32:41,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=50700.0, ans=0.0 2023-06-17 21:32:45,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=15.0 2023-06-17 21:32:48,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=50700.0, ans=0.1 2023-06-17 21:33:08,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=50760.0, ans=0.125 2023-06-17 21:33:14,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=50760.0, ans=0.2 2023-06-17 21:33:34,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=50820.0, ans=0.125 2023-06-17 21:34:01,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=50940.0, ans=0.0 2023-06-17 21:34:11,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=12.0 2023-06-17 21:34:13,627 INFO [train.py:996] (1/4) Epoch 1, batch 8500, loss[loss=0.3969, simple_loss=0.416, pruned_loss=0.1889, over 21363.00 frames. 
], tot_loss[loss=0.3806, simple_loss=0.4108, pruned_loss=0.1752, over 4259432.08 frames. ], batch size: 131, lr: 3.66e-02, grad_scale: 16.0 2023-06-17 21:34:44,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=51060.0, ans=0.125 2023-06-17 21:35:00,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.023e+02 4.457e+02 5.550e+02 7.009e+02 1.801e+03, threshold=1.110e+03, percent-clipped=10.0 2023-06-17 21:35:37,676 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:35:45,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-17 21:35:46,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-17 21:36:04,099 INFO [train.py:996] (1/4) Epoch 1, batch 8550, loss[loss=0.3354, simple_loss=0.3826, pruned_loss=0.1441, over 21290.00 frames. ], tot_loss[loss=0.388, simple_loss=0.417, pruned_loss=0.1795, over 4264843.59 frames. ], batch size: 176, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:36:13,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=51300.0, ans=0.125 2023-06-17 21:37:00,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=51420.0, ans=0.125 2023-06-17 21:37:09,065 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:37:21,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-17 21:37:56,909 INFO [train.py:996] (1/4) Epoch 1, batch 8600, loss[loss=0.4201, simple_loss=0.5082, pruned_loss=0.1661, over 19910.00 frames. ], tot_loss[loss=0.3977, simple_loss=0.4271, pruned_loss=0.1842, over 4261962.06 frames. ], batch size: 702, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:38:03,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=51600.0, ans=0.0 2023-06-17 21:38:38,363 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.423e+02 4.881e+02 5.851e+02 7.697e+02 1.206e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-17 21:38:50,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=51720.0, ans=0.125 2023-06-17 21:39:04,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-17 21:39:47,591 INFO [train.py:996] (1/4) Epoch 1, batch 8650, loss[loss=0.2168, simple_loss=0.2627, pruned_loss=0.08541, over 16128.00 frames. ], tot_loss[loss=0.4, simple_loss=0.4327, pruned_loss=0.1837, over 4263928.06 frames. 
], batch size: 60, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:40:24,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=52020.0, ans=0.2 2023-06-17 21:40:29,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=52020.0, ans=0.0 2023-06-17 21:41:24,306 INFO [train.py:996] (1/4) Epoch 1, batch 8700, loss[loss=0.4477, simple_loss=0.4297, pruned_loss=0.2328, over 21309.00 frames. ], tot_loss[loss=0.3875, simple_loss=0.4209, pruned_loss=0.1771, over 4270665.43 frames. ], batch size: 507, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:41:45,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=52200.0, ans=0.2 2023-06-17 21:41:57,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-17 21:42:04,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.951e+02 4.948e+02 6.720e+02 1.137e+03, threshold=9.897e+02, percent-clipped=0.0 2023-06-17 21:42:20,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=52320.0, ans=0.125 2023-06-17 21:42:32,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=52380.0, ans=0.2 2023-06-17 21:42:55,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=52440.0, ans=0.0 2023-06-17 21:43:13,259 INFO [train.py:996] (1/4) Epoch 1, batch 8750, loss[loss=0.4043, simple_loss=0.4243, pruned_loss=0.1921, over 21862.00 frames. ], tot_loss[loss=0.3844, simple_loss=0.4144, pruned_loss=0.1772, over 4279653.27 frames. ], batch size: 124, lr: 3.63e-02, grad_scale: 16.0 2023-06-17 21:44:12,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=52680.0, ans=0.07 2023-06-17 21:44:23,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=52680.0, ans=0.0 2023-06-17 21:45:03,160 INFO [train.py:996] (1/4) Epoch 1, batch 8800, loss[loss=0.4824, simple_loss=0.4991, pruned_loss=0.2328, over 21586.00 frames. ], tot_loss[loss=0.3963, simple_loss=0.4254, pruned_loss=0.1836, over 4283812.10 frames. ], batch size: 414, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:45:23,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=52860.0, ans=0.125 2023-06-17 21:45:32,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 5.221e+02 6.385e+02 9.121e+02 2.025e+03, threshold=1.277e+03, percent-clipped=20.0 2023-06-17 21:45:34,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=52920.0, ans=0.125 2023-06-17 21:46:49,094 INFO [train.py:996] (1/4) Epoch 1, batch 8850, loss[loss=0.4634, simple_loss=0.5288, pruned_loss=0.199, over 19809.00 frames. ], tot_loss[loss=0.4055, simple_loss=0.4352, pruned_loss=0.1879, over 4281584.40 frames. 
], batch size: 702, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:47:10,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=53160.0, ans=0.2 2023-06-17 21:47:17,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-17 21:47:23,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-17 21:48:33,205 INFO [train.py:996] (1/4) Epoch 1, batch 8900, loss[loss=0.3324, simple_loss=0.3728, pruned_loss=0.146, over 21631.00 frames. ], tot_loss[loss=0.3988, simple_loss=0.4284, pruned_loss=0.1846, over 4278089.18 frames. ], batch size: 247, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:48:44,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53400.0, ans=0.1 2023-06-17 21:49:10,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.829e+02 5.153e+02 6.429e+02 1.062e+03, threshold=1.031e+03, percent-clipped=0.0 2023-06-17 21:49:19,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-17 21:49:22,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=53520.0, ans=0.125 2023-06-17 21:50:19,519 INFO [train.py:996] (1/4) Epoch 1, batch 8950, loss[loss=0.3744, simple_loss=0.3972, pruned_loss=0.1758, over 21570.00 frames. ], tot_loss[loss=0.3954, simple_loss=0.427, pruned_loss=0.1819, over 4276397.75 frames. ], batch size: 263, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:50:21,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=53700.0, ans=0.125 2023-06-17 21:50:31,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53700.0, ans=0.1 2023-06-17 21:50:56,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53820.0, ans=0.1 2023-06-17 21:51:19,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=53820.0, ans=0.0 2023-06-17 21:51:31,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53880.0, ans=0.1 2023-06-17 21:52:01,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=53940.0, ans=0.0 2023-06-17 21:52:04,289 INFO [train.py:996] (1/4) Epoch 1, batch 9000, loss[loss=0.4932, simple_loss=0.5683, pruned_loss=0.2091, over 19733.00 frames. ], tot_loss[loss=0.3917, simple_loss=0.4205, pruned_loss=0.1814, over 4277488.52 frames. ], batch size: 702, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 21:52:04,290 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-17 21:52:23,524 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3404, simple_loss=0.4251, pruned_loss=0.1278, over 1796401.00 frames. 
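[Editor's note] The loss fields in the train/validation records of this log decompose consistently as loss ≈ 0.5 · simple_loss + pruned_loss (for example, the validation record just above: 0.5 · 0.4251 + 0.1278 ≈ 0.3404). The sketch below is illustrative only and is not code from this repository or from k2/icefall; the names SIMPLE_LOSS_SCALE, RunningFrameAverage and combined_loss are invented here, and the frame-weighted averaging used for tot_loss is an assumption about how the running figure might be accumulated.

# Illustrative sketch only (not repository code): reproduces the relationship that the
# logged loss values are consistent with, and a guessed frame-weighted running average
# for tot_loss. All names are hypothetical.
from dataclasses import dataclass

SIMPLE_LOSS_SCALE = 0.5  # assumed weight; the logged loss/simple_loss/pruned_loss values fit it


@dataclass
class RunningFrameAverage:
    """Frame-weighted running average (an assumption about how tot_loss is accumulated)."""
    weighted_sum: float = 0.0
    frames: float = 0.0

    def update(self, value: float, num_frames: float) -> None:
        # Each batch contributes its loss weighted by the number of frames it covers.
        self.weighted_sum += value * num_frames
        self.frames += num_frames

    @property
    def value(self) -> float:
        return self.weighted_sum / max(self.frames, 1.0)


def combined_loss(simple_loss: float, pruned_loss: float) -> float:
    """Combine the two logged components the way these records appear to."""
    return SIMPLE_LOSS_SCALE * simple_loss + pruned_loss


if __name__ == "__main__":
    # Check against the validation record above: loss=0.3404, simple_loss=0.4251, pruned_loss=0.1278
    print(combined_loss(0.4251, 0.1278))  # ~0.3404, matching the logged validation loss

    # Hypothetical usage of the running average with that same record's frame count.
    avg = RunningFrameAverage()
    avg.update(combined_loss(0.4251, 0.1278), 1796401.0)
    print(avg.value)  # equals the single batch's combined loss here

[End of editor's note; the original log continues below.]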
2023-06-17 21:52:23,525 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24375MB 2023-06-17 21:53:06,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.196e+02 5.638e+02 6.877e+02 1.385e+03, threshold=1.128e+03, percent-clipped=3.0 2023-06-17 21:54:04,845 INFO [train.py:996] (1/4) Epoch 1, batch 9050, loss[loss=0.3879, simple_loss=0.4061, pruned_loss=0.1849, over 21586.00 frames. ], tot_loss[loss=0.3838, simple_loss=0.415, pruned_loss=0.1763, over 4271016.87 frames. ], batch size: 230, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:55:09,160 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.817e-03 2023-06-17 21:55:10,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=54480.0, ans=0.0 2023-06-17 21:55:42,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=54540.0, ans=0.2 2023-06-17 21:55:44,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=54600.0, ans=0.125 2023-06-17 21:55:45,363 INFO [train.py:996] (1/4) Epoch 1, batch 9100, loss[loss=0.3564, simple_loss=0.4244, pruned_loss=0.1442, over 21821.00 frames. ], tot_loss[loss=0.3928, simple_loss=0.4233, pruned_loss=0.1812, over 4277227.61 frames. ], batch size: 371, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:55:56,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=54600.0, ans=0.125 2023-06-17 21:55:58,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=54600.0, ans=0.0 2023-06-17 21:56:31,621 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.753e+02 4.940e+02 6.814e+02 2.174e+03, threshold=9.881e+02, percent-clipped=7.0 2023-06-17 21:56:45,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=54720.0, ans=0.05 2023-06-17 21:56:48,709 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:57:35,302 INFO [train.py:996] (1/4) Epoch 1, batch 9150, loss[loss=0.3853, simple_loss=0.4003, pruned_loss=0.1852, over 21035.00 frames. ], tot_loss[loss=0.386, simple_loss=0.4221, pruned_loss=0.1749, over 4276288.50 frames. ], batch size: 608, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:58:04,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-17 21:58:08,939 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=15.0 2023-06-17 21:58:16,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=54960.0, ans=0.0 2023-06-17 21:58:34,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-17 21:58:43,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. 
limit=10.0 2023-06-17 21:58:54,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-17 21:59:29,966 INFO [train.py:996] (1/4) Epoch 1, batch 9200, loss[loss=0.5421, simple_loss=0.5849, pruned_loss=0.2496, over 19720.00 frames. ], tot_loss[loss=0.3853, simple_loss=0.4237, pruned_loss=0.1734, over 4273141.17 frames. ], batch size: 702, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:59:45,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-17 21:59:59,129 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.127e+02 5.458e+02 7.694e+02 1.391e+03, threshold=1.092e+03, percent-clipped=9.0 2023-06-17 22:00:29,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-17 22:01:13,162 INFO [train.py:996] (1/4) Epoch 1, batch 9250, loss[loss=0.3773, simple_loss=0.3897, pruned_loss=0.1824, over 21658.00 frames. ], tot_loss[loss=0.3953, simple_loss=0.43, pruned_loss=0.1803, over 4269992.00 frames. ], batch size: 298, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:01:17,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=55500.0, ans=0.125 2023-06-17 22:01:25,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=55500.0, ans=0.125 2023-06-17 22:02:16,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=55680.0, ans=0.5 2023-06-17 22:02:58,735 INFO [train.py:996] (1/4) Epoch 1, batch 9300, loss[loss=0.4178, simple_loss=0.4546, pruned_loss=0.1905, over 21611.00 frames. ], tot_loss[loss=0.3929, simple_loss=0.4256, pruned_loss=0.1801, over 4267404.48 frames. ], batch size: 414, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:03:07,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=55800.0, ans=0.2 2023-06-17 22:03:28,667 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 4.090e+02 5.143e+02 6.278e+02 1.452e+03, threshold=1.029e+03, percent-clipped=2.0 2023-06-17 22:04:44,704 INFO [train.py:996] (1/4) Epoch 1, batch 9350, loss[loss=0.4312, simple_loss=0.4575, pruned_loss=0.2024, over 21823.00 frames. ], tot_loss[loss=0.3989, simple_loss=0.4335, pruned_loss=0.1822, over 4270139.74 frames. ], batch size: 118, lr: 3.56e-02, grad_scale: 32.0 2023-06-17 22:04:49,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-17 22:06:00,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.80 vs. 
limit=15.0 2023-06-17 22:06:02,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=56280.0, ans=0.125 2023-06-17 22:06:07,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=56280.0, ans=0.0 2023-06-17 22:06:07,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=56280.0, ans=0.125 2023-06-17 22:06:09,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=56280.0, ans=0.125 2023-06-17 22:06:29,231 INFO [train.py:996] (1/4) Epoch 1, batch 9400, loss[loss=0.3583, simple_loss=0.3803, pruned_loss=0.1681, over 21704.00 frames. ], tot_loss[loss=0.4022, simple_loss=0.4361, pruned_loss=0.1842, over 4274996.95 frames. ], batch size: 282, lr: 3.55e-02, grad_scale: 32.0 2023-06-17 22:07:03,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 4.762e+02 5.781e+02 7.006e+02 1.289e+03, threshold=1.156e+03, percent-clipped=1.0 2023-06-17 22:07:04,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=56460.0, ans=0.125 2023-06-17 22:07:46,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=56580.0, ans=0.05 2023-06-17 22:07:50,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.40 vs. limit=10.0 2023-06-17 22:07:54,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=56640.0, ans=0.1 2023-06-17 22:08:03,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-17 22:08:07,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=56640.0, ans=0.1 2023-06-17 22:08:11,244 INFO [train.py:996] (1/4) Epoch 1, batch 9450, loss[loss=0.3452, simple_loss=0.3636, pruned_loss=0.1634, over 21302.00 frames. ], tot_loss[loss=0.3969, simple_loss=0.4288, pruned_loss=0.1825, over 4257986.78 frames. ], batch size: 177, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 22:09:43,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=56940.0, ans=0.0 2023-06-17 22:09:50,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-17 22:09:52,848 INFO [train.py:996] (1/4) Epoch 1, batch 9500, loss[loss=0.3327, simple_loss=0.3771, pruned_loss=0.1441, over 21410.00 frames. ], tot_loss[loss=0.3873, simple_loss=0.4178, pruned_loss=0.1783, over 4254046.68 frames. 
], batch size: 211, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:10:35,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.522e+02 3.981e+02 4.935e+02 6.509e+02 1.656e+03, threshold=9.871e+02, percent-clipped=4.0 2023-06-17 22:10:55,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=57120.0, ans=0.025 2023-06-17 22:11:15,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=57180.0, ans=0.125 2023-06-17 22:11:33,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-17 22:11:37,256 INFO [train.py:996] (1/4) Epoch 1, batch 9550, loss[loss=0.4105, simple_loss=0.4577, pruned_loss=0.1817, over 21741.00 frames. ], tot_loss[loss=0.3943, simple_loss=0.4231, pruned_loss=0.1828, over 4253993.23 frames. ], batch size: 247, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:11:57,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=57360.0, ans=0.0 2023-06-17 22:12:31,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=57420.0, ans=0.125 2023-06-17 22:13:19,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-17 22:13:22,028 INFO [train.py:996] (1/4) Epoch 1, batch 9600, loss[loss=0.3989, simple_loss=0.4165, pruned_loss=0.1907, over 21823.00 frames. ], tot_loss[loss=0.3963, simple_loss=0.4255, pruned_loss=0.1835, over 4265507.39 frames. ], batch size: 124, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:13:32,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-17 22:13:46,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=57660.0, ans=0.125 2023-06-17 22:14:08,850 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.156e+02 5.294e+02 7.045e+02 1.358e+03, threshold=1.059e+03, percent-clipped=6.0 2023-06-17 22:14:18,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=57720.0, ans=0.125 2023-06-17 22:15:06,090 INFO [train.py:996] (1/4) Epoch 1, batch 9650, loss[loss=0.4712, simple_loss=0.4815, pruned_loss=0.2305, over 21558.00 frames. ], tot_loss[loss=0.3937, simple_loss=0.4232, pruned_loss=0.1821, over 4265387.39 frames. ], batch size: 414, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:15:18,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57900.0, ans=0.1 2023-06-17 22:15:20,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. 
limit=10.0 2023-06-17 22:15:26,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=57960.0, ans=10.0 2023-06-17 22:16:18,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-17 22:16:28,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58080.0, ans=0.1 2023-06-17 22:16:32,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=58140.0, ans=0.04949747468305833 2023-06-17 22:16:49,790 INFO [train.py:996] (1/4) Epoch 1, batch 9700, loss[loss=0.3319, simple_loss=0.3761, pruned_loss=0.1438, over 21320.00 frames. ], tot_loss[loss=0.3951, simple_loss=0.4261, pruned_loss=0.1821, over 4264969.95 frames. ], batch size: 143, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 22:17:02,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=58200.0, ans=0.5 2023-06-17 22:17:31,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58260.0, ans=0.1 2023-06-17 22:17:37,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.787e+02 4.137e+02 5.402e+02 6.942e+02 1.239e+03, threshold=1.080e+03, percent-clipped=2.0 2023-06-17 22:17:50,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=58320.0, ans=22.5 2023-06-17 22:18:26,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-17 22:18:33,465 INFO [train.py:996] (1/4) Epoch 1, batch 9750, loss[loss=0.3862, simple_loss=0.384, pruned_loss=0.1942, over 21252.00 frames. ], tot_loss[loss=0.3874, simple_loss=0.4167, pruned_loss=0.1791, over 4262734.22 frames. ], batch size: 471, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 22:18:46,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=58500.0, ans=0.125 2023-06-17 22:18:50,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=22.5 2023-06-17 22:19:08,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=58560.0, ans=0.125 2023-06-17 22:19:27,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=58620.0, ans=0.1 2023-06-17 22:19:32,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=58620.0, ans=0.125 2023-06-17 22:20:12,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=58740.0, ans=0.125 2023-06-17 22:20:15,523 INFO [train.py:996] (1/4) Epoch 1, batch 9800, loss[loss=0.3894, simple_loss=0.414, pruned_loss=0.1824, over 21693.00 frames. ], tot_loss[loss=0.3881, simple_loss=0.4172, pruned_loss=0.1795, over 4257592.53 frames. 
], batch size: 112, lr: 3.51e-02, grad_scale: 16.0 2023-06-17 22:20:20,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=58800.0, ans=0.125 2023-06-17 22:20:33,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-17 22:20:44,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=58860.0, ans=0.2 2023-06-17 22:21:03,541 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.740e+02 5.847e+02 8.148e+02 2.070e+03, threshold=1.169e+03, percent-clipped=10.0 2023-06-17 22:21:10,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58920.0, ans=0.125 2023-06-17 22:21:33,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=58980.0, ans=0.2 2023-06-17 22:21:52,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=59040.0, ans=0.1 2023-06-17 22:21:56,344 INFO [train.py:996] (1/4) Epoch 1, batch 9850, loss[loss=0.3349, simple_loss=0.3699, pruned_loss=0.15, over 21824.00 frames. ], tot_loss[loss=0.3865, simple_loss=0.4148, pruned_loss=0.1791, over 4261480.08 frames. ], batch size: 351, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:22:43,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=59220.0, ans=0.2 2023-06-17 22:23:00,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=59280.0, ans=0.2 2023-06-17 22:23:13,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59280.0, ans=0.1 2023-06-17 22:23:20,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=59280.0, ans=0.2 2023-06-17 22:23:35,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-17 22:23:39,142 INFO [train.py:996] (1/4) Epoch 1, batch 9900, loss[loss=0.4052, simple_loss=0.4287, pruned_loss=0.1909, over 21309.00 frames. ], tot_loss[loss=0.3842, simple_loss=0.4108, pruned_loss=0.1787, over 4249909.69 frames. ], batch size: 549, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:23:44,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=59400.0, ans=0.125 2023-06-17 22:23:44,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=59400.0, ans=0.125 2023-06-17 22:24:03,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=59400.0, ans=0.0 2023-06-17 22:24:10,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.14 vs. 
limit=10.0 2023-06-17 22:24:21,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=59460.0, ans=0.125 2023-06-17 22:24:21,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-17 22:24:28,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.114e+02 4.325e+02 5.228e+02 6.725e+02 1.103e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-17 22:24:29,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=59520.0, ans=0.05 2023-06-17 22:24:39,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=59520.0, ans=0.0 2023-06-17 22:25:23,896 INFO [train.py:996] (1/4) Epoch 1, batch 9950, loss[loss=0.3756, simple_loss=0.3933, pruned_loss=0.1789, over 21760.00 frames. ], tot_loss[loss=0.3861, simple_loss=0.4117, pruned_loss=0.1803, over 4252378.80 frames. ], batch size: 102, lr: 3.49e-02, grad_scale: 16.0 2023-06-17 22:25:29,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=59700.0, ans=0.125 2023-06-17 22:25:54,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=59760.0, ans=0.0 2023-06-17 22:26:03,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=59760.0, ans=0.0 2023-06-17 22:26:38,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=59880.0, ans=0.125 2023-06-17 22:26:50,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=59940.0, ans=0.0 2023-06-17 22:27:12,370 INFO [train.py:996] (1/4) Epoch 1, batch 10000, loss[loss=0.3547, simple_loss=0.3834, pruned_loss=0.163, over 21913.00 frames. ], tot_loss[loss=0.3817, simple_loss=0.4075, pruned_loss=0.178, over 4254067.81 frames. ], batch size: 317, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 22:27:19,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=60000.0, ans=0.05 2023-06-17 22:27:46,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60060.0, ans=0.1 2023-06-17 22:27:50,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. 
limit=15.0 2023-06-17 22:28:03,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.788e+02 4.452e+02 5.196e+02 6.727e+02 1.360e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-17 22:28:13,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=60120.0, ans=0.125 2023-06-17 22:28:20,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60180.0, ans=0.1 2023-06-17 22:28:47,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=60240.0, ans=0.035 2023-06-17 22:29:04,514 INFO [train.py:996] (1/4) Epoch 1, batch 10050, loss[loss=0.3578, simple_loss=0.3958, pruned_loss=0.16, over 21202.00 frames. ], tot_loss[loss=0.3841, simple_loss=0.4098, pruned_loss=0.1792, over 4250973.76 frames. ], batch size: 143, lr: 3.48e-02, grad_scale: 32.0 2023-06-17 22:29:11,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=60300.0, ans=0.125 2023-06-17 22:29:34,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-17 22:29:54,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=60420.0, ans=0.0 2023-06-17 22:30:02,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=60420.0, ans=0.125 2023-06-17 22:30:20,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=60480.0, ans=0.1 2023-06-17 22:30:54,296 INFO [train.py:996] (1/4) Epoch 1, batch 10100, loss[loss=0.2647, simple_loss=0.3231, pruned_loss=0.1031, over 21647.00 frames. ], tot_loss[loss=0.3773, simple_loss=0.4054, pruned_loss=0.1746, over 4258022.86 frames. ], batch size: 230, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:31:26,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=60660.0, ans=0.1 2023-06-17 22:31:32,503 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.854e+02 4.188e+02 5.288e+02 6.297e+02 1.348e+03, threshold=1.058e+03, percent-clipped=5.0 2023-06-17 22:31:41,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=60720.0, ans=0.0 2023-06-17 22:31:42,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=60720.0, ans=0.125 2023-06-17 22:32:37,659 INFO [train.py:996] (1/4) Epoch 1, batch 10150, loss[loss=0.3994, simple_loss=0.4302, pruned_loss=0.1842, over 21336.00 frames. ], tot_loss[loss=0.3857, simple_loss=0.4132, pruned_loss=0.1791, over 4269451.72 frames. ], batch size: 131, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:32:38,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.23 vs. 
limit=15.0 2023-06-17 22:32:44,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=60900.0, ans=0.95 2023-06-17 22:32:52,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.25 vs. limit=6.0 2023-06-17 22:33:06,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=60960.0, ans=0.125 2023-06-17 22:33:22,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=61020.0, ans=10.0 2023-06-17 22:33:36,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=61080.0, ans=0.1 2023-06-17 22:33:40,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=61080.0, ans=0.125 2023-06-17 22:34:22,084 INFO [train.py:996] (1/4) Epoch 1, batch 10200, loss[loss=0.3207, simple_loss=0.378, pruned_loss=0.1317, over 21636.00 frames. ], tot_loss[loss=0.3774, simple_loss=0.4084, pruned_loss=0.1732, over 4268884.44 frames. ], batch size: 263, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:35:01,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.806e+02 4.726e+02 6.535e+02 1.145e+03, threshold=9.453e+02, percent-clipped=1.0 2023-06-17 22:36:11,346 INFO [train.py:996] (1/4) Epoch 1, batch 10250, loss[loss=0.239, simple_loss=0.2991, pruned_loss=0.08943, over 21152.00 frames. ], tot_loss[loss=0.3618, simple_loss=0.3996, pruned_loss=0.162, over 4261331.78 frames. ], batch size: 176, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:36:17,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=61500.0, ans=0.1 2023-06-17 22:37:37,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=61740.0, ans=0.5 2023-06-17 22:37:58,243 INFO [train.py:996] (1/4) Epoch 1, batch 10300, loss[loss=0.3576, simple_loss=0.4055, pruned_loss=0.1548, over 21466.00 frames. ], tot_loss[loss=0.3677, simple_loss=0.4053, pruned_loss=0.1651, over 4265396.15 frames. ], batch size: 194, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:38:15,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=61860.0, ans=0.125 2023-06-17 22:38:39,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 4.187e+02 5.738e+02 8.381e+02 2.086e+03, threshold=1.148e+03, percent-clipped=17.0 2023-06-17 22:38:49,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=61920.0, ans=0.0 2023-06-17 22:39:09,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.18 vs. limit=10.0 2023-06-17 22:39:23,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=62040.0, ans=0.1 2023-06-17 22:39:43,534 INFO [train.py:996] (1/4) Epoch 1, batch 10350, loss[loss=0.3336, simple_loss=0.3844, pruned_loss=0.1414, over 21720.00 frames. ], tot_loss[loss=0.367, simple_loss=0.4054, pruned_loss=0.1643, over 4263144.41 frames. 
], batch size: 351, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:39:47,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=62100.0, ans=0.125 2023-06-17 22:40:11,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62160.0, ans=0.0 2023-06-17 22:40:13,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=62160.0, ans=0.2 2023-06-17 22:40:28,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=15.0 2023-06-17 22:40:40,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=62220.0, ans=0.125 2023-06-17 22:40:50,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=62280.0, ans=0.125 2023-06-17 22:41:29,013 INFO [train.py:996] (1/4) Epoch 1, batch 10400, loss[loss=0.3531, simple_loss=0.3768, pruned_loss=0.1648, over 21733.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3951, pruned_loss=0.1596, over 4270415.47 frames. ], batch size: 298, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:41:37,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=62400.0, ans=0.125 2023-06-17 22:41:53,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=62460.0, ans=0.125 2023-06-17 22:42:19,007 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.646e+02 4.942e+02 6.227e+02 1.303e+03, threshold=9.884e+02, percent-clipped=2.0 2023-06-17 22:42:44,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=62580.0, ans=0.2 2023-06-17 22:42:47,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-17 22:43:18,604 INFO [train.py:996] (1/4) Epoch 1, batch 10450, loss[loss=0.358, simple_loss=0.3766, pruned_loss=0.1697, over 21251.00 frames. ], tot_loss[loss=0.3665, simple_loss=0.4012, pruned_loss=0.1659, over 4271265.04 frames. ], batch size: 608, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:43:30,028 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:43:33,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. 
limit=15.0 2023-06-17 22:43:36,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=62760.0, ans=0.0 2023-06-17 22:44:31,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=62880.0, ans=0.0 2023-06-17 22:44:42,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=62940.0, ans=0.0 2023-06-17 22:44:58,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=62940.0, ans=0.1 2023-06-17 22:45:02,369 INFO [train.py:996] (1/4) Epoch 1, batch 10500, loss[loss=0.3563, simple_loss=0.396, pruned_loss=0.1583, over 20708.00 frames. ], tot_loss[loss=0.3665, simple_loss=0.4025, pruned_loss=0.1652, over 4276891.08 frames. ], batch size: 607, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:45:48,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.889e+02 5.007e+02 6.898e+02 1.631e+03, threshold=1.001e+03, percent-clipped=5.0 2023-06-17 22:46:13,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63180.0, ans=0.125 2023-06-17 22:46:30,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=63240.0, ans=0.125 2023-06-17 22:46:45,930 INFO [train.py:996] (1/4) Epoch 1, batch 10550, loss[loss=0.3475, simple_loss=0.3656, pruned_loss=0.1647, over 21186.00 frames. ], tot_loss[loss=0.3642, simple_loss=0.3969, pruned_loss=0.1658, over 4267500.21 frames. ], batch size: 144, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:46:58,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63300.0, ans=0.1 2023-06-17 22:47:13,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=63360.0, ans=0.0 2023-06-17 22:48:12,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=63540.0, ans=0.2 2023-06-17 22:48:29,734 INFO [train.py:996] (1/4) Epoch 1, batch 10600, loss[loss=0.3563, simple_loss=0.4144, pruned_loss=0.1491, over 21619.00 frames. ], tot_loss[loss=0.3595, simple_loss=0.3917, pruned_loss=0.1637, over 4268986.72 frames. ], batch size: 441, lr: 3.42e-02, grad_scale: 32.0 2023-06-17 22:48:55,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=63600.0, ans=0.125 2023-06-17 22:49:10,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=63660.0, ans=0.125 2023-06-17 22:49:22,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.854e+02 4.619e+02 6.310e+02 1.881e+03, threshold=9.238e+02, percent-clipped=9.0 2023-06-17 22:49:35,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=63720.0, ans=0.2 2023-06-17 22:49:50,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-17 22:50:28,335 INFO [train.py:996] (1/4) Epoch 1, batch 10650, loss[loss=0.2721, simple_loss=0.334, pruned_loss=0.1051, over 21844.00 frames. 
], tot_loss[loss=0.3568, simple_loss=0.3916, pruned_loss=0.161, over 4271315.00 frames. ], batch size: 317, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:50:43,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.26 vs. limit=6.0 2023-06-17 22:51:14,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-17 22:51:18,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-17 22:51:24,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=64080.0, ans=0.1 2023-06-17 22:51:43,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=64140.0, ans=0.125 2023-06-17 22:52:14,283 INFO [train.py:996] (1/4) Epoch 1, batch 10700, loss[loss=0.4254, simple_loss=0.4495, pruned_loss=0.2007, over 21431.00 frames. ], tot_loss[loss=0.3577, simple_loss=0.3916, pruned_loss=0.1619, over 4270622.90 frames. ], batch size: 549, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:52:55,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.608e+02 4.131e+02 5.113e+02 6.555e+02 1.006e+03, threshold=1.023e+03, percent-clipped=2.0 2023-06-17 22:53:53,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=64440.0, ans=0.1 2023-06-17 22:53:59,504 INFO [train.py:996] (1/4) Epoch 1, batch 10750, loss[loss=0.4521, simple_loss=0.484, pruned_loss=0.2101, over 21447.00 frames. ], tot_loss[loss=0.3718, simple_loss=0.4047, pruned_loss=0.1695, over 4271355.13 frames. ], batch size: 194, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:54:33,638 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:55:36,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2023-06-17 22:55:37,135 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.96 vs. limit=22.5 2023-06-17 22:55:48,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=6.0 2023-06-17 22:55:49,307 INFO [train.py:996] (1/4) Epoch 1, batch 10800, loss[loss=0.3744, simple_loss=0.4356, pruned_loss=0.1566, over 21310.00 frames. ], tot_loss[loss=0.3753, simple_loss=0.4103, pruned_loss=0.1702, over 4280239.46 frames. 
], batch size: 548, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:55:56,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64800.0, ans=0.1 2023-06-17 22:56:03,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64800.0, ans=0.125 2023-06-17 22:56:06,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=64860.0, ans=0.0 2023-06-17 22:56:11,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=64860.0, ans=0.0 2023-06-17 22:56:30,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.758e+02 4.502e+02 5.308e+02 7.377e+02 1.430e+03, threshold=1.062e+03, percent-clipped=5.0 2023-06-17 22:57:33,923 INFO [train.py:996] (1/4) Epoch 1, batch 10850, loss[loss=0.349, simple_loss=0.3927, pruned_loss=0.1527, over 21859.00 frames. ], tot_loss[loss=0.3747, simple_loss=0.4102, pruned_loss=0.1696, over 4275279.50 frames. ], batch size: 317, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:58:29,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=65220.0, ans=0.125 2023-06-17 22:59:08,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=65340.0, ans=0.0 2023-06-17 22:59:17,687 INFO [train.py:996] (1/4) Epoch 1, batch 10900, loss[loss=0.3148, simple_loss=0.3726, pruned_loss=0.1286, over 21243.00 frames. ], tot_loss[loss=0.3686, simple_loss=0.4033, pruned_loss=0.167, over 4274455.91 frames. ], batch size: 176, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:59:19,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=65400.0, ans=0.1 2023-06-17 22:59:21,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=65400.0, ans=0.0 2023-06-17 22:59:24,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=65400.0, ans=0.125 2023-06-17 22:59:59,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.489e+02 3.764e+02 4.430e+02 5.513e+02 1.224e+03, threshold=8.861e+02, percent-clipped=2.0 2023-06-17 23:01:01,580 INFO [train.py:996] (1/4) Epoch 1, batch 10950, loss[loss=0.3568, simple_loss=0.3821, pruned_loss=0.1658, over 21368.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.3994, pruned_loss=0.1662, over 4265775.02 frames. ], batch size: 144, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:02:33,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=65940.0, ans=0.0 2023-06-17 23:02:44,521 INFO [train.py:996] (1/4) Epoch 1, batch 11000, loss[loss=0.3524, simple_loss=0.3824, pruned_loss=0.1611, over 21453.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.3974, pruned_loss=0.1665, over 4263824.31 frames. 
], batch size: 548, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:03:19,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=66060.0, ans=0.05 2023-06-17 23:03:25,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 4.369e+02 5.427e+02 7.022e+02 1.248e+03, threshold=1.085e+03, percent-clipped=10.0 2023-06-17 23:04:27,713 INFO [train.py:996] (1/4) Epoch 1, batch 11050, loss[loss=0.3435, simple_loss=0.3933, pruned_loss=0.1469, over 20012.00 frames. ], tot_loss[loss=0.3658, simple_loss=0.3962, pruned_loss=0.1677, over 4263216.76 frames. ], batch size: 703, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:04:38,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=66300.0, ans=0.125 2023-06-17 23:04:41,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=66300.0, ans=0.0 2023-06-17 23:05:04,568 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:05:07,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66420.0, ans=0.1 2023-06-17 23:05:43,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=66480.0, ans=0.0 2023-06-17 23:06:01,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=66540.0, ans=0.125 2023-06-17 23:06:11,009 INFO [train.py:996] (1/4) Epoch 1, batch 11100, loss[loss=0.3701, simple_loss=0.3907, pruned_loss=0.1747, over 21684.00 frames. ], tot_loss[loss=0.3648, simple_loss=0.3944, pruned_loss=0.1676, over 4266585.59 frames. ], batch size: 316, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:06:40,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=66660.0, ans=0.125 2023-06-17 23:06:50,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=66720.0, ans=0.09899494936611666 2023-06-17 23:06:58,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 3.924e+02 4.981e+02 6.262e+02 1.185e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-17 23:07:48,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=66840.0, ans=0.0 2023-06-17 23:07:55,919 INFO [train.py:996] (1/4) Epoch 1, batch 11150, loss[loss=0.3419, simple_loss=0.4112, pruned_loss=0.1363, over 21686.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.3906, pruned_loss=0.165, over 4265816.35 frames. 
], batch size: 298, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:08:30,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=66960.0, ans=0.125 2023-06-17 23:08:54,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=67020.0, ans=0.0 2023-06-17 23:09:25,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=67140.0, ans=0.0 2023-06-17 23:09:28,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=67140.0, ans=0.125 2023-06-17 23:09:38,646 INFO [train.py:996] (1/4) Epoch 1, batch 11200, loss[loss=0.3357, simple_loss=0.3514, pruned_loss=0.16, over 21577.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.3878, pruned_loss=0.163, over 4256724.07 frames. ], batch size: 247, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:10:02,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=67260.0, ans=0.2 2023-06-17 23:10:25,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 3.969e+02 4.814e+02 6.139e+02 9.199e+02, threshold=9.628e+02, percent-clipped=0.0 2023-06-17 23:10:47,571 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:11:21,151 INFO [train.py:996] (1/4) Epoch 1, batch 11250, loss[loss=0.3425, simple_loss=0.3825, pruned_loss=0.1513, over 21797.00 frames. ], tot_loss[loss=0.358, simple_loss=0.3885, pruned_loss=0.1638, over 4256506.22 frames. ], batch size: 102, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:11:57,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=67620.0, ans=0.125 2023-06-17 23:13:04,109 INFO [train.py:996] (1/4) Epoch 1, batch 11300, loss[loss=0.3523, simple_loss=0.3881, pruned_loss=0.1583, over 21197.00 frames. ], tot_loss[loss=0.3596, simple_loss=0.3912, pruned_loss=0.164, over 4268050.62 frames. ], batch size: 159, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:13:27,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=67860.0, ans=0.125 2023-06-17 23:13:51,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.668e+02 4.732e+02 6.264e+02 1.219e+03, threshold=9.465e+02, percent-clipped=6.0 2023-06-17 23:14:35,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=68040.0, ans=0.0 2023-06-17 23:14:35,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=68040.0, ans=0.0 2023-06-17 23:14:49,569 INFO [train.py:996] (1/4) Epoch 1, batch 11350, loss[loss=0.3415, simple_loss=0.3661, pruned_loss=0.1584, over 16274.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.3947, pruned_loss=0.1646, over 4270347.92 frames. ], batch size: 61, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:15:24,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-17 23:15:36,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=20.50 vs. 
limit=15.0 2023-06-17 23:15:58,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=68280.0, ans=0.125 2023-06-17 23:16:18,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=68340.0, ans=0.125 2023-06-17 23:16:41,423 INFO [train.py:996] (1/4) Epoch 1, batch 11400, loss[loss=0.3846, simple_loss=0.4364, pruned_loss=0.1665, over 21773.00 frames. ], tot_loss[loss=0.3668, simple_loss=0.4001, pruned_loss=0.1667, over 4270940.13 frames. ], batch size: 333, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:17:28,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.138e+02 5.254e+02 6.973e+02 1.408e+03, threshold=1.051e+03, percent-clipped=10.0 2023-06-17 23:18:27,536 INFO [train.py:996] (1/4) Epoch 1, batch 11450, loss[loss=0.3195, simple_loss=0.376, pruned_loss=0.1315, over 21568.00 frames. ], tot_loss[loss=0.3671, simple_loss=0.4033, pruned_loss=0.1655, over 4266593.13 frames. ], batch size: 230, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:18:50,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=68700.0, ans=0.125 2023-06-17 23:18:52,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=68700.0, ans=0.125 2023-06-17 23:19:01,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-17 23:20:13,495 INFO [train.py:996] (1/4) Epoch 1, batch 11500, loss[loss=0.417, simple_loss=0.4429, pruned_loss=0.1956, over 21595.00 frames. ], tot_loss[loss=0.3712, simple_loss=0.4078, pruned_loss=0.1673, over 4273336.34 frames. ], batch size: 414, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:21:00,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.282e+02 5.552e+02 6.865e+02 1.531e+03, threshold=1.110e+03, percent-clipped=3.0 2023-06-17 23:21:16,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=69120.0, ans=0.125 2023-06-17 23:21:38,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=69180.0, ans=0.125 2023-06-17 23:21:41,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=69240.0, ans=0.125 2023-06-17 23:22:09,453 INFO [train.py:996] (1/4) Epoch 1, batch 11550, loss[loss=0.5179, simple_loss=0.5926, pruned_loss=0.2215, over 21156.00 frames. ], tot_loss[loss=0.3773, simple_loss=0.4172, pruned_loss=0.1687, over 4274501.01 frames. ], batch size: 548, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:22:13,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=69300.0, ans=0.1 2023-06-17 23:23:54,948 INFO [train.py:996] (1/4) Epoch 1, batch 11600, loss[loss=0.3797, simple_loss=0.4527, pruned_loss=0.1534, over 21759.00 frames. ], tot_loss[loss=0.3815, simple_loss=0.426, pruned_loss=0.1685, over 4273062.47 frames. 
], batch size: 351, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:23:58,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=69600.0, ans=0.125 2023-06-17 23:24:22,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=69660.0, ans=0.125 2023-06-17 23:24:27,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-06-17 23:24:39,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.776e+02 4.538e+02 6.004e+02 8.984e+02 1.767e+03, threshold=1.201e+03, percent-clipped=15.0 2023-06-17 23:25:16,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=69840.0, ans=0.125 2023-06-17 23:25:22,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=69840.0, ans=0.125 2023-06-17 23:25:24,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=69840.0, ans=0.025 2023-06-17 23:25:31,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=69900.0, ans=15.0 2023-06-17 23:25:32,105 INFO [train.py:996] (1/4) Epoch 1, batch 11650, loss[loss=0.3746, simple_loss=0.4201, pruned_loss=0.1646, over 21264.00 frames. ], tot_loss[loss=0.3838, simple_loss=0.4305, pruned_loss=0.1685, over 4279963.76 frames. ], batch size: 549, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:25:44,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.29 vs. limit=15.0 2023-06-17 23:27:11,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70140.0, ans=0.1 2023-06-17 23:27:15,723 INFO [train.py:996] (1/4) Epoch 1, batch 11700, loss[loss=0.4547, simple_loss=0.4927, pruned_loss=0.2083, over 19744.00 frames. ], tot_loss[loss=0.3795, simple_loss=0.4221, pruned_loss=0.1684, over 4276361.63 frames. ], batch size: 702, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:28:00,019 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 4.129e+02 5.507e+02 7.167e+02 1.590e+03, threshold=1.101e+03, percent-clipped=1.0 2023-06-17 23:28:00,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=70320.0, ans=0.04949747468305833 2023-06-17 23:28:43,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=70440.0, ans=0.0 2023-06-17 23:28:48,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=70440.0, ans=0.0 2023-06-17 23:28:52,869 INFO [train.py:996] (1/4) Epoch 1, batch 11750, loss[loss=0.4131, simple_loss=0.4062, pruned_loss=0.21, over 21400.00 frames. ], tot_loss[loss=0.3732, simple_loss=0.411, pruned_loss=0.1677, over 4266910.84 frames. ], batch size: 475, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:30:38,192 INFO [train.py:996] (1/4) Epoch 1, batch 11800, loss[loss=0.3227, simple_loss=0.3757, pruned_loss=0.1348, over 20819.00 frames. 
], tot_loss[loss=0.3791, simple_loss=0.4143, pruned_loss=0.172, over 4261809.49 frames. ], batch size: 609, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:30:54,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=70800.0, ans=0.0 2023-06-17 23:31:29,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.727e+02 4.871e+02 6.879e+02 1.447e+03, threshold=9.741e+02, percent-clipped=5.0 2023-06-17 23:31:36,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70920.0, ans=0.125 2023-06-17 23:31:58,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=70980.0, ans=0.1 2023-06-17 23:32:15,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=71040.0, ans=0.1 2023-06-17 23:32:22,081 INFO [train.py:996] (1/4) Epoch 1, batch 11850, loss[loss=0.3492, simple_loss=0.4104, pruned_loss=0.144, over 21841.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4164, pruned_loss=0.1717, over 4268227.99 frames. ], batch size: 371, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:32:23,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=22.5 2023-06-17 23:32:54,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=71160.0, ans=0.0 2023-06-17 23:34:10,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=71400.0, ans=0.04949747468305833 2023-06-17 23:34:12,205 INFO [train.py:996] (1/4) Epoch 1, batch 11900, loss[loss=0.3453, simple_loss=0.3964, pruned_loss=0.1471, over 21739.00 frames. ], tot_loss[loss=0.3749, simple_loss=0.4153, pruned_loss=0.1673, over 4270863.07 frames. ], batch size: 298, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:34:16,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=71400.0, ans=15.0 2023-06-17 23:34:22,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=71400.0, ans=0.125 2023-06-17 23:34:44,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=71460.0, ans=0.0 2023-06-17 23:34:46,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=71460.0, ans=0.125 2023-06-17 23:34:53,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.42 vs. limit=6.0 2023-06-17 23:35:02,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=71520.0, ans=0.0 2023-06-17 23:35:08,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.592e+02 4.787e+02 5.877e+02 1.275e+03, threshold=9.575e+02, percent-clipped=4.0 2023-06-17 23:35:42,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=19.83 vs. 
limit=15.0 2023-06-17 23:35:43,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=71640.0, ans=0.5 2023-06-17 23:35:55,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-17 23:35:56,077 INFO [train.py:996] (1/4) Epoch 1, batch 11950, loss[loss=0.3322, simple_loss=0.395, pruned_loss=0.1347, over 21704.00 frames. ], tot_loss[loss=0.3671, simple_loss=0.4118, pruned_loss=0.1612, over 4270042.66 frames. ], batch size: 351, lr: 3.28e-02, grad_scale: 16.0 2023-06-17 23:37:05,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=71880.0, ans=0.1 2023-06-17 23:37:33,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=71940.0, ans=0.125 2023-06-17 23:37:39,769 INFO [train.py:996] (1/4) Epoch 1, batch 12000, loss[loss=0.3289, simple_loss=0.3372, pruned_loss=0.1603, over 20023.00 frames. ], tot_loss[loss=0.363, simple_loss=0.4047, pruned_loss=0.1606, over 4266627.88 frames. ], batch size: 703, lr: 3.28e-02, grad_scale: 32.0 2023-06-17 23:37:39,770 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-17 23:37:57,338 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3348, simple_loss=0.4196, pruned_loss=0.125, over 1796401.00 frames. 2023-06-17 23:37:57,339 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24375MB 2023-06-17 23:38:14,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=72000.0, ans=0.125 2023-06-17 23:38:33,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=72060.0, ans=0.125 2023-06-17 23:38:48,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=72120.0, ans=0.1 2023-06-17 23:38:50,066 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:38:52,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.693e+02 4.861e+02 6.052e+02 1.192e+03, threshold=9.721e+02, percent-clipped=3.0 2023-06-17 23:39:38,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=72240.0, ans=0.125 2023-06-17 23:39:41,327 INFO [train.py:996] (1/4) Epoch 1, batch 12050, loss[loss=0.4084, simple_loss=0.4185, pruned_loss=0.1991, over 21918.00 frames. ], tot_loss[loss=0.3648, simple_loss=0.4036, pruned_loss=0.163, over 4261209.33 frames. ], batch size: 316, lr: 3.27e-02, grad_scale: 32.0 2023-06-17 23:39:41,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=72300.0, ans=0.125 2023-06-17 23:39:53,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=72300.0, ans=0.125 2023-06-17 23:39:56,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=15.0 2023-06-17 23:40:54,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=72480.0, ans=0.125 2023-06-17 23:41:32,420 INFO [train.py:996] (1/4) Epoch 1, batch 12100, loss[loss=0.4417, simple_loss=0.4614, pruned_loss=0.211, over 21301.00 frames. ], tot_loss[loss=0.38, simple_loss=0.4172, pruned_loss=0.1714, over 4259575.86 frames. ], batch size: 159, lr: 3.27e-02, grad_scale: 16.0 2023-06-17 23:42:26,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.439e+02 6.434e+02 8.417e+02 1.460e+03, threshold=1.287e+03, percent-clipped=16.0 2023-06-17 23:42:43,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=72780.0, ans=0.125 2023-06-17 23:43:17,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=72840.0, ans=0.0 2023-06-17 23:43:17,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72840.0, ans=0.1 2023-06-17 23:43:23,506 INFO [train.py:996] (1/4) Epoch 1, batch 12150, loss[loss=0.3226, simple_loss=0.3663, pruned_loss=0.1394, over 21258.00 frames. ], tot_loss[loss=0.3802, simple_loss=0.4178, pruned_loss=0.1712, over 4254881.86 frames. ], batch size: 176, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:43:50,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.65 vs. limit=22.5 2023-06-17 23:44:11,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-17 23:44:38,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=73080.0, ans=0.125 2023-06-17 23:44:41,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=73080.0, ans=0.0 2023-06-17 23:44:59,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=73200.0, ans=0.125 2023-06-17 23:45:00,496 INFO [train.py:996] (1/4) Epoch 1, batch 12200, loss[loss=0.3236, simple_loss=0.3638, pruned_loss=0.1417, over 21807.00 frames. ], tot_loss[loss=0.3774, simple_loss=0.4141, pruned_loss=0.1704, over 4247963.91 frames. ], batch size: 112, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:45:33,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-17 23:45:33,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.35 vs. 
limit=15.0 2023-06-17 23:45:45,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.853e+02 4.664e+02 5.869e+02 1.070e+03, threshold=9.327e+02, percent-clipped=0.0 2023-06-17 23:45:46,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=73320.0, ans=0.2 2023-06-17 23:46:23,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=73380.0, ans=0.025 2023-06-17 23:46:23,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=73380.0, ans=0.125 2023-06-17 23:46:42,495 INFO [train.py:996] (1/4) Epoch 1, batch 12250, loss[loss=0.2422, simple_loss=0.2977, pruned_loss=0.0934, over 21734.00 frames. ], tot_loss[loss=0.3635, simple_loss=0.4013, pruned_loss=0.1628, over 4242649.37 frames. ], batch size: 124, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:46:51,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=73500.0, ans=0.125 2023-06-17 23:47:00,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=73500.0, ans=0.125 2023-06-17 23:47:35,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73680.0, ans=0.1 2023-06-17 23:48:25,658 INFO [train.py:996] (1/4) Epoch 1, batch 12300, loss[loss=0.2454, simple_loss=0.3064, pruned_loss=0.0922, over 21528.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3932, pruned_loss=0.1544, over 4242897.73 frames. ], batch size: 195, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:49:12,338 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.740e+02 4.870e+02 6.587e+02 1.091e+03, threshold=9.740e+02, percent-clipped=4.0 2023-06-17 23:50:08,209 INFO [train.py:996] (1/4) Epoch 1, batch 12350, loss[loss=0.416, simple_loss=0.4387, pruned_loss=0.1966, over 21757.00 frames. ], tot_loss[loss=0.3508, simple_loss=0.3952, pruned_loss=0.1532, over 4249778.16 frames. ], batch size: 389, lr: 3.24e-02, grad_scale: 16.0 2023-06-17 23:50:10,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74100.0, ans=0.1 2023-06-17 23:50:43,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=74220.0, ans=0.0 2023-06-17 23:50:48,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74220.0, ans=0.125 2023-06-17 23:51:02,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74280.0, ans=0.1 2023-06-17 23:51:05,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-17 23:51:49,147 INFO [train.py:996] (1/4) Epoch 1, batch 12400, loss[loss=0.4046, simple_loss=0.425, pruned_loss=0.1921, over 21700.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.399, pruned_loss=0.159, over 4263971.96 frames. 
], batch size: 389, lr: 3.24e-02, grad_scale: 32.0 2023-06-17 23:51:53,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=74400.0, ans=0.0 2023-06-17 23:52:34,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.026e+02 5.096e+02 6.661e+02 1.103e+03, threshold=1.019e+03, percent-clipped=2.0 2023-06-17 23:53:26,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=74640.0, ans=0.125 2023-06-17 23:53:28,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=74640.0, ans=0.0 2023-06-17 23:53:31,316 INFO [train.py:996] (1/4) Epoch 1, batch 12450, loss[loss=0.3683, simple_loss=0.4016, pruned_loss=0.1675, over 21321.00 frames. ], tot_loss[loss=0.3652, simple_loss=0.4038, pruned_loss=0.1633, over 4264723.70 frames. ], batch size: 159, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:53:51,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74760.0, ans=0.125 2023-06-17 23:53:53,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=74760.0, ans=0.04949747468305833 2023-06-17 23:54:03,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=74760.0, ans=0.1 2023-06-17 23:54:55,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=74940.0, ans=0.125 2023-06-17 23:55:16,022 INFO [train.py:996] (1/4) Epoch 1, batch 12500, loss[loss=0.4582, simple_loss=0.5043, pruned_loss=0.206, over 21922.00 frames. ], tot_loss[loss=0.3786, simple_loss=0.4169, pruned_loss=0.1702, over 4271601.54 frames. ], batch size: 372, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:55:19,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=75000.0, ans=0.2 2023-06-17 23:56:14,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.070e+02 4.603e+02 5.505e+02 7.191e+02 1.270e+03, threshold=1.101e+03, percent-clipped=4.0 2023-06-17 23:56:14,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=75120.0, ans=0.125 2023-06-17 23:56:43,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=75180.0, ans=0.0 2023-06-17 23:57:02,625 INFO [train.py:996] (1/4) Epoch 1, batch 12550, loss[loss=0.4199, simple_loss=0.4577, pruned_loss=0.1911, over 21639.00 frames. ], tot_loss[loss=0.3864, simple_loss=0.4247, pruned_loss=0.1741, over 4276541.25 frames. ], batch size: 389, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:57:06,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=75300.0, ans=0.0 2023-06-17 23:57:22,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=75360.0, ans=0.2 2023-06-17 23:57:40,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. 
limit=10.0 2023-06-17 23:57:41,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=75360.0, ans=0.1 2023-06-17 23:57:53,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-17 23:58:13,101 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:58:44,727 INFO [train.py:996] (1/4) Epoch 1, batch 12600, loss[loss=0.338, simple_loss=0.3927, pruned_loss=0.1416, over 21811.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.4209, pruned_loss=0.17, over 4265677.50 frames. ], batch size: 333, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:59:09,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=75660.0, ans=0.125 2023-06-17 23:59:41,391 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 3.697e+02 4.571e+02 5.714e+02 1.241e+03, threshold=9.141e+02, percent-clipped=1.0 2023-06-18 00:00:21,815 INFO [train.py:996] (1/4) Epoch 1, batch 12650, loss[loss=0.4107, simple_loss=0.4324, pruned_loss=0.1945, over 21892.00 frames. ], tot_loss[loss=0.3659, simple_loss=0.4085, pruned_loss=0.1616, over 4263308.67 frames. ], batch size: 118, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:00:39,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-18 00:01:01,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75960.0, ans=0.0 2023-06-18 00:01:01,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=75960.0, ans=0.0 2023-06-18 00:01:40,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=76140.0, ans=0.125 2023-06-18 00:02:03,320 INFO [train.py:996] (1/4) Epoch 1, batch 12700, loss[loss=0.3672, simple_loss=0.396, pruned_loss=0.1692, over 21584.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.4091, pruned_loss=0.1662, over 4269202.54 frames. ], batch size: 263, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:02:29,138 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:02:54,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.032e+02 5.068e+02 7.064e+02 1.461e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-18 00:03:08,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=76380.0, ans=0.125 2023-06-18 00:03:40,085 INFO [train.py:996] (1/4) Epoch 1, batch 12750, loss[loss=0.3441, simple_loss=0.391, pruned_loss=0.1486, over 21792.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.4122, pruned_loss=0.168, over 4279533.48 frames. 
], batch size: 282, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:05:01,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76740.0, ans=0.1 2023-06-18 00:05:02,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=76740.0, ans=0.025 2023-06-18 00:05:19,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=76740.0, ans=0.125 2023-06-18 00:05:32,631 INFO [train.py:996] (1/4) Epoch 1, batch 12800, loss[loss=0.3799, simple_loss=0.4221, pruned_loss=0.1689, over 21217.00 frames. ], tot_loss[loss=0.3752, simple_loss=0.4121, pruned_loss=0.1692, over 4272390.92 frames. ], batch size: 143, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:06:07,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=76860.0, ans=0.0 2023-06-18 00:06:10,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=76920.0, ans=0.2 2023-06-18 00:06:20,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.969e+02 4.961e+02 6.426e+02 1.503e+03, threshold=9.923e+02, percent-clipped=9.0 2023-06-18 00:06:49,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=22.5 2023-06-18 00:07:09,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=77040.0, ans=0.125 2023-06-18 00:07:12,460 INFO [train.py:996] (1/4) Epoch 1, batch 12850, loss[loss=0.3344, simple_loss=0.3762, pruned_loss=0.1463, over 21328.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.417, pruned_loss=0.1726, over 4274242.12 frames. ], batch size: 159, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:07:13,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.31 vs. limit=22.5 2023-06-18 00:07:28,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=77100.0, ans=0.0 2023-06-18 00:07:59,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-06-18 00:08:21,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=77280.0, ans=0.125 2023-06-18 00:08:47,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=77340.0, ans=0.0 2023-06-18 00:09:00,983 INFO [train.py:996] (1/4) Epoch 1, batch 12900, loss[loss=0.3437, simple_loss=0.3992, pruned_loss=0.1441, over 21772.00 frames. ], tot_loss[loss=0.3719, simple_loss=0.4127, pruned_loss=0.1656, over 4270280.81 frames. ], batch size: 316, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:09:02,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-18 00:09:09,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=77400.0, ans=0.125 2023-06-18 00:09:19,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=77460.0, ans=0.09899494936611666 2023-06-18 00:09:46,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.853e+02 4.882e+02 6.013e+02 9.581e+02, threshold=9.764e+02, percent-clipped=0.0 2023-06-18 00:10:43,800 INFO [train.py:996] (1/4) Epoch 1, batch 12950, loss[loss=0.3214, simple_loss=0.3715, pruned_loss=0.1356, over 21621.00 frames. ], tot_loss[loss=0.3664, simple_loss=0.4087, pruned_loss=0.1621, over 4269976.58 frames. ], batch size: 263, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:10:44,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=77700.0, ans=0.0 2023-06-18 00:10:47,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=77700.0, ans=0.2 2023-06-18 00:11:04,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-18 00:11:23,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. limit=6.0 2023-06-18 00:12:07,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=77880.0, ans=0.125 2023-06-18 00:12:17,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=77940.0, ans=0.2 2023-06-18 00:12:27,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=78000.0, ans=0.07 2023-06-18 00:12:28,705 INFO [train.py:996] (1/4) Epoch 1, batch 13000, loss[loss=0.515, simple_loss=0.5974, pruned_loss=0.2163, over 19676.00 frames. ], tot_loss[loss=0.3719, simple_loss=0.4129, pruned_loss=0.1654, over 4261465.45 frames. ], batch size: 703, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:12:30,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=78000.0, ans=0.125 2023-06-18 00:12:44,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=78060.0, ans=0.05 2023-06-18 00:13:03,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=78060.0, ans=0.2 2023-06-18 00:13:20,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.234e+02 5.570e+02 6.916e+02 1.204e+03, threshold=1.114e+03, percent-clipped=4.0 2023-06-18 00:13:42,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=78180.0, ans=0.0 2023-06-18 00:14:08,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=78300.0, ans=0.05 2023-06-18 00:14:09,919 INFO [train.py:996] (1/4) Epoch 1, batch 13050, loss[loss=0.3747, simple_loss=0.4066, pruned_loss=0.1714, over 21913.00 frames. ], tot_loss[loss=0.365, simple_loss=0.4071, pruned_loss=0.1615, over 4263372.89 frames. 
], batch size: 124, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:14:13,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=78300.0, ans=0.1 2023-06-18 00:14:27,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=78360.0, ans=0.2 2023-06-18 00:15:31,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-18 00:15:55,015 INFO [train.py:996] (1/4) Epoch 1, batch 13100, loss[loss=0.4119, simple_loss=0.4434, pruned_loss=0.1902, over 21588.00 frames. ], tot_loss[loss=0.3675, simple_loss=0.41, pruned_loss=0.1625, over 4270666.89 frames. ], batch size: 414, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:16:14,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=78600.0, ans=0.05 2023-06-18 00:16:34,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-18 00:16:34,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=78660.0, ans=0.125 2023-06-18 00:16:38,185 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:16:41,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78720.0, ans=0.125 2023-06-18 00:16:54,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.696e+02 5.724e+02 7.991e+02 1.405e+03, threshold=1.145e+03, percent-clipped=4.0 2023-06-18 00:16:57,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=22.5 2023-06-18 00:17:34,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=78840.0, ans=0.2 2023-06-18 00:17:34,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=78840.0, ans=0.0 2023-06-18 00:17:45,899 INFO [train.py:996] (1/4) Epoch 1, batch 13150, loss[loss=0.4516, simple_loss=0.4531, pruned_loss=0.2251, over 21646.00 frames. ], tot_loss[loss=0.3731, simple_loss=0.4127, pruned_loss=0.1667, over 4266304.56 frames. ], batch size: 441, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:18:37,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=79020.0, ans=0.015 2023-06-18 00:19:03,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=79080.0, ans=0.0 2023-06-18 00:19:18,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=79140.0, ans=0.125 2023-06-18 00:19:30,324 INFO [train.py:996] (1/4) Epoch 1, batch 13200, loss[loss=0.3688, simple_loss=0.4151, pruned_loss=0.1613, over 21301.00 frames. ], tot_loss[loss=0.3706, simple_loss=0.4101, pruned_loss=0.1655, over 4269794.80 frames. 
], batch size: 549, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:20:27,587 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.628e+02 3.872e+02 4.776e+02 6.394e+02 8.489e+02, threshold=9.552e+02, percent-clipped=0.0 2023-06-18 00:20:32,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79380.0, ans=0.125 2023-06-18 00:20:43,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=79380.0, ans=0.2 2023-06-18 00:20:50,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.75 vs. limit=22.5 2023-06-18 00:21:02,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.92 vs. limit=10.0 2023-06-18 00:21:10,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=79440.0, ans=0.0 2023-06-18 00:21:18,096 INFO [train.py:996] (1/4) Epoch 1, batch 13250, loss[loss=0.381, simple_loss=0.3987, pruned_loss=0.1816, over 21809.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4092, pruned_loss=0.1663, over 4262296.50 frames. ], batch size: 298, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:21:20,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=79500.0, ans=0.0 2023-06-18 00:21:45,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79560.0, ans=0.1 2023-06-18 00:22:01,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.26 vs. limit=15.0 2023-06-18 00:22:07,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=79620.0, ans=0.2 2023-06-18 00:22:14,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=79620.0, ans=0.5 2023-06-18 00:22:22,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=79680.0, ans=0.125 2023-06-18 00:23:04,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=79740.0, ans=0.125 2023-06-18 00:23:07,121 INFO [train.py:996] (1/4) Epoch 1, batch 13300, loss[loss=0.3457, simple_loss=0.3908, pruned_loss=0.1503, over 21444.00 frames. ], tot_loss[loss=0.374, simple_loss=0.4133, pruned_loss=0.1674, over 4264509.11 frames. 
], batch size: 194, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:23:25,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=79800.0, ans=0.1 2023-06-18 00:23:42,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=79860.0, ans=0.0 2023-06-18 00:23:55,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.809e+02 3.934e+02 5.014e+02 6.811e+02 1.186e+03, threshold=1.003e+03, percent-clipped=5.0 2023-06-18 00:24:14,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79980.0, ans=0.1 2023-06-18 00:24:21,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=79980.0, ans=0.125 2023-06-18 00:24:38,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=80040.0, ans=0.0 2023-06-18 00:24:51,639 INFO [train.py:996] (1/4) Epoch 1, batch 13350, loss[loss=0.432, simple_loss=0.4605, pruned_loss=0.2018, over 21805.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.417, pruned_loss=0.1707, over 4269636.40 frames. ], batch size: 441, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:25:07,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=80100.0, ans=0.2 2023-06-18 00:25:43,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=80220.0, ans=0.0 2023-06-18 00:25:51,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=80280.0, ans=0.125 2023-06-18 00:26:08,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=80280.0, ans=0.0 2023-06-18 00:26:40,419 INFO [train.py:996] (1/4) Epoch 1, batch 13400, loss[loss=0.4168, simple_loss=0.4403, pruned_loss=0.1966, over 21406.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.419, pruned_loss=0.1716, over 4265917.49 frames. ], batch size: 176, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:26:58,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80460.0, ans=0.125 2023-06-18 00:27:27,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.979e+02 4.393e+02 5.548e+02 7.060e+02 1.249e+03, threshold=1.110e+03, percent-clipped=4.0 2023-06-18 00:27:52,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-18 00:28:23,654 INFO [train.py:996] (1/4) Epoch 1, batch 13450, loss[loss=0.455, simple_loss=0.4602, pruned_loss=0.2249, over 21430.00 frames. ], tot_loss[loss=0.3876, simple_loss=0.4219, pruned_loss=0.1767, over 4268406.00 frames. 
], batch size: 509, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:28:23,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=80700.0, ans=0.0 2023-06-18 00:28:34,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80700.0, ans=0.125 2023-06-18 00:29:13,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-18 00:29:34,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=80880.0, ans=0.125 2023-06-18 00:29:46,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80940.0, ans=0.1 2023-06-18 00:30:08,360 INFO [train.py:996] (1/4) Epoch 1, batch 13500, loss[loss=0.3328, simple_loss=0.377, pruned_loss=0.1443, over 21786.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.408, pruned_loss=0.1698, over 4259981.61 frames. ], batch size: 333, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:30:12,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=81000.0, ans=0.1 2023-06-18 00:30:53,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=81120.0, ans=0.035 2023-06-18 00:31:07,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.002e+02 4.680e+02 6.090e+02 1.151e+03, threshold=9.360e+02, percent-clipped=1.0 2023-06-18 00:31:19,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=81180.0, ans=0.07 2023-06-18 00:31:43,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-18 00:31:44,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=81240.0, ans=0.125 2023-06-18 00:31:52,220 INFO [train.py:996] (1/4) Epoch 1, batch 13550, loss[loss=0.352, simple_loss=0.3901, pruned_loss=0.157, over 21372.00 frames. ], tot_loss[loss=0.3766, simple_loss=0.4136, pruned_loss=0.1698, over 4267276.91 frames. ], batch size: 144, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:32:04,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=81300.0, ans=0.0 2023-06-18 00:32:07,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=81300.0, ans=0.125 2023-06-18 00:32:55,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81480.0, ans=0.1 2023-06-18 00:33:01,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=81480.0, ans=0.125 2023-06-18 00:33:34,648 INFO [train.py:996] (1/4) Epoch 1, batch 13600, loss[loss=0.4085, simple_loss=0.446, pruned_loss=0.1855, over 21576.00 frames. ], tot_loss[loss=0.3787, simple_loss=0.4161, pruned_loss=0.1707, over 4271492.38 frames. 
], batch size: 471, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:34:02,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=81660.0, ans=0.07 2023-06-18 00:34:27,094 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 4.484e+02 6.125e+02 7.575e+02 1.688e+03, threshold=1.225e+03, percent-clipped=13.0 2023-06-18 00:34:48,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.65 vs. limit=15.0 2023-06-18 00:34:49,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=81780.0, ans=0.1 2023-06-18 00:34:59,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=81840.0, ans=0.2 2023-06-18 00:35:11,050 INFO [train.py:996] (1/4) Epoch 1, batch 13650, loss[loss=0.33, simple_loss=0.3549, pruned_loss=0.1526, over 21512.00 frames. ], tot_loss[loss=0.3697, simple_loss=0.4089, pruned_loss=0.1653, over 4278736.22 frames. ], batch size: 230, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:35:32,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=81960.0, ans=0.125 2023-06-18 00:35:58,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82020.0, ans=0.1 2023-06-18 00:36:33,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82080.0, ans=0.125 2023-06-18 00:36:59,440 INFO [train.py:996] (1/4) Epoch 1, batch 13700, loss[loss=0.2751, simple_loss=0.3046, pruned_loss=0.1228, over 21789.00 frames. ], tot_loss[loss=0.3633, simple_loss=0.4003, pruned_loss=0.1631, over 4269060.20 frames. ], batch size: 124, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:37:10,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=82200.0, ans=0.125 2023-06-18 00:37:10,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=82200.0, ans=0.125 2023-06-18 00:37:34,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-18 00:37:53,349 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.850e+02 5.196e+02 6.756e+02 1.127e+03, threshold=1.039e+03, percent-clipped=0.0 2023-06-18 00:38:07,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82380.0, ans=0.1 2023-06-18 00:38:39,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=82440.0, ans=0.125 2023-06-18 00:38:43,653 INFO [train.py:996] (1/4) Epoch 1, batch 13750, loss[loss=0.3129, simple_loss=0.3675, pruned_loss=0.1291, over 21699.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.3963, pruned_loss=0.1603, over 4269559.03 frames. 
], batch size: 298, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:39:15,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=82560.0, ans=0.0 2023-06-18 00:39:51,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=82680.0, ans=0.125 2023-06-18 00:40:40,335 INFO [train.py:996] (1/4) Epoch 1, batch 13800, loss[loss=0.4295, simple_loss=0.4346, pruned_loss=0.2123, over 20180.00 frames. ], tot_loss[loss=0.3622, simple_loss=0.4033, pruned_loss=0.1605, over 4264723.00 frames. ], batch size: 703, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:41:11,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-18 00:41:13,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.20 vs. limit=6.0 2023-06-18 00:41:28,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 3.900e+02 5.256e+02 6.721e+02 1.169e+03, threshold=1.051e+03, percent-clipped=1.0 2023-06-18 00:42:08,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=83040.0, ans=0.125 2023-06-18 00:42:22,642 INFO [train.py:996] (1/4) Epoch 1, batch 13850, loss[loss=0.361, simple_loss=0.4116, pruned_loss=0.1552, over 21003.00 frames. ], tot_loss[loss=0.3661, simple_loss=0.409, pruned_loss=0.1616, over 4263895.76 frames. ], batch size: 607, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:42:24,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83100.0, ans=0.125 2023-06-18 00:43:19,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-18 00:43:25,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=83280.0, ans=0.125 2023-06-18 00:44:06,028 INFO [train.py:996] (1/4) Epoch 1, batch 13900, loss[loss=0.3795, simple_loss=0.4048, pruned_loss=0.1771, over 21918.00 frames. ], tot_loss[loss=0.3759, simple_loss=0.4156, pruned_loss=0.1681, over 4269186.56 frames. ], batch size: 316, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:44:20,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=83400.0, ans=0.0 2023-06-18 00:44:29,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=83460.0, ans=0.125 2023-06-18 00:44:58,290 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 4.127e+02 5.100e+02 6.768e+02 1.105e+03, threshold=1.020e+03, percent-clipped=2.0 2023-06-18 00:45:12,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-18 00:45:24,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. 
limit=10.0 2023-06-18 00:45:42,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-18 00:45:43,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=83640.0, ans=0.0 2023-06-18 00:45:48,143 INFO [train.py:996] (1/4) Epoch 1, batch 13950, loss[loss=0.4539, simple_loss=0.4661, pruned_loss=0.2208, over 21860.00 frames. ], tot_loss[loss=0.3812, simple_loss=0.4177, pruned_loss=0.1723, over 4276228.07 frames. ], batch size: 414, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:45:49,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-18 00:46:34,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=83820.0, ans=0.05 2023-06-18 00:46:44,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83820.0, ans=0.125 2023-06-18 00:46:55,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 00:47:05,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=83880.0, ans=0.1 2023-06-18 00:47:30,505 INFO [train.py:996] (1/4) Epoch 1, batch 14000, loss[loss=0.3021, simple_loss=0.3615, pruned_loss=0.1213, over 21467.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4093, pruned_loss=0.1658, over 4270865.20 frames. ], batch size: 194, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:47:34,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-18 00:47:41,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=84000.0, ans=0.125 2023-06-18 00:47:41,358 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.548e-03 2023-06-18 00:47:56,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=84060.0, ans=0.035 2023-06-18 00:48:28,672 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.708e+02 4.933e+02 6.099e+02 9.890e+02, threshold=9.866e+02, percent-clipped=0.0 2023-06-18 00:48:29,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=84120.0, ans=0.2 2023-06-18 00:48:43,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=84180.0, ans=0.125 2023-06-18 00:49:12,116 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:49:18,906 INFO [train.py:996] (1/4) Epoch 1, batch 14050, loss[loss=0.3665, simple_loss=0.3785, pruned_loss=0.1772, over 21283.00 frames. ], tot_loss[loss=0.3605, simple_loss=0.4036, pruned_loss=0.1588, over 4268911.04 frames. 
], batch size: 471, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:49:19,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-18 00:49:25,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=84300.0, ans=0.125 2023-06-18 00:49:42,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=84360.0, ans=0.0 2023-06-18 00:50:00,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=84420.0, ans=0.125 2023-06-18 00:50:48,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84540.0, ans=0.1 2023-06-18 00:50:52,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=84540.0, ans=0.125 2023-06-18 00:51:01,821 INFO [train.py:996] (1/4) Epoch 1, batch 14100, loss[loss=0.429, simple_loss=0.4504, pruned_loss=0.2038, over 20679.00 frames. ], tot_loss[loss=0.3593, simple_loss=0.3987, pruned_loss=0.1599, over 4269589.16 frames. ], batch size: 607, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:51:05,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=84600.0, ans=0.0 2023-06-18 00:51:54,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 4.143e+02 4.965e+02 6.574e+02 1.166e+03, threshold=9.930e+02, percent-clipped=2.0 2023-06-18 00:52:31,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=84840.0, ans=0.1 2023-06-18 00:52:37,613 INFO [train.py:996] (1/4) Epoch 1, batch 14150, loss[loss=0.3993, simple_loss=0.431, pruned_loss=0.1838, over 21232.00 frames. ], tot_loss[loss=0.3613, simple_loss=0.4009, pruned_loss=0.1609, over 4271107.27 frames. ], batch size: 143, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:52:40,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.98 vs. limit=6.0 2023-06-18 00:52:53,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=84900.0, ans=0.125 2023-06-18 00:53:04,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=84960.0, ans=0.125 2023-06-18 00:53:26,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85020.0, ans=0.125 2023-06-18 00:53:28,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.04 vs. limit=10.0 2023-06-18 00:53:59,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85080.0, ans=0.1 2023-06-18 00:54:18,328 INFO [train.py:996] (1/4) Epoch 1, batch 14200, loss[loss=0.3884, simple_loss=0.4065, pruned_loss=0.1851, over 21527.00 frames. ], tot_loss[loss=0.357, simple_loss=0.3973, pruned_loss=0.1583, over 4271286.36 frames. 
], batch size: 389, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:54:21,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=85200.0, ans=0.1 2023-06-18 00:55:09,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 4.023e+02 4.862e+02 6.439e+02 1.166e+03, threshold=9.724e+02, percent-clipped=3.0 2023-06-18 00:55:12,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=85320.0, ans=0.0 2023-06-18 00:55:32,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0 2023-06-18 00:55:59,280 INFO [train.py:996] (1/4) Epoch 1, batch 14250, loss[loss=0.3299, simple_loss=0.3611, pruned_loss=0.1494, over 21717.00 frames. ], tot_loss[loss=0.353, simple_loss=0.3914, pruned_loss=0.1573, over 4256707.05 frames. ], batch size: 112, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:56:52,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=85620.0, ans=0.125 2023-06-18 00:57:02,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=85620.0, ans=0.0 2023-06-18 00:57:07,641 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:57:36,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=85740.0, ans=0.2 2023-06-18 00:57:43,759 INFO [train.py:996] (1/4) Epoch 1, batch 14300, loss[loss=0.3293, simple_loss=0.3898, pruned_loss=0.1344, over 21694.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.3916, pruned_loss=0.1557, over 4237271.93 frames. ], batch size: 298, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:58:11,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-18 00:58:19,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=85860.0, ans=0.035 2023-06-18 00:58:38,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.854e+02 5.533e+02 8.207e+02 1.409e+03, threshold=1.107e+03, percent-clipped=13.0 2023-06-18 00:59:26,773 INFO [train.py:996] (1/4) Epoch 1, batch 14350, loss[loss=0.3809, simple_loss=0.4091, pruned_loss=0.1763, over 21901.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.4014, pruned_loss=0.1597, over 4244008.89 frames. ], batch size: 118, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 00:59:28,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=86100.0, ans=0.125 2023-06-18 00:59:30,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86100.0, ans=0.1 2023-06-18 00:59:50,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-18 01:00:03,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=86220.0, ans=0.125 2023-06-18 01:00:56,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=86340.0, ans=0.125 2023-06-18 01:01:00,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=86340.0, ans=0.09899494936611666 2023-06-18 01:01:08,517 INFO [train.py:996] (1/4) Epoch 1, batch 14400, loss[loss=0.3268, simple_loss=0.3615, pruned_loss=0.146, over 21398.00 frames. ], tot_loss[loss=0.3603, simple_loss=0.3998, pruned_loss=0.1604, over 4257045.82 frames. ], batch size: 131, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 01:01:15,483 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:01:24,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=86400.0, ans=10.0 2023-06-18 01:01:27,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=86400.0, ans=0.1 2023-06-18 01:02:08,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.871e+02 4.703e+02 5.738e+02 1.217e+03, threshold=9.407e+02, percent-clipped=2.0 2023-06-18 01:02:08,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=86520.0, ans=0.125 2023-06-18 01:02:22,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=86580.0, ans=0.125 2023-06-18 01:02:22,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=86580.0, ans=0.0 2023-06-18 01:02:24,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86580.0, ans=0.125 2023-06-18 01:02:42,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=86640.0, ans=0.125 2023-06-18 01:02:50,398 INFO [train.py:996] (1/4) Epoch 1, batch 14450, loss[loss=0.3578, simple_loss=0.3889, pruned_loss=0.1634, over 21761.00 frames. ], tot_loss[loss=0.3578, simple_loss=0.3941, pruned_loss=0.1607, over 4263376.00 frames. ], batch size: 112, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:03:25,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=22.5 2023-06-18 01:03:56,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=86880.0, ans=0.025 2023-06-18 01:04:33,707 INFO [train.py:996] (1/4) Epoch 1, batch 14500, loss[loss=0.3733, simple_loss=0.4153, pruned_loss=0.1657, over 21640.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3906, pruned_loss=0.1589, over 4261709.23 frames. 
], batch size: 414, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:05:09,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=87060.0, ans=0.2 2023-06-18 01:05:31,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=87120.0, ans=0.125 2023-06-18 01:05:36,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.944e+02 5.039e+02 7.491e+02 1.788e+03, threshold=1.008e+03, percent-clipped=13.0 2023-06-18 01:06:06,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=87240.0, ans=0.125 2023-06-18 01:06:13,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=87240.0, ans=0.125 2023-06-18 01:06:16,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=87300.0, ans=0.0 2023-06-18 01:06:17,592 INFO [train.py:996] (1/4) Epoch 1, batch 14550, loss[loss=0.4019, simple_loss=0.4356, pruned_loss=0.1841, over 21866.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.3991, pruned_loss=0.163, over 4266973.55 frames. ], batch size: 371, lr: 3.05e-02, grad_scale: 16.0 2023-06-18 01:07:20,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=87420.0, ans=0.125 2023-06-18 01:08:01,694 INFO [train.py:996] (1/4) Epoch 1, batch 14600, loss[loss=0.3794, simple_loss=0.4347, pruned_loss=0.1621, over 21764.00 frames. ], tot_loss[loss=0.372, simple_loss=0.4073, pruned_loss=0.1683, over 4268571.06 frames. ], batch size: 332, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:09:02,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 4.017e+02 5.030e+02 6.430e+02 1.157e+03, threshold=1.006e+03, percent-clipped=2.0 2023-06-18 01:09:05,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-18 01:09:24,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=87780.0, ans=0.0 2023-06-18 01:09:31,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=87840.0, ans=0.0 2023-06-18 01:09:37,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=87840.0, ans=0.125 2023-06-18 01:09:43,514 INFO [train.py:996] (1/4) Epoch 1, batch 14650, loss[loss=0.2976, simple_loss=0.3199, pruned_loss=0.1377, over 20752.00 frames. ], tot_loss[loss=0.3665, simple_loss=0.4045, pruned_loss=0.1643, over 4252224.84 frames. 
], batch size: 608, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:10:24,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88020.0, ans=0.1 2023-06-18 01:10:50,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=88080.0, ans=0.0 2023-06-18 01:11:09,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88140.0, ans=0.1 2023-06-18 01:11:30,874 INFO [train.py:996] (1/4) Epoch 1, batch 14700, loss[loss=0.262, simple_loss=0.3382, pruned_loss=0.09296, over 21686.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3941, pruned_loss=0.1542, over 4244443.83 frames. ], batch size: 298, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:12:21,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=88320.0, ans=22.5 2023-06-18 01:12:22,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.06 vs. limit=22.5 2023-06-18 01:12:32,567 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 3.580e+02 4.552e+02 5.267e+02 1.016e+03, threshold=9.103e+02, percent-clipped=1.0 2023-06-18 01:12:54,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=88440.0, ans=0.125 2023-06-18 01:13:14,648 INFO [train.py:996] (1/4) Epoch 1, batch 14750, loss[loss=0.371, simple_loss=0.4093, pruned_loss=0.1664, over 21615.00 frames. ], tot_loss[loss=0.3597, simple_loss=0.4023, pruned_loss=0.1585, over 4255191.52 frames. ], batch size: 263, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:13:16,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=88500.0, ans=0.04949747468305833 2023-06-18 01:13:54,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=88560.0, ans=0.125 2023-06-18 01:14:50,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=88740.0, ans=0.125 2023-06-18 01:14:59,893 INFO [train.py:996] (1/4) Epoch 1, batch 14800, loss[loss=0.3583, simple_loss=0.3832, pruned_loss=0.1667, over 21492.00 frames. ], tot_loss[loss=0.3754, simple_loss=0.416, pruned_loss=0.1674, over 4257162.97 frames. 
], batch size: 230, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 01:15:05,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=88800.0, ans=0.125 2023-06-18 01:15:40,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=88860.0, ans=0.1 2023-06-18 01:15:53,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=88920.0, ans=0.2 2023-06-18 01:15:56,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=88920.0, ans=0.1 2023-06-18 01:16:02,064 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 4.489e+02 5.229e+02 7.110e+02 1.407e+03, threshold=1.046e+03, percent-clipped=11.0 2023-06-18 01:16:45,479 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:16:55,415 INFO [train.py:996] (1/4) Epoch 1, batch 14850, loss[loss=0.3569, simple_loss=0.3955, pruned_loss=0.1592, over 21868.00 frames. ], tot_loss[loss=0.372, simple_loss=0.4099, pruned_loss=0.1671, over 4258810.78 frames. ], batch size: 317, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:16:55,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=89100.0, ans=0.125 2023-06-18 01:18:00,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=89280.0, ans=0.125 2023-06-18 01:18:41,314 INFO [train.py:996] (1/4) Epoch 1, batch 14900, loss[loss=0.3751, simple_loss=0.4222, pruned_loss=0.164, over 21396.00 frames. ], tot_loss[loss=0.3747, simple_loss=0.412, pruned_loss=0.1687, over 4258476.59 frames. ], batch size: 131, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:19:19,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=89520.0, ans=0.1 2023-06-18 01:19:34,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.060e+02 5.286e+02 6.323e+02 1.154e+03, threshold=1.057e+03, percent-clipped=2.0 2023-06-18 01:19:46,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=89580.0, ans=0.125 2023-06-18 01:19:55,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=89580.0, ans=0.125 2023-06-18 01:19:57,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=89640.0, ans=0.0 2023-06-18 01:20:17,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-18 01:20:21,531 INFO [train.py:996] (1/4) Epoch 1, batch 14950, loss[loss=0.3084, simple_loss=0.3331, pruned_loss=0.1419, over 20752.00 frames. ], tot_loss[loss=0.3761, simple_loss=0.4135, pruned_loss=0.1694, over 4255367.43 frames. 
], batch size: 607, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:21:19,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=89880.0, ans=0.125 2023-06-18 01:21:51,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=89940.0, ans=0.125 2023-06-18 01:22:04,792 INFO [train.py:996] (1/4) Epoch 1, batch 15000, loss[loss=0.3685, simple_loss=0.4032, pruned_loss=0.1669, over 21655.00 frames. ], tot_loss[loss=0.3798, simple_loss=0.4167, pruned_loss=0.1715, over 4262388.58 frames. ], batch size: 263, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:22:04,792 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 01:22:23,153 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3215, simple_loss=0.4085, pruned_loss=0.1173, over 1796401.00 frames. 2023-06-18 01:22:23,153 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24375MB 2023-06-18 01:22:45,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=90060.0, ans=0.0 2023-06-18 01:23:25,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.992e+02 4.836e+02 5.829e+02 8.010e+02, threshold=9.672e+02, percent-clipped=0.0 2023-06-18 01:23:25,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=90120.0, ans=0.05 2023-06-18 01:23:33,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-18 01:23:39,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=90180.0, ans=0.2 2023-06-18 01:23:42,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=90180.0, ans=0.125 2023-06-18 01:24:12,192 INFO [train.py:996] (1/4) Epoch 1, batch 15050, loss[loss=0.3153, simple_loss=0.3462, pruned_loss=0.1422, over 21352.00 frames. ], tot_loss[loss=0.3759, simple_loss=0.4121, pruned_loss=0.1699, over 4253726.76 frames. ], batch size: 131, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:24:17,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-18 01:24:26,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-18 01:24:27,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=90360.0, ans=0.0 2023-06-18 01:25:25,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=90480.0, ans=0.125 2023-06-18 01:25:55,088 INFO [train.py:996] (1/4) Epoch 1, batch 15100, loss[loss=0.3966, simple_loss=0.428, pruned_loss=0.1826, over 21322.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.4167, pruned_loss=0.1709, over 4260726.47 frames. 
], batch size: 548, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:26:18,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=90660.0, ans=0.0 2023-06-18 01:26:50,646 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 4.036e+02 5.408e+02 6.449e+02 1.241e+03, threshold=1.082e+03, percent-clipped=5.0 2023-06-18 01:26:56,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-18 01:27:32,769 INFO [train.py:996] (1/4) Epoch 1, batch 15150, loss[loss=0.3573, simple_loss=0.388, pruned_loss=0.1633, over 21827.00 frames. ], tot_loss[loss=0.3782, simple_loss=0.413, pruned_loss=0.1717, over 4268361.88 frames. ], batch size: 107, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:27:38,010 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:27:41,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=90900.0, ans=0.125 2023-06-18 01:28:11,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=90960.0, ans=0.0 2023-06-18 01:28:36,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=91080.0, ans=0.2 2023-06-18 01:28:40,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=91080.0, ans=0.125 2023-06-18 01:28:56,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=91080.0, ans=0.125 2023-06-18 01:29:15,452 INFO [train.py:996] (1/4) Epoch 1, batch 15200, loss[loss=0.2828, simple_loss=0.3345, pruned_loss=0.1155, over 21148.00 frames. ], tot_loss[loss=0.366, simple_loss=0.4025, pruned_loss=0.1648, over 4265356.92 frames. ], batch size: 159, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:30:11,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=91320.0, ans=0.125 2023-06-18 01:30:15,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.952e+02 4.981e+02 6.119e+02 1.167e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-18 01:30:20,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=91380.0, ans=0.0 2023-06-18 01:30:37,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.90 vs. limit=22.5 2023-06-18 01:30:56,063 INFO [train.py:996] (1/4) Epoch 1, batch 15250, loss[loss=0.4127, simple_loss=0.4369, pruned_loss=0.1942, over 21411.00 frames. ], tot_loss[loss=0.3582, simple_loss=0.3937, pruned_loss=0.1614, over 4267915.43 frames. ], batch size: 131, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:31:45,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=91620.0, ans=0.0 2023-06-18 01:32:27,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=91740.0, ans=0.2 2023-06-18 01:32:38,854 INFO [train.py:996] (1/4) Epoch 1, batch 15300, loss[loss=0.3939, simple_loss=0.4182, pruned_loss=0.1848, over 21605.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.3994, pruned_loss=0.1676, over 4271018.34 frames. 
], batch size: 263, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:32:39,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=91800.0, ans=0.125 2023-06-18 01:33:46,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 4.254e+02 5.015e+02 5.905e+02 1.167e+03, threshold=1.003e+03, percent-clipped=1.0 2023-06-18 01:34:10,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=92040.0, ans=0.125 2023-06-18 01:34:28,055 INFO [train.py:996] (1/4) Epoch 1, batch 15350, loss[loss=0.4114, simple_loss=0.4625, pruned_loss=0.1802, over 21666.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4061, pruned_loss=0.1709, over 4271904.73 frames. ], batch size: 414, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:35:29,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-18 01:35:31,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-18 01:35:36,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=92280.0, ans=0.0 2023-06-18 01:36:01,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=92340.0, ans=0.2 2023-06-18 01:36:04,743 INFO [train.py:996] (1/4) Epoch 1, batch 15400, loss[loss=0.3194, simple_loss=0.3688, pruned_loss=0.135, over 21815.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4084, pruned_loss=0.1683, over 4266201.17 frames. ], batch size: 282, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:36:44,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0 2023-06-18 01:37:10,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.998e+02 4.934e+02 5.907e+02 9.449e+02, threshold=9.868e+02, percent-clipped=0.0 2023-06-18 01:37:27,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-18 01:37:29,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=92640.0, ans=0.125 2023-06-18 01:37:46,495 INFO [train.py:996] (1/4) Epoch 1, batch 15450, loss[loss=0.3766, simple_loss=0.4091, pruned_loss=0.172, over 21751.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4065, pruned_loss=0.1677, over 4273236.40 frames. ], batch size: 389, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:38:27,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=15.0 2023-06-18 01:38:53,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. 
limit=6.0 2023-06-18 01:39:07,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=92880.0, ans=0.5 2023-06-18 01:39:16,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=22.5 2023-06-18 01:39:35,840 INFO [train.py:996] (1/4) Epoch 1, batch 15500, loss[loss=0.4464, simple_loss=0.4619, pruned_loss=0.2154, over 21717.00 frames. ], tot_loss[loss=0.371, simple_loss=0.4089, pruned_loss=0.1666, over 4274744.59 frames. ], batch size: 351, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:40:39,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-18 01:40:39,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.584e+02 4.777e+02 6.158e+02 1.272e+03, threshold=9.553e+02, percent-clipped=7.0 2023-06-18 01:41:10,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93240.0, ans=0.1 2023-06-18 01:41:30,349 INFO [train.py:996] (1/4) Epoch 1, batch 15550, loss[loss=0.2987, simple_loss=0.3642, pruned_loss=0.1166, over 21634.00 frames. ], tot_loss[loss=0.3664, simple_loss=0.4078, pruned_loss=0.1625, over 4265743.73 frames. ], batch size: 263, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 01:41:32,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=93300.0, ans=0.125 2023-06-18 01:41:34,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=93300.0, ans=0.125 2023-06-18 01:41:47,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=93360.0, ans=10.0 2023-06-18 01:41:59,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=93360.0, ans=0.2 2023-06-18 01:42:11,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=93420.0, ans=0.0 2023-06-18 01:42:17,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93420.0, ans=0.1 2023-06-18 01:42:33,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-18 01:42:39,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=93480.0, ans=0.0 2023-06-18 01:43:13,881 INFO [train.py:996] (1/4) Epoch 1, batch 15600, loss[loss=0.5293, simple_loss=0.58, pruned_loss=0.2393, over 19787.00 frames. ], tot_loss[loss=0.363, simple_loss=0.4021, pruned_loss=0.1619, over 4260715.17 frames. 
], batch size: 702, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:43:17,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=93600.0, ans=0.1 2023-06-18 01:43:23,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=93600.0, ans=0.1 2023-06-18 01:44:03,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93720.0, ans=0.125 2023-06-18 01:44:06,391 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.769e+02 4.800e+02 6.009e+02 1.224e+03, threshold=9.599e+02, percent-clipped=5.0 2023-06-18 01:44:56,274 INFO [train.py:996] (1/4) Epoch 1, batch 15650, loss[loss=0.3702, simple_loss=0.3859, pruned_loss=0.1773, over 21489.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4013, pruned_loss=0.1619, over 4261773.16 frames. ], batch size: 441, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:45:21,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=93960.0, ans=0.5 2023-06-18 01:45:25,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=93960.0, ans=0.125 2023-06-18 01:46:06,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=94080.0, ans=0.125 2023-06-18 01:46:28,868 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:46:38,851 INFO [train.py:996] (1/4) Epoch 1, batch 15700, loss[loss=0.3657, simple_loss=0.4031, pruned_loss=0.1642, over 20736.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.395, pruned_loss=0.1601, over 4266593.01 frames. ], batch size: 608, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:46:42,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=94200.0, ans=0.125 2023-06-18 01:46:47,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=94200.0, ans=0.0 2023-06-18 01:47:31,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.759e+02 5.241e+02 6.627e+02 1.144e+03, threshold=1.048e+03, percent-clipped=4.0 2023-06-18 01:47:32,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=94380.0, ans=0.1 2023-06-18 01:48:21,242 INFO [train.py:996] (1/4) Epoch 1, batch 15750, loss[loss=0.3314, simple_loss=0.3544, pruned_loss=0.1543, over 21517.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3892, pruned_loss=0.1582, over 4269410.52 frames. ], batch size: 230, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:48:23,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=94500.0, ans=0.035 2023-06-18 01:49:05,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94620.0, ans=0.1 2023-06-18 01:50:03,620 INFO [train.py:996] (1/4) Epoch 1, batch 15800, loss[loss=0.3774, simple_loss=0.3988, pruned_loss=0.178, over 21181.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.3825, pruned_loss=0.1562, over 4264858.98 frames. 
], batch size: 159, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:50:21,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.96 vs. limit=6.0 2023-06-18 01:50:52,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=94920.0, ans=0.0 2023-06-18 01:51:07,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.489e+02 4.310e+02 5.461e+02 1.002e+03, threshold=8.621e+02, percent-clipped=0.0 2023-06-18 01:51:14,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=94980.0, ans=0.125 2023-06-18 01:51:47,045 INFO [train.py:996] (1/4) Epoch 1, batch 15850, loss[loss=0.3617, simple_loss=0.3925, pruned_loss=0.1654, over 21363.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.3855, pruned_loss=0.16, over 4271187.93 frames. ], batch size: 549, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:53:07,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-18 01:53:10,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=95340.0, ans=10.0 2023-06-18 01:53:25,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=95340.0, ans=0.07 2023-06-18 01:53:31,113 INFO [train.py:996] (1/4) Epoch 1, batch 15900, loss[loss=0.3997, simple_loss=0.4306, pruned_loss=0.1843, over 21858.00 frames. ], tot_loss[loss=0.3534, simple_loss=0.3848, pruned_loss=0.161, over 4270894.86 frames. ], batch size: 372, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:54:00,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=95460.0, ans=15.0 2023-06-18 01:54:04,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=95460.0, ans=0.125 2023-06-18 01:54:28,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 4.278e+02 5.211e+02 7.153e+02 1.346e+03, threshold=1.042e+03, percent-clipped=13.0 2023-06-18 01:55:07,512 INFO [train.py:996] (1/4) Epoch 1, batch 15950, loss[loss=0.3454, simple_loss=0.3943, pruned_loss=0.1483, over 21763.00 frames. ], tot_loss[loss=0.3518, simple_loss=0.3865, pruned_loss=0.1585, over 4254005.50 frames. ], batch size: 441, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:55:55,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=95820.0, ans=0.04949747468305833 2023-06-18 01:56:48,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=95940.0, ans=0.125 2023-06-18 01:56:54,419 INFO [train.py:996] (1/4) Epoch 1, batch 16000, loss[loss=0.276, simple_loss=0.3349, pruned_loss=0.1085, over 16362.00 frames. ], tot_loss[loss=0.3466, simple_loss=0.386, pruned_loss=0.1536, over 4249933.22 frames. 
], batch size: 63, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:57:06,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=96000.0, ans=0.125 2023-06-18 01:57:10,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-18 01:57:49,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=96120.0, ans=0.125 2023-06-18 01:57:52,170 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 3.630e+02 4.249e+02 5.497e+02 1.232e+03, threshold=8.498e+02, percent-clipped=2.0 2023-06-18 01:58:13,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=96240.0, ans=0.125 2023-06-18 01:58:35,934 INFO [train.py:996] (1/4) Epoch 1, batch 16050, loss[loss=0.3337, simple_loss=0.3978, pruned_loss=0.1348, over 21404.00 frames. ], tot_loss[loss=0.3443, simple_loss=0.3881, pruned_loss=0.1502, over 4263429.49 frames. ], batch size: 194, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:59:18,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=96420.0, ans=0.0 2023-06-18 01:59:50,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=96480.0, ans=0.125 2023-06-18 01:59:51,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=96480.0, ans=0.0 2023-06-18 01:59:59,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=96540.0, ans=0.0 2023-06-18 02:00:17,304 INFO [train.py:996] (1/4) Epoch 1, batch 16100, loss[loss=0.4074, simple_loss=0.4249, pruned_loss=0.195, over 21784.00 frames. ], tot_loss[loss=0.3557, simple_loss=0.3979, pruned_loss=0.1568, over 4271799.69 frames. ], batch size: 441, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:01:13,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=96720.0, ans=0.0 2023-06-18 02:01:14,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.670e+02 4.719e+02 5.843e+02 1.104e+03, threshold=9.438e+02, percent-clipped=5.0 2023-06-18 02:01:59,173 INFO [train.py:996] (1/4) Epoch 1, batch 16150, loss[loss=0.3808, simple_loss=0.4309, pruned_loss=0.1653, over 17782.00 frames. ], tot_loss[loss=0.3599, simple_loss=0.4001, pruned_loss=0.1599, over 4273303.01 frames. ], batch size: 60, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:01:59,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=96900.0, ans=0.125 2023-06-18 02:02:58,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=97020.0, ans=0.125 2023-06-18 02:03:21,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=97080.0, ans=0.125 2023-06-18 02:03:45,784 INFO [train.py:996] (1/4) Epoch 1, batch 16200, loss[loss=0.3424, simple_loss=0.3732, pruned_loss=0.1558, over 21663.00 frames. ], tot_loss[loss=0.3642, simple_loss=0.4041, pruned_loss=0.1621, over 4282497.05 frames. 
], batch size: 263, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:04:13,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=97260.0, ans=0.125 2023-06-18 02:04:40,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=97320.0, ans=0.125 2023-06-18 02:04:49,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 4.045e+02 5.128e+02 6.271e+02 1.195e+03, threshold=1.026e+03, percent-clipped=3.0 2023-06-18 02:04:52,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=97380.0, ans=0.0 2023-06-18 02:04:59,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=97380.0, ans=0.2 2023-06-18 02:05:34,460 INFO [train.py:996] (1/4) Epoch 1, batch 16250, loss[loss=0.2529, simple_loss=0.3131, pruned_loss=0.09634, over 21600.00 frames. ], tot_loss[loss=0.3607, simple_loss=0.4011, pruned_loss=0.1601, over 4273667.63 frames. ], batch size: 230, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:06:56,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=97740.0, ans=0.125 2023-06-18 02:07:23,956 INFO [train.py:996] (1/4) Epoch 1, batch 16300, loss[loss=0.2994, simple_loss=0.3607, pruned_loss=0.1191, over 21780.00 frames. ], tot_loss[loss=0.3501, simple_loss=0.3927, pruned_loss=0.1538, over 4273100.69 frames. ], batch size: 316, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:07:37,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=97800.0, ans=0.125 2023-06-18 02:07:39,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-18 02:08:17,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 3.435e+02 4.309e+02 5.263e+02 1.274e+03, threshold=8.618e+02, percent-clipped=4.0 2023-06-18 02:09:08,878 INFO [train.py:996] (1/4) Epoch 1, batch 16350, loss[loss=0.3912, simple_loss=0.4063, pruned_loss=0.188, over 20039.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.395, pruned_loss=0.1569, over 4272604.43 frames. ], batch size: 702, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:10:51,183 INFO [train.py:996] (1/4) Epoch 1, batch 16400, loss[loss=0.3297, simple_loss=0.3684, pruned_loss=0.1455, over 21475.00 frames. ], tot_loss[loss=0.3574, simple_loss=0.3975, pruned_loss=0.1586, over 4278167.70 frames. ], batch size: 194, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:10:58,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=98400.0, ans=0.1 2023-06-18 02:11:12,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. 
limit=15.0 2023-06-18 02:11:13,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=98460.0, ans=0.0 2023-06-18 02:11:45,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=98520.0, ans=0.0 2023-06-18 02:11:46,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=98520.0, ans=0.0 2023-06-18 02:11:47,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 3.743e+02 5.331e+02 6.637e+02 1.239e+03, threshold=1.066e+03, percent-clipped=10.0 2023-06-18 02:12:08,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=98640.0, ans=0.125 2023-06-18 02:12:33,613 INFO [train.py:996] (1/4) Epoch 1, batch 16450, loss[loss=0.3824, simple_loss=0.4138, pruned_loss=0.1755, over 21895.00 frames. ], tot_loss[loss=0.3567, simple_loss=0.3961, pruned_loss=0.1586, over 4286516.24 frames. ], batch size: 414, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:12:50,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=98760.0, ans=0.125 2023-06-18 02:12:54,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=98760.0, ans=0.0 2023-06-18 02:12:59,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=98760.0, ans=0.04949747468305833 2023-06-18 02:13:31,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-18 02:13:40,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98880.0, ans=0.1 2023-06-18 02:14:17,006 INFO [train.py:996] (1/4) Epoch 1, batch 16500, loss[loss=0.244, simple_loss=0.2769, pruned_loss=0.1056, over 21107.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3944, pruned_loss=0.158, over 4281869.98 frames. ], batch size: 143, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:14:46,270 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:14:47,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=99060.0, ans=0.1 2023-06-18 02:14:52,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=99060.0, ans=0.0 2023-06-18 02:15:16,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.027e+02 4.822e+02 5.863e+02 1.078e+03, threshold=9.645e+02, percent-clipped=1.0 2023-06-18 02:15:16,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=99180.0, ans=0.0 2023-06-18 02:15:52,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=99240.0, ans=0.0 2023-06-18 02:15:55,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.73 vs. 
limit=15.0 2023-06-18 02:16:01,035 INFO [train.py:996] (1/4) Epoch 1, batch 16550, loss[loss=0.3636, simple_loss=0.4243, pruned_loss=0.1514, over 21221.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3889, pruned_loss=0.1523, over 4268959.55 frames. ], batch size: 548, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:16:38,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=99360.0, ans=0.2 2023-06-18 02:16:43,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=99420.0, ans=0.125 2023-06-18 02:16:50,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=99420.0, ans=0.2 2023-06-18 02:17:42,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=99540.0, ans=0.0 2023-06-18 02:17:45,744 INFO [train.py:996] (1/4) Epoch 1, batch 16600, loss[loss=0.5604, simple_loss=0.5778, pruned_loss=0.2715, over 21450.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.4024, pruned_loss=0.1592, over 4276798.61 frames. ], batch size: 507, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:18:52,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=99720.0, ans=0.0 2023-06-18 02:18:55,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.830e+02 4.483e+02 5.513e+02 7.810e+02 1.353e+03, threshold=1.103e+03, percent-clipped=10.0 2023-06-18 02:19:04,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=99780.0, ans=0.125 2023-06-18 02:19:29,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=99840.0, ans=0.2 2023-06-18 02:19:40,963 INFO [train.py:996] (1/4) Epoch 1, batch 16650, loss[loss=0.3574, simple_loss=0.4134, pruned_loss=0.1507, over 20732.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.413, pruned_loss=0.1619, over 4280645.00 frames. ], batch size: 607, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:19:45,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=99900.0, ans=0.2 2023-06-18 02:20:17,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=99960.0, ans=0.0 2023-06-18 02:21:28,207 INFO [train.py:996] (1/4) Epoch 1, batch 16700, loss[loss=0.3815, simple_loss=0.4131, pruned_loss=0.1749, over 20008.00 frames. ], tot_loss[loss=0.3684, simple_loss=0.4129, pruned_loss=0.1619, over 4274731.97 frames. ], batch size: 702, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:21:40,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=100200.0, ans=0.125 2023-06-18 02:22:34,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 4.065e+02 5.046e+02 6.706e+02 1.129e+03, threshold=1.009e+03, percent-clipped=1.0 2023-06-18 02:23:24,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=100440.0, ans=0.2 2023-06-18 02:23:27,851 INFO [train.py:996] (1/4) Epoch 1, batch 16750, loss[loss=0.3712, simple_loss=0.409, pruned_loss=0.1667, over 21382.00 frames. ], tot_loss[loss=0.3743, simple_loss=0.4172, pruned_loss=0.1657, over 4270023.56 frames. 
], batch size: 176, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:23:28,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=100500.0, ans=0.025 2023-06-18 02:23:35,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=100500.0, ans=0.09899494936611666 2023-06-18 02:23:40,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=100500.0, ans=0.0 2023-06-18 02:23:42,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=100500.0, ans=0.125 2023-06-18 02:24:04,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=100560.0, ans=0.0 2023-06-18 02:24:19,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=100620.0, ans=0.1 2023-06-18 02:24:19,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=100620.0, ans=0.0 2023-06-18 02:24:36,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100680.0, ans=0.1 2023-06-18 02:25:09,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.57 vs. limit=6.0 2023-06-18 02:25:10,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=100800.0, ans=0.0 2023-06-18 02:25:11,907 INFO [train.py:996] (1/4) Epoch 1, batch 16800, loss[loss=0.3369, simple_loss=0.3765, pruned_loss=0.1486, over 21803.00 frames. ], tot_loss[loss=0.3759, simple_loss=0.4197, pruned_loss=0.166, over 4261061.52 frames. ], batch size: 112, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:25:44,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=100860.0, ans=0.125 2023-06-18 02:26:08,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.332e+02 5.464e+02 7.061e+02 1.204e+03, threshold=1.093e+03, percent-clipped=8.0 2023-06-18 02:26:40,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=101040.0, ans=0.125 2023-06-18 02:26:54,572 INFO [train.py:996] (1/4) Epoch 1, batch 16850, loss[loss=0.4261, simple_loss=0.4256, pruned_loss=0.2133, over 21811.00 frames. ], tot_loss[loss=0.376, simple_loss=0.4177, pruned_loss=0.1671, over 4269169.62 frames. ], batch size: 508, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:27:38,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-18 02:27:38,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-18 02:27:38,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=101220.0, ans=15.0 2023-06-18 02:27:52,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.66 vs. 
limit=15.0 2023-06-18 02:28:13,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=101280.0, ans=0.09899494936611666 2023-06-18 02:28:37,451 INFO [train.py:996] (1/4) Epoch 1, batch 16900, loss[loss=0.3486, simple_loss=0.374, pruned_loss=0.1616, over 19862.00 frames. ], tot_loss[loss=0.3679, simple_loss=0.4093, pruned_loss=0.1632, over 4265004.00 frames. ], batch size: 704, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:29:08,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=101460.0, ans=0.07 2023-06-18 02:29:29,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=101520.0, ans=0.0 2023-06-18 02:29:39,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.564e+02 4.278e+02 5.386e+02 1.254e+03, threshold=8.556e+02, percent-clipped=1.0 2023-06-18 02:30:14,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=101640.0, ans=0.04949747468305833 2023-06-18 02:30:19,202 INFO [train.py:996] (1/4) Epoch 1, batch 16950, loss[loss=0.3299, simple_loss=0.3719, pruned_loss=0.1439, over 21881.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4024, pruned_loss=0.1612, over 4265558.73 frames. ], batch size: 351, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:30:19,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=101700.0, ans=0.0 2023-06-18 02:31:35,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-18 02:32:00,686 INFO [train.py:996] (1/4) Epoch 1, batch 17000, loss[loss=0.3976, simple_loss=0.4098, pruned_loss=0.1927, over 21626.00 frames. ], tot_loss[loss=0.3613, simple_loss=0.3986, pruned_loss=0.162, over 4274480.01 frames. ], batch size: 548, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:32:57,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.34 vs. limit=10.0 2023-06-18 02:33:09,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.976e+02 5.576e+02 8.538e+02 1.340e+03, threshold=1.115e+03, percent-clipped=23.0 2023-06-18 02:33:15,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=102180.0, ans=0.125 2023-06-18 02:33:43,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=102300.0, ans=0.0 2023-06-18 02:33:44,511 INFO [train.py:996] (1/4) Epoch 1, batch 17050, loss[loss=0.4332, simple_loss=0.5025, pruned_loss=0.1819, over 21211.00 frames. ], tot_loss[loss=0.3696, simple_loss=0.4076, pruned_loss=0.1658, over 4278001.13 frames. 
], batch size: 548, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:34:09,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=102360.0, ans=0.125 2023-06-18 02:35:06,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=102480.0, ans=0.125 2023-06-18 02:35:09,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=102540.0, ans=0.2 2023-06-18 02:35:23,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=102540.0, ans=0.2 2023-06-18 02:35:26,137 INFO [train.py:996] (1/4) Epoch 1, batch 17100, loss[loss=0.3515, simple_loss=0.3876, pruned_loss=0.1577, over 21857.00 frames. ], tot_loss[loss=0.3693, simple_loss=0.4067, pruned_loss=0.166, over 4287619.56 frames. ], batch size: 118, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:36:28,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.900e+02 4.752e+02 6.955e+02 1.664e+03, threshold=9.503e+02, percent-clipped=6.0 2023-06-18 02:36:32,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=102780.0, ans=0.0 2023-06-18 02:36:43,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102780.0, ans=0.1 2023-06-18 02:36:49,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=102840.0, ans=15.0 2023-06-18 02:36:57,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.85 vs. limit=22.5 2023-06-18 02:36:58,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=102840.0, ans=0.2 2023-06-18 02:37:07,579 INFO [train.py:996] (1/4) Epoch 1, batch 17150, loss[loss=0.2914, simple_loss=0.3495, pruned_loss=0.1167, over 21827.00 frames. ], tot_loss[loss=0.3653, simple_loss=0.4014, pruned_loss=0.1646, over 4292223.96 frames. ], batch size: 282, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:37:38,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102960.0, ans=0.1 2023-06-18 02:37:43,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=102960.0, ans=0.125 2023-06-18 02:37:58,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=103020.0, ans=0.125 2023-06-18 02:37:59,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-18 02:38:52,543 INFO [train.py:996] (1/4) Epoch 1, batch 17200, loss[loss=0.3515, simple_loss=0.3961, pruned_loss=0.1534, over 21712.00 frames. ], tot_loss[loss=0.365, simple_loss=0.4013, pruned_loss=0.1644, over 4291552.00 frames. 
], batch size: 332, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:38:53,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=103200.0, ans=0.125 2023-06-18 02:39:01,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=103200.0, ans=0.2 2023-06-18 02:39:14,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-18 02:39:45,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.944e+02 4.838e+02 6.225e+02 9.968e+02, threshold=9.676e+02, percent-clipped=1.0 2023-06-18 02:40:00,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=103380.0, ans=0.125 2023-06-18 02:40:31,372 INFO [train.py:996] (1/4) Epoch 1, batch 17250, loss[loss=0.5207, simple_loss=0.519, pruned_loss=0.2612, over 21325.00 frames. ], tot_loss[loss=0.3708, simple_loss=0.4063, pruned_loss=0.1677, over 4292758.60 frames. ], batch size: 507, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:41:04,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-18 02:41:44,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=12.0 2023-06-18 02:42:06,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-06-18 02:42:10,759 INFO [train.py:996] (1/4) Epoch 1, batch 17300, loss[loss=0.4643, simple_loss=0.4764, pruned_loss=0.2261, over 21414.00 frames. ], tot_loss[loss=0.3805, simple_loss=0.4158, pruned_loss=0.1726, over 4294640.06 frames. ], batch size: 471, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:42:37,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=103860.0, ans=0.0 2023-06-18 02:42:40,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=103860.0, ans=0.125 2023-06-18 02:43:02,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=103920.0, ans=0.0 2023-06-18 02:43:06,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=103920.0, ans=0.125 2023-06-18 02:43:17,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.798e+02 4.894e+02 6.344e+02 1.044e+03, threshold=9.789e+02, percent-clipped=2.0 2023-06-18 02:44:02,183 INFO [train.py:996] (1/4) Epoch 1, batch 17350, loss[loss=0.4902, simple_loss=0.5139, pruned_loss=0.2332, over 21523.00 frames. ], tot_loss[loss=0.381, simple_loss=0.4174, pruned_loss=0.1723, over 4289841.37 frames. 
], batch size: 508, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:45:10,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=104280.0, ans=0.125 2023-06-18 02:45:14,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=104280.0, ans=0.0 2023-06-18 02:45:45,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=104400.0, ans=0.125 2023-06-18 02:45:47,344 INFO [train.py:996] (1/4) Epoch 1, batch 17400, loss[loss=0.2829, simple_loss=0.3239, pruned_loss=0.1209, over 21261.00 frames. ], tot_loss[loss=0.3701, simple_loss=0.409, pruned_loss=0.1656, over 4277103.72 frames. ], batch size: 176, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:45:51,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=104400.0, ans=0.125 2023-06-18 02:46:21,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=104460.0, ans=0.0 2023-06-18 02:46:26,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=104460.0, ans=0.125 2023-06-18 02:46:34,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=104520.0, ans=0.125 2023-06-18 02:46:34,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=104520.0, ans=0.125 2023-06-18 02:46:47,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.665e+02 5.087e+02 7.278e+02 1.204e+03, threshold=1.017e+03, percent-clipped=4.0 2023-06-18 02:47:01,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=104580.0, ans=0.2 2023-06-18 02:47:30,810 INFO [train.py:996] (1/4) Epoch 1, batch 17450, loss[loss=0.2507, simple_loss=0.3147, pruned_loss=0.09334, over 21388.00 frames. ], tot_loss[loss=0.3606, simple_loss=0.4029, pruned_loss=0.1592, over 4272776.94 frames. ], batch size: 194, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:47:34,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=104700.0, ans=0.0 2023-06-18 02:48:17,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=104820.0, ans=0.125 2023-06-18 02:48:39,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=104880.0, ans=0.0 2023-06-18 02:49:07,403 INFO [train.py:996] (1/4) Epoch 1, batch 17500, loss[loss=0.4367, simple_loss=0.442, pruned_loss=0.2157, over 21824.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3972, pruned_loss=0.1559, over 4268260.17 frames. ], batch size: 441, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:49:38,121 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:49:51,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. 
limit=22.5 2023-06-18 02:50:12,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 3.016e+02 3.973e+02 5.519e+02 1.327e+03, threshold=7.947e+02, percent-clipped=4.0 2023-06-18 02:50:49,637 INFO [train.py:996] (1/4) Epoch 1, batch 17550, loss[loss=0.2726, simple_loss=0.3511, pruned_loss=0.09703, over 21832.00 frames. ], tot_loss[loss=0.3505, simple_loss=0.3957, pruned_loss=0.1527, over 4264489.62 frames. ], batch size: 107, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:51:15,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-18 02:52:13,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=105540.0, ans=0.125 2023-06-18 02:52:32,327 INFO [train.py:996] (1/4) Epoch 1, batch 17600, loss[loss=0.3073, simple_loss=0.3744, pruned_loss=0.1201, over 21591.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.397, pruned_loss=0.1517, over 4253755.60 frames. ], batch size: 112, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 02:52:38,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-18 02:52:41,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2023-06-18 02:52:44,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=105600.0, ans=0.125 2023-06-18 02:52:52,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=105660.0, ans=0.2 2023-06-18 02:53:05,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=105660.0, ans=0.125 2023-06-18 02:53:06,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.44 vs. limit=6.0 2023-06-18 02:53:32,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=105720.0, ans=0.0 2023-06-18 02:53:35,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=105780.0, ans=0.0 2023-06-18 02:53:36,618 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.619e+02 4.865e+02 6.950e+02 1.496e+03, threshold=9.730e+02, percent-clipped=22.0 2023-06-18 02:54:19,852 INFO [train.py:996] (1/4) Epoch 1, batch 17650, loss[loss=0.2636, simple_loss=0.3227, pruned_loss=0.1022, over 21728.00 frames. ], tot_loss[loss=0.3505, simple_loss=0.3954, pruned_loss=0.1528, over 4255400.98 frames. 
], batch size: 332, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:54:28,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105900.0, ans=0.1 2023-06-18 02:55:20,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=106080.0, ans=0.015 2023-06-18 02:55:41,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=106140.0, ans=0.2 2023-06-18 02:55:43,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=106140.0, ans=0.2 2023-06-18 02:55:55,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=106140.0, ans=0.125 2023-06-18 02:56:02,784 INFO [train.py:996] (1/4) Epoch 1, batch 17700, loss[loss=0.2489, simple_loss=0.2871, pruned_loss=0.1053, over 21248.00 frames. ], tot_loss[loss=0.3424, simple_loss=0.389, pruned_loss=0.1479, over 4258399.01 frames. ], batch size: 159, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:56:45,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=22.76 vs. limit=22.5 2023-06-18 02:57:07,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.749e+02 4.413e+02 5.536e+02 1.027e+03, threshold=8.827e+02, percent-clipped=1.0 2023-06-18 02:57:08,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=106380.0, ans=0.0 2023-06-18 02:57:13,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=106380.0, ans=0.125 2023-06-18 02:57:45,715 INFO [train.py:996] (1/4) Epoch 1, batch 17750, loss[loss=0.4015, simple_loss=0.4487, pruned_loss=0.1771, over 21767.00 frames. ], tot_loss[loss=0.3562, simple_loss=0.4011, pruned_loss=0.1557, over 4266875.06 frames. ], batch size: 124, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:57:52,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=106500.0, ans=0.0 2023-06-18 02:59:00,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=106680.0, ans=0.1 2023-06-18 02:59:19,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=106740.0, ans=0.2 2023-06-18 02:59:27,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=106740.0, ans=6.0 2023-06-18 02:59:30,612 INFO [train.py:996] (1/4) Epoch 1, batch 17800, loss[loss=0.2995, simple_loss=0.3528, pruned_loss=0.1231, over 21547.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3985, pruned_loss=0.1538, over 4263030.39 frames. 
], batch size: 230, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:00:01,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=106860.0, ans=0.05 2023-06-18 03:00:06,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=106860.0, ans=0.125 2023-06-18 03:00:32,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-18 03:00:41,769 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.616e+02 4.998e+02 5.902e+02 1.082e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-18 03:01:25,746 INFO [train.py:996] (1/4) Epoch 1, batch 17850, loss[loss=0.4146, simple_loss=0.4334, pruned_loss=0.1979, over 21729.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.3995, pruned_loss=0.1535, over 4266837.12 frames. ], batch size: 298, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:01:45,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.23 vs. limit=22.5 2023-06-18 03:03:01,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=107340.0, ans=0.04949747468305833 2023-06-18 03:03:11,352 INFO [train.py:996] (1/4) Epoch 1, batch 17900, loss[loss=0.3537, simple_loss=0.4284, pruned_loss=0.1395, over 20792.00 frames. ], tot_loss[loss=0.3624, simple_loss=0.4075, pruned_loss=0.1587, over 4268619.49 frames. ], batch size: 608, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:03:15,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=107400.0, ans=0.5 2023-06-18 03:03:40,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.84 vs. limit=12.0 2023-06-18 03:03:59,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=107520.0, ans=0.025 2023-06-18 03:04:10,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 4.154e+02 5.194e+02 6.786e+02 1.159e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-18 03:04:16,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=107580.0, ans=0.0 2023-06-18 03:04:55,225 INFO [train.py:996] (1/4) Epoch 1, batch 17950, loss[loss=0.2824, simple_loss=0.3597, pruned_loss=0.1026, over 21704.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.405, pruned_loss=0.1539, over 4267464.47 frames. ], batch size: 298, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:04:55,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=107700.0, ans=0.95 2023-06-18 03:05:07,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.60 vs. limit=6.0 2023-06-18 03:05:23,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=107760.0, ans=0.125 2023-06-18 03:05:51,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. 
limit=15.0 2023-06-18 03:06:38,438 INFO [train.py:996] (1/4) Epoch 1, batch 18000, loss[loss=0.4304, simple_loss=0.4213, pruned_loss=0.2197, over 21354.00 frames. ], tot_loss[loss=0.3499, simple_loss=0.3967, pruned_loss=0.1515, over 4269281.65 frames. ], batch size: 473, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:06:38,438 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 03:06:48,051 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9065, 3.2435, 1.6027, 1.9600], device='cuda:1') 2023-06-18 03:06:57,885 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3324, simple_loss=0.4216, pruned_loss=0.1216, over 1796401.00 frames. 2023-06-18 03:06:57,886 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 03:07:37,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=108120.0, ans=0.0 2023-06-18 03:08:02,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108180.0, ans=0.125 2023-06-18 03:08:03,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.501e+02 4.751e+02 6.240e+02 1.819e+03, threshold=9.502e+02, percent-clipped=6.0 2023-06-18 03:08:41,366 INFO [train.py:996] (1/4) Epoch 1, batch 18050, loss[loss=0.3198, simple_loss=0.3661, pruned_loss=0.1368, over 21723.00 frames. ], tot_loss[loss=0.3451, simple_loss=0.3896, pruned_loss=0.1503, over 4272723.73 frames. ], batch size: 247, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:09:28,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=108420.0, ans=0.0 2023-06-18 03:09:31,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.08 vs. limit=6.0 2023-06-18 03:09:57,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=108480.0, ans=0.125 2023-06-18 03:10:13,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=108540.0, ans=0.0 2023-06-18 03:10:13,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=108540.0, ans=0.0 2023-06-18 03:10:16,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=108540.0, ans=0.125 2023-06-18 03:10:19,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=108540.0, ans=0.2 2023-06-18 03:10:33,497 INFO [train.py:996] (1/4) Epoch 1, batch 18100, loss[loss=0.3371, simple_loss=0.415, pruned_loss=0.1296, over 21725.00 frames. ], tot_loss[loss=0.3497, simple_loss=0.3938, pruned_loss=0.1528, over 4272158.03 frames. 
], batch size: 298, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:10:51,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=108660.0, ans=0.0 2023-06-18 03:11:29,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=108720.0, ans=0.0 2023-06-18 03:11:33,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 4.148e+02 5.231e+02 6.853e+02 1.250e+03, threshold=1.046e+03, percent-clipped=5.0 2023-06-18 03:11:49,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=108780.0, ans=0.2 2023-06-18 03:11:51,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=108780.0, ans=0.2 2023-06-18 03:12:17,527 INFO [train.py:996] (1/4) Epoch 1, batch 18150, loss[loss=0.3306, simple_loss=0.364, pruned_loss=0.1486, over 21631.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3933, pruned_loss=0.1511, over 4271731.89 frames. ], batch size: 247, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:13:24,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=109080.0, ans=0.0 2023-06-18 03:13:58,919 INFO [train.py:996] (1/4) Epoch 1, batch 18200, loss[loss=0.3344, simple_loss=0.3676, pruned_loss=0.1506, over 21587.00 frames. ], tot_loss[loss=0.3447, simple_loss=0.3869, pruned_loss=0.1512, over 4261921.19 frames. ], batch size: 391, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:14:28,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=109320.0, ans=0.125 2023-06-18 03:14:40,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-18 03:14:49,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-18 03:14:56,475 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 3.703e+02 5.000e+02 6.238e+02 9.945e+02, threshold=1.000e+03, percent-clipped=0.0 2023-06-18 03:15:33,205 INFO [train.py:996] (1/4) Epoch 1, batch 18250, loss[loss=0.2943, simple_loss=0.3483, pruned_loss=0.1202, over 21797.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3747, pruned_loss=0.1443, over 4259513.08 frames. ], batch size: 118, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:15:38,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=109500.0, ans=0.125 2023-06-18 03:16:23,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109620.0, ans=0.1 2023-06-18 03:16:49,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.05 vs. limit=22.5 2023-06-18 03:16:53,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=109740.0, ans=0.1 2023-06-18 03:17:17,158 INFO [train.py:996] (1/4) Epoch 1, batch 18300, loss[loss=0.367, simple_loss=0.4014, pruned_loss=0.1663, over 21324.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3782, pruned_loss=0.1477, over 4262623.53 frames. 
], batch size: 176, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:17:23,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=109800.0, ans=0.035 2023-06-18 03:17:23,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=109800.0, ans=0.125 2023-06-18 03:17:55,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=109920.0, ans=0.125 2023-06-18 03:18:07,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=109920.0, ans=0.0 2023-06-18 03:18:21,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.385e+02 4.237e+02 5.760e+02 9.388e+02, threshold=8.473e+02, percent-clipped=0.0 2023-06-18 03:19:00,049 INFO [train.py:996] (1/4) Epoch 1, batch 18350, loss[loss=0.4494, simple_loss=0.4423, pruned_loss=0.2283, over 21295.00 frames. ], tot_loss[loss=0.3425, simple_loss=0.3868, pruned_loss=0.1491, over 4259909.45 frames. ], batch size: 471, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:19:33,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=110160.0, ans=0.1 2023-06-18 03:19:37,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.13 vs. limit=22.5 2023-06-18 03:19:42,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=110220.0, ans=0.0 2023-06-18 03:19:42,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-18 03:20:04,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=110280.0, ans=0.2 2023-06-18 03:20:06,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=110280.0, ans=0.125 2023-06-18 03:20:41,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110340.0, ans=0.1 2023-06-18 03:20:44,040 INFO [train.py:996] (1/4) Epoch 1, batch 18400, loss[loss=0.314, simple_loss=0.3603, pruned_loss=0.1338, over 21549.00 frames. ], tot_loss[loss=0.3391, simple_loss=0.383, pruned_loss=0.1476, over 4265560.89 frames. ], batch size: 263, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:21:07,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=110460.0, ans=0.0 2023-06-18 03:21:20,810 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:21:48,645 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.132e+02 3.866e+02 5.099e+02 1.496e+03, threshold=7.733e+02, percent-clipped=6.0 2023-06-18 03:22:05,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110580.0, ans=0.0 2023-06-18 03:22:17,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. 
limit=15.0 2023-06-18 03:22:22,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2023-06-18 03:22:23,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=110640.0, ans=0.2 2023-06-18 03:22:31,353 INFO [train.py:996] (1/4) Epoch 1, batch 18450, loss[loss=0.3394, simple_loss=0.4083, pruned_loss=0.1353, over 21501.00 frames. ], tot_loss[loss=0.3281, simple_loss=0.3767, pruned_loss=0.1397, over 4251155.13 frames. ], batch size: 473, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:22:45,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-18 03:22:46,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-18 03:22:57,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110760.0, ans=0.125 2023-06-18 03:23:40,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=110880.0, ans=0.0 2023-06-18 03:24:10,264 INFO [train.py:996] (1/4) Epoch 1, batch 18500, loss[loss=0.3321, simple_loss=0.3874, pruned_loss=0.1384, over 21697.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.3722, pruned_loss=0.1383, over 4242573.78 frames. ], batch size: 332, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:24:17,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=111000.0, ans=0.125 2023-06-18 03:24:48,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-18 03:25:14,308 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 3.536e+02 4.866e+02 6.375e+02 1.291e+03, threshold=9.732e+02, percent-clipped=16.0 2023-06-18 03:25:48,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=111240.0, ans=0.0 2023-06-18 03:25:52,422 INFO [train.py:996] (1/4) Epoch 1, batch 18550, loss[loss=0.3045, simple_loss=0.3534, pruned_loss=0.1278, over 21781.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3705, pruned_loss=0.1383, over 4249923.08 frames. ], batch size: 316, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:25:59,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=111300.0, ans=0.125 2023-06-18 03:26:01,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=111300.0, ans=0.0 2023-06-18 03:26:47,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-06-18 03:27:16,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=111480.0, ans=0.0 2023-06-18 03:27:44,738 INFO [train.py:996] (1/4) Epoch 1, batch 18600, loss[loss=0.384, simple_loss=0.4328, pruned_loss=0.1676, over 20880.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.369, pruned_loss=0.1393, over 4255732.71 frames. 
], batch size: 609, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:27:52,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-18 03:28:44,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.656e+02 4.245e+02 5.529e+02 8.990e+02, threshold=8.491e+02, percent-clipped=0.0 2023-06-18 03:29:01,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=111780.0, ans=0.125 2023-06-18 03:29:03,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=111780.0, ans=0.0 2023-06-18 03:29:10,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=111840.0, ans=0.125 2023-06-18 03:29:27,665 INFO [train.py:996] (1/4) Epoch 1, batch 18650, loss[loss=0.2963, simple_loss=0.3594, pruned_loss=0.1166, over 21672.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3674, pruned_loss=0.1397, over 4253653.14 frames. ], batch size: 298, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:31:04,845 INFO [train.py:996] (1/4) Epoch 1, batch 18700, loss[loss=0.2909, simple_loss=0.3476, pruned_loss=0.1171, over 19939.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.366, pruned_loss=0.1415, over 4255876.56 frames. ], batch size: 703, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:31:05,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.98 vs. limit=15.0 2023-06-18 03:31:43,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=112320.0, ans=15.0 2023-06-18 03:31:53,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=112320.0, ans=0.125 2023-06-18 03:32:03,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.596e+02 4.888e+02 6.142e+02 1.184e+03, threshold=9.776e+02, percent-clipped=7.0 2023-06-18 03:32:25,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=112380.0, ans=0.125 2023-06-18 03:32:45,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-18 03:32:47,377 INFO [train.py:996] (1/4) Epoch 1, batch 18750, loss[loss=0.3456, simple_loss=0.3636, pruned_loss=0.1638, over 21208.00 frames. ], tot_loss[loss=0.3305, simple_loss=0.3697, pruned_loss=0.1457, over 4255355.56 frames. ], batch size: 176, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:33:02,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=112500.0, ans=0.125 2023-06-18 03:33:15,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=112560.0, ans=0.5 2023-06-18 03:33:25,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. 
limit=22.5 2023-06-18 03:33:42,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=112620.0, ans=0.1 2023-06-18 03:33:45,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-18 03:33:56,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=112680.0, ans=0.125 2023-06-18 03:34:26,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=112740.0, ans=0.0 2023-06-18 03:34:31,338 INFO [train.py:996] (1/4) Epoch 1, batch 18800, loss[loss=0.2074, simple_loss=0.2618, pruned_loss=0.07647, over 16159.00 frames. ], tot_loss[loss=0.3347, simple_loss=0.3758, pruned_loss=0.1468, over 4257577.64 frames. ], batch size: 60, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:34:37,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-18 03:34:51,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=112860.0, ans=0.125 2023-06-18 03:34:53,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=112860.0, ans=0.025 2023-06-18 03:35:38,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.151e+02 4.208e+02 5.876e+02 1.169e+03, threshold=8.416e+02, percent-clipped=1.0 2023-06-18 03:36:07,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=113040.0, ans=0.125 2023-06-18 03:36:15,756 INFO [train.py:996] (1/4) Epoch 1, batch 18850, loss[loss=0.2666, simple_loss=0.3337, pruned_loss=0.09974, over 21494.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3692, pruned_loss=0.1385, over 4255931.18 frames. ], batch size: 211, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:37:02,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-18 03:37:39,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=113280.0, ans=0.04949747468305833 2023-06-18 03:37:54,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=113340.0, ans=0.0 2023-06-18 03:37:57,690 INFO [train.py:996] (1/4) Epoch 1, batch 18900, loss[loss=0.3222, simple_loss=0.3542, pruned_loss=0.1451, over 21802.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3685, pruned_loss=0.1408, over 4263731.81 frames. 
], batch size: 316, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:38:01,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113400.0, ans=0.1 2023-06-18 03:38:04,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=113400.0, ans=0.125 2023-06-18 03:38:50,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=113520.0, ans=0.0 2023-06-18 03:38:57,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.397e+02 4.704e+02 6.198e+02 1.365e+03, threshold=9.409e+02, percent-clipped=10.0 2023-06-18 03:39:27,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=113640.0, ans=0.125 2023-06-18 03:39:36,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113640.0, ans=0.1 2023-06-18 03:39:41,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.00 vs. limit=10.0 2023-06-18 03:39:48,132 INFO [train.py:996] (1/4) Epoch 1, batch 18950, loss[loss=0.4359, simple_loss=0.4888, pruned_loss=0.1914, over 21572.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.3725, pruned_loss=0.1462, over 4265856.76 frames. ], batch size: 508, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:40:05,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-18 03:41:08,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=113940.0, ans=0.2 2023-06-18 03:41:31,605 INFO [train.py:996] (1/4) Epoch 1, batch 19000, loss[loss=0.331, simple_loss=0.3763, pruned_loss=0.1429, over 19930.00 frames. ], tot_loss[loss=0.3417, simple_loss=0.3825, pruned_loss=0.1505, over 4265775.99 frames. ], batch size: 702, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:41:40,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114000.0, ans=0.1 2023-06-18 03:42:10,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114120.0, ans=0.1 2023-06-18 03:42:37,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.163e+02 4.936e+02 6.528e+02 1.667e+03, threshold=9.873e+02, percent-clipped=8.0 2023-06-18 03:43:07,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.57 vs. limit=10.0 2023-06-18 03:43:15,045 INFO [train.py:996] (1/4) Epoch 1, batch 19050, loss[loss=0.3974, simple_loss=0.4326, pruned_loss=0.1812, over 21563.00 frames. ], tot_loss[loss=0.3511, simple_loss=0.3902, pruned_loss=0.156, over 4275672.91 frames. ], batch size: 414, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:43:59,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-18 03:44:58,922 INFO [train.py:996] (1/4) Epoch 1, batch 19100, loss[loss=0.3862, simple_loss=0.394, pruned_loss=0.1892, over 21306.00 frames. 
], tot_loss[loss=0.3513, simple_loss=0.3885, pruned_loss=0.1571, over 4271758.41 frames. ], batch size: 471, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:45:03,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=114600.0, ans=0.125 2023-06-18 03:45:58,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=114720.0, ans=0.125 2023-06-18 03:46:04,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.557e+02 3.849e+02 4.992e+02 6.577e+02 2.048e+03, threshold=9.985e+02, percent-clipped=3.0 2023-06-18 03:46:06,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-18 03:46:09,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=114780.0, ans=0.0 2023-06-18 03:46:29,328 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:46:43,843 INFO [train.py:996] (1/4) Epoch 1, batch 19150, loss[loss=0.3475, simple_loss=0.3974, pruned_loss=0.1488, over 21600.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.3904, pruned_loss=0.1576, over 4272292.95 frames. ], batch size: 263, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:47:15,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=114960.0, ans=0.07 2023-06-18 03:48:00,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115080.0, ans=0.125 2023-06-18 03:48:09,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=115080.0, ans=0.0 2023-06-18 03:48:18,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=115140.0, ans=0.2 2023-06-18 03:48:29,954 INFO [train.py:996] (1/4) Epoch 1, batch 19200, loss[loss=0.3684, simple_loss=0.4452, pruned_loss=0.1458, over 21740.00 frames. ], tot_loss[loss=0.3568, simple_loss=0.4002, pruned_loss=0.1567, over 4280652.95 frames. 
], batch size: 351, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:48:37,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=115200.0, ans=0.0 2023-06-18 03:48:48,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115200.0, ans=0.1 2023-06-18 03:49:16,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115320.0, ans=0.1 2023-06-18 03:49:24,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=115320.0, ans=0.1 2023-06-18 03:49:30,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.474e+02 4.215e+02 5.397e+02 9.229e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:49:32,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=115380.0, ans=0.125 2023-06-18 03:49:57,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=115440.0, ans=0.05 2023-06-18 03:49:59,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=115440.0, ans=0.125 2023-06-18 03:50:08,859 INFO [train.py:996] (1/4) Epoch 1, batch 19250, loss[loss=0.3454, simple_loss=0.3988, pruned_loss=0.146, over 21591.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3967, pruned_loss=0.1476, over 4272658.15 frames. ], batch size: 471, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:50:21,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115500.0, ans=0.1 2023-06-18 03:50:47,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=115560.0, ans=0.125 2023-06-18 03:51:01,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=115620.0, ans=0.0 2023-06-18 03:51:30,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=22.5 2023-06-18 03:51:34,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=115740.0, ans=0.125 2023-06-18 03:51:36,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=115740.0, ans=10.0 2023-06-18 03:51:52,366 INFO [train.py:996] (1/4) Epoch 1, batch 19300, loss[loss=0.3014, simple_loss=0.3441, pruned_loss=0.1293, over 21824.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.393, pruned_loss=0.1458, over 4280965.11 frames. ], batch size: 124, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:52:01,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=115800.0, ans=0.125 2023-06-18 03:52:45,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.60 vs. 
limit=22.5 2023-06-18 03:53:02,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 3.240e+02 4.224e+02 5.313e+02 1.250e+03, threshold=8.447e+02, percent-clipped=7.0 2023-06-18 03:53:13,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=115980.0, ans=0.125 2023-06-18 03:53:39,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=19.02 vs. limit=15.0 2023-06-18 03:53:40,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116100.0, ans=0.1 2023-06-18 03:53:41,266 INFO [train.py:996] (1/4) Epoch 1, batch 19350, loss[loss=0.2298, simple_loss=0.2866, pruned_loss=0.08653, over 21805.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3868, pruned_loss=0.141, over 4283817.58 frames. ], batch size: 112, lr: 2.71e-02, grad_scale: 64.0 2023-06-18 03:54:22,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-18 03:54:24,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-18 03:54:41,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=116220.0, ans=0.1 2023-06-18 03:55:21,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=116340.0, ans=0.02 2023-06-18 03:55:23,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-18 03:55:24,566 INFO [train.py:996] (1/4) Epoch 1, batch 19400, loss[loss=0.4053, simple_loss=0.422, pruned_loss=0.1943, over 21605.00 frames. ], tot_loss[loss=0.3318, simple_loss=0.3844, pruned_loss=0.1396, over 4280983.40 frames. ], batch size: 471, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:55:26,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=116400.0, ans=0.0 2023-06-18 03:55:29,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-18 03:55:51,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=116460.0, ans=0.0 2023-06-18 03:56:04,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=116460.0, ans=0.125 2023-06-18 03:56:29,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.749e+02 4.636e+02 5.829e+02 1.066e+03, threshold=9.272e+02, percent-clipped=6.0 2023-06-18 03:56:32,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.48 vs. limit=10.0 2023-06-18 03:56:49,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.08 vs. 
limit=22.5 2023-06-18 03:56:56,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.11 vs. limit=6.0 2023-06-18 03:57:05,434 INFO [train.py:996] (1/4) Epoch 1, batch 19450, loss[loss=0.3053, simple_loss=0.3363, pruned_loss=0.1372, over 21612.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3843, pruned_loss=0.145, over 4286907.13 frames. ], batch size: 247, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:58:26,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=116940.0, ans=0.125 2023-06-18 03:58:45,876 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:58:56,288 INFO [train.py:996] (1/4) Epoch 1, batch 19500, loss[loss=0.2852, simple_loss=0.338, pruned_loss=0.1162, over 21755.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3792, pruned_loss=0.1465, over 4286900.99 frames. ], batch size: 282, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:59:13,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=117000.0, ans=0.125 2023-06-18 03:59:24,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=117060.0, ans=0.0 2023-06-18 03:59:47,604 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:59:49,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2023-06-18 03:59:57,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.820e+02 4.726e+02 6.793e+02 1.461e+03, threshold=9.451e+02, percent-clipped=7.0 2023-06-18 04:00:00,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117180.0, ans=0.1 2023-06-18 04:00:32,867 INFO [train.py:996] (1/4) Epoch 1, batch 19550, loss[loss=0.2331, simple_loss=0.2776, pruned_loss=0.09436, over 21488.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3736, pruned_loss=0.1425, over 4289270.66 frames. ], batch size: 195, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:01:04,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=117360.0, ans=0.95 2023-06-18 04:01:06,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.79 vs. limit=22.5 2023-06-18 04:01:15,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=117420.0, ans=0.125 2023-06-18 04:01:17,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=15.0 2023-06-18 04:01:54,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-18 04:02:13,851 INFO [train.py:996] (1/4) Epoch 1, batch 19600, loss[loss=0.3154, simple_loss=0.3541, pruned_loss=0.1384, over 21828.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3739, pruned_loss=0.1427, over 4293555.28 frames. 
], batch size: 247, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:03:20,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.478e+02 4.292e+02 5.648e+02 1.125e+03, threshold=8.585e+02, percent-clipped=2.0 2023-06-18 04:03:20,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=117780.0, ans=0.04949747468305833 2023-06-18 04:03:38,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=117840.0, ans=0.0 2023-06-18 04:04:03,563 INFO [train.py:996] (1/4) Epoch 1, batch 19650, loss[loss=0.3731, simple_loss=0.4185, pruned_loss=0.1639, over 21788.00 frames. ], tot_loss[loss=0.3415, simple_loss=0.3828, pruned_loss=0.1501, over 4295323.47 frames. ], batch size: 118, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:05:19,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=118080.0, ans=0.0 2023-06-18 04:05:51,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=22.5 2023-06-18 04:05:52,057 INFO [train.py:996] (1/4) Epoch 1, batch 19700, loss[loss=0.311, simple_loss=0.3781, pruned_loss=0.1219, over 21704.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3873, pruned_loss=0.1508, over 4294166.82 frames. ], batch size: 298, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:06:10,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=118200.0, ans=0.0 2023-06-18 04:06:12,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=118260.0, ans=0.2 2023-06-18 04:06:32,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=118320.0, ans=0.0 2023-06-18 04:06:49,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=118320.0, ans=0.1 2023-06-18 04:06:59,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.705e+02 3.771e+02 4.552e+02 5.763e+02 1.165e+03, threshold=9.104e+02, percent-clipped=3.0 2023-06-18 04:07:01,988 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:07:27,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=118440.0, ans=0.0 2023-06-18 04:07:30,614 INFO [train.py:996] (1/4) Epoch 1, batch 19750, loss[loss=0.4547, simple_loss=0.4947, pruned_loss=0.2073, over 21802.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3976, pruned_loss=0.1532, over 4293869.72 frames. ], batch size: 414, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:07:51,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=118560.0, ans=0.2 2023-06-18 04:07:58,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=118560.0, ans=0.125 2023-06-18 04:08:53,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=118680.0, ans=0.125 2023-06-18 04:09:17,492 INFO [train.py:996] (1/4) Epoch 1, batch 19800, loss[loss=0.3208, simple_loss=0.3711, pruned_loss=0.1353, over 21742.00 frames. 
], tot_loss[loss=0.3541, simple_loss=0.3981, pruned_loss=0.155, over 4296725.52 frames. ], batch size: 298, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:09:45,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=118860.0, ans=0.125 2023-06-18 04:10:23,926 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.634e+02 4.450e+02 5.874e+02 9.997e+02, threshold=8.899e+02, percent-clipped=2.0 2023-06-18 04:10:30,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.79 vs. limit=15.0 2023-06-18 04:10:31,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=118980.0, ans=0.2 2023-06-18 04:10:35,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=118980.0, ans=0.0 2023-06-18 04:11:00,144 INFO [train.py:996] (1/4) Epoch 1, batch 19850, loss[loss=0.2977, simple_loss=0.3627, pruned_loss=0.1164, over 21597.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3877, pruned_loss=0.1456, over 4297287.29 frames. ], batch size: 263, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:11:07,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-18 04:11:11,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119100.0, ans=0.1 2023-06-18 04:12:35,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=119340.0, ans=0.0 2023-06-18 04:12:45,760 INFO [train.py:996] (1/4) Epoch 1, batch 19900, loss[loss=0.3379, simple_loss=0.3834, pruned_loss=0.1462, over 21621.00 frames. ], tot_loss[loss=0.3336, simple_loss=0.3853, pruned_loss=0.141, over 4291040.23 frames. ], batch size: 332, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:12:50,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=119400.0, ans=0.1 2023-06-18 04:13:09,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-18 04:13:13,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=119460.0, ans=0.125 2023-06-18 04:13:51,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.596e+02 4.410e+02 6.393e+02 1.239e+03, threshold=8.821e+02, percent-clipped=7.0 2023-06-18 04:14:27,885 INFO [train.py:996] (1/4) Epoch 1, batch 19950, loss[loss=0.2985, simple_loss=0.3522, pruned_loss=0.1224, over 21620.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3811, pruned_loss=0.1423, over 4278261.64 frames. ], batch size: 230, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:14:49,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-06-18 04:14:50,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=119760.0, ans=0.0 2023-06-18 04:15:00,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=119760.0, ans=0.125 2023-06-18 04:15:22,572 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:15:33,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=119880.0, ans=0.0 2023-06-18 04:15:34,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=119880.0, ans=0.0 2023-06-18 04:15:57,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=119940.0, ans=0.0 2023-06-18 04:16:12,227 INFO [train.py:996] (1/4) Epoch 1, batch 20000, loss[loss=0.475, simple_loss=0.4728, pruned_loss=0.2386, over 21660.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3842, pruned_loss=0.1439, over 4276470.93 frames. ], batch size: 508, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:16:48,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.76 vs. limit=22.5 2023-06-18 04:16:52,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=120060.0, ans=0.125 2023-06-18 04:16:53,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=120120.0, ans=0.2 2023-06-18 04:17:01,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. limit=15.0 2023-06-18 04:17:18,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.426e+02 6.098e+02 1.164e+03, threshold=8.852e+02, percent-clipped=3.0 2023-06-18 04:17:30,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=120180.0, ans=0.125 2023-06-18 04:17:40,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=120240.0, ans=0.0 2023-06-18 04:17:51,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120300.0, ans=0.125 2023-06-18 04:17:52,847 INFO [train.py:996] (1/4) Epoch 1, batch 20050, loss[loss=0.3665, simple_loss=0.3953, pruned_loss=0.1688, over 21515.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3871, pruned_loss=0.148, over 4281811.41 frames. 
], batch size: 195, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:18:30,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=120360.0, ans=10.0 2023-06-18 04:18:57,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=120480.0, ans=0.2 2023-06-18 04:19:17,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=120480.0, ans=0.2 2023-06-18 04:19:26,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=120540.0, ans=0.2 2023-06-18 04:19:37,714 INFO [train.py:996] (1/4) Epoch 1, batch 20100, loss[loss=0.4025, simple_loss=0.4485, pruned_loss=0.1783, over 21846.00 frames. ], tot_loss[loss=0.347, simple_loss=0.3905, pruned_loss=0.1518, over 4289421.51 frames. ], batch size: 332, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:20:09,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=120660.0, ans=0.05 2023-06-18 04:20:25,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=120720.0, ans=0.125 2023-06-18 04:20:37,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=120720.0, ans=0.125 2023-06-18 04:20:40,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=120720.0, ans=0.2 2023-06-18 04:20:51,573 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.978e+02 4.839e+02 6.470e+02 1.176e+03, threshold=9.678e+02, percent-clipped=4.0 2023-06-18 04:21:32,572 INFO [train.py:996] (1/4) Epoch 1, batch 20150, loss[loss=0.3654, simple_loss=0.4085, pruned_loss=0.1612, over 21630.00 frames. ], tot_loss[loss=0.358, simple_loss=0.4016, pruned_loss=0.1571, over 4291348.50 frames. ], batch size: 263, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:21:55,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=120960.0, ans=0.1 2023-06-18 04:22:22,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=121020.0, ans=0.025 2023-06-18 04:22:27,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=121020.0, ans=0.0 2023-06-18 04:23:23,941 INFO [train.py:996] (1/4) Epoch 1, batch 20200, loss[loss=0.3803, simple_loss=0.4586, pruned_loss=0.151, over 20726.00 frames. ], tot_loss[loss=0.365, simple_loss=0.4076, pruned_loss=0.1612, over 4286812.47 frames. 
], batch size: 607, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:23:36,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=121200.0, ans=0.07 2023-06-18 04:23:55,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=121260.0, ans=0.1 2023-06-18 04:24:04,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=121320.0, ans=0.125 2023-06-18 04:24:26,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 4.003e+02 5.201e+02 6.811e+02 1.420e+03, threshold=1.040e+03, percent-clipped=11.0 2023-06-18 04:25:04,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=121500.0, ans=0.025 2023-06-18 04:25:05,801 INFO [train.py:996] (1/4) Epoch 1, batch 20250, loss[loss=0.313, simple_loss=0.3666, pruned_loss=0.1298, over 21444.00 frames. ], tot_loss[loss=0.3602, simple_loss=0.4069, pruned_loss=0.1568, over 4282130.75 frames. ], batch size: 194, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:25:58,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=121620.0, ans=0.5 2023-06-18 04:26:47,830 INFO [train.py:996] (1/4) Epoch 1, batch 20300, loss[loss=0.2665, simple_loss=0.3283, pruned_loss=0.1024, over 21308.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.4018, pruned_loss=0.1514, over 4275547.70 frames. ], batch size: 131, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:27:42,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=121920.0, ans=0.2 2023-06-18 04:27:45,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121980.0, ans=0.125 2023-06-18 04:27:55,661 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.171e+02 3.718e+02 4.802e+02 8.828e+02, threshold=7.436e+02, percent-clipped=0.0 2023-06-18 04:28:28,651 INFO [train.py:996] (1/4) Epoch 1, batch 20350, loss[loss=0.352, simple_loss=0.3974, pruned_loss=0.1533, over 21465.00 frames. ], tot_loss[loss=0.3532, simple_loss=0.4014, pruned_loss=0.1525, over 4260430.98 frames. ], batch size: 131, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:28:41,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=122100.0, ans=0.2 2023-06-18 04:29:02,093 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-18 04:29:06,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=122220.0, ans=0.0 2023-06-18 04:29:12,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=122220.0, ans=0.05 2023-06-18 04:29:23,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2023-06-18 04:29:58,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=122340.0, ans=0.2 2023-06-18 04:30:16,413 INFO [train.py:996] (1/4) Epoch 1, batch 20400, loss[loss=0.3665, simple_loss=0.406, pruned_loss=0.1635, over 21629.00 frames. ], tot_loss[loss=0.3606, simple_loss=0.4059, pruned_loss=0.1577, over 4267601.22 frames. ], batch size: 263, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:30:32,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-18 04:30:36,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=122460.0, ans=0.1 2023-06-18 04:30:44,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=122460.0, ans=0.125 2023-06-18 04:30:49,211 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:30:49,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=122520.0, ans=0.125 2023-06-18 04:30:53,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=122520.0, ans=0.125 2023-06-18 04:31:01,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=22.5 2023-06-18 04:31:13,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 4.014e+02 4.909e+02 5.768e+02 1.154e+03, threshold=9.817e+02, percent-clipped=10.0 2023-06-18 04:31:46,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2023-06-18 04:31:48,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=122640.0, ans=0.2 2023-06-18 04:31:53,063 INFO [train.py:996] (1/4) Epoch 1, batch 20450, loss[loss=0.3072, simple_loss=0.3438, pruned_loss=0.1353, over 20942.00 frames. ], tot_loss[loss=0.3637, simple_loss=0.4063, pruned_loss=0.1606, over 4251329.84 frames. ], batch size: 608, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:32:04,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=122700.0, ans=0.125 2023-06-18 04:32:43,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-18 04:33:34,853 INFO [train.py:996] (1/4) Epoch 1, batch 20500, loss[loss=0.3494, simple_loss=0.3838, pruned_loss=0.1575, over 21405.00 frames. ], tot_loss[loss=0.3612, simple_loss=0.4012, pruned_loss=0.1606, over 4267386.45 frames. 
], batch size: 194, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:34:43,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.898e+02 4.731e+02 5.915e+02 1.084e+03, threshold=9.462e+02, percent-clipped=4.0 2023-06-18 04:34:45,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=123180.0, ans=0.125 2023-06-18 04:35:04,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=123240.0, ans=0.2 2023-06-18 04:35:11,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-18 04:35:22,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=123300.0, ans=0.04949747468305833 2023-06-18 04:35:23,672 INFO [train.py:996] (1/4) Epoch 1, batch 20550, loss[loss=0.3595, simple_loss=0.437, pruned_loss=0.141, over 19866.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.3923, pruned_loss=0.1574, over 4261865.17 frames. ], batch size: 702, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:35:25,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=123300.0, ans=0.0 2023-06-18 04:35:30,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=123300.0, ans=0.125 2023-06-18 04:35:42,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=123360.0, ans=0.125 2023-06-18 04:35:56,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=123420.0, ans=0.0 2023-06-18 04:36:47,802 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:37:02,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=123540.0, ans=0.125 2023-06-18 04:37:02,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. limit=10.0 2023-06-18 04:37:05,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=123600.0, ans=0.125 2023-06-18 04:37:06,850 INFO [train.py:996] (1/4) Epoch 1, batch 20600, loss[loss=0.3271, simple_loss=0.3663, pruned_loss=0.144, over 21756.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3936, pruned_loss=0.1546, over 4257076.61 frames. ], batch size: 247, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:37:10,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.07 vs. limit=6.0 2023-06-18 04:37:20,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=123600.0, ans=0.2 2023-06-18 04:37:20,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.59 vs. 
limit=6.0 2023-06-18 04:38:03,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=123720.0, ans=0.0 2023-06-18 04:38:03,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=123720.0, ans=0.04949747468305833 2023-06-18 04:38:05,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=123780.0, ans=22.5 2023-06-18 04:38:09,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.464e+02 4.439e+02 5.514e+02 9.400e+02, threshold=8.878e+02, percent-clipped=0.0 2023-06-18 04:38:26,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=123840.0, ans=0.125 2023-06-18 04:38:46,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-18 04:38:48,796 INFO [train.py:996] (1/4) Epoch 1, batch 20650, loss[loss=0.3385, simple_loss=0.3649, pruned_loss=0.156, over 21197.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3894, pruned_loss=0.1555, over 4266813.29 frames. ], batch size: 608, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:38:54,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=123900.0, ans=0.0 2023-06-18 04:39:01,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-18 04:39:05,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=123960.0, ans=0.0 2023-06-18 04:39:50,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=124080.0, ans=0.125 2023-06-18 04:40:28,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=124140.0, ans=0.2 2023-06-18 04:40:31,323 INFO [train.py:996] (1/4) Epoch 1, batch 20700, loss[loss=0.2974, simple_loss=0.3468, pruned_loss=0.124, over 21786.00 frames. ], tot_loss[loss=0.3413, simple_loss=0.3815, pruned_loss=0.1505, over 4262490.85 frames. ], batch size: 371, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:41:17,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=124320.0, ans=0.2 2023-06-18 04:41:29,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2023-06-18 04:41:38,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.293e+02 3.859e+02 5.120e+02 8.262e+02, threshold=7.718e+02, percent-clipped=0.0 2023-06-18 04:42:12,819 INFO [train.py:996] (1/4) Epoch 1, batch 20750, loss[loss=0.5547, simple_loss=0.5715, pruned_loss=0.269, over 21469.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3805, pruned_loss=0.1469, over 4257957.13 frames. 
], batch size: 507, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:42:23,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=124500.0, ans=0.0 2023-06-18 04:42:42,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=124560.0, ans=0.025 2023-06-18 04:42:47,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-18 04:43:16,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-18 04:43:50,184 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:43:53,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=124740.0, ans=0.125 2023-06-18 04:43:56,107 INFO [train.py:996] (1/4) Epoch 1, batch 20800, loss[loss=0.3125, simple_loss=0.342, pruned_loss=0.1415, over 21205.00 frames. ], tot_loss[loss=0.3443, simple_loss=0.387, pruned_loss=0.1508, over 4264467.47 frames. ], batch size: 176, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:43:58,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=124800.0, ans=0.0 2023-06-18 04:44:05,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=124800.0, ans=0.125 2023-06-18 04:44:12,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=124860.0, ans=0.04949747468305833 2023-06-18 04:44:53,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=124920.0, ans=0.1 2023-06-18 04:45:08,183 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.772e+02 4.526e+02 5.632e+02 1.034e+03, threshold=9.051e+02, percent-clipped=9.0 2023-06-18 04:45:36,932 INFO [train.py:996] (1/4) Epoch 1, batch 20850, loss[loss=0.3282, simple_loss=0.3514, pruned_loss=0.1525, over 21240.00 frames. ], tot_loss[loss=0.3358, simple_loss=0.3774, pruned_loss=0.1472, over 4266630.83 frames. ], batch size: 143, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:45:45,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=125100.0, ans=0.125 2023-06-18 04:46:46,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=125280.0, ans=0.95 2023-06-18 04:47:18,183 INFO [train.py:996] (1/4) Epoch 1, batch 20900, loss[loss=0.4047, simple_loss=0.4408, pruned_loss=0.1843, over 21627.00 frames. ], tot_loss[loss=0.3415, simple_loss=0.3822, pruned_loss=0.1504, over 4271086.30 frames. 
], batch size: 473, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:48:06,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=125520.0, ans=0.0 2023-06-18 04:48:23,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125580.0, ans=0.125 2023-06-18 04:48:25,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.309e+02 3.915e+02 5.105e+02 1.001e+03, threshold=7.830e+02, percent-clipped=2.0 2023-06-18 04:48:53,588 INFO [train.py:996] (1/4) Epoch 1, batch 20950, loss[loss=0.2902, simple_loss=0.341, pruned_loss=0.1197, over 20909.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.374, pruned_loss=0.1428, over 4270482.02 frames. ], batch size: 608, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:49:03,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=125700.0, ans=10.0 2023-06-18 04:49:22,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=125760.0, ans=0.125 2023-06-18 04:49:33,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=125820.0, ans=0.125 2023-06-18 04:49:37,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.71 vs. limit=22.5 2023-06-18 04:50:33,080 INFO [train.py:996] (1/4) Epoch 1, batch 21000, loss[loss=0.3801, simple_loss=0.4003, pruned_loss=0.1799, over 21879.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.3754, pruned_loss=0.1449, over 4272366.28 frames. ], batch size: 371, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:50:33,081 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 04:50:50,125 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.3151, simple_loss=0.4075, pruned_loss=0.1114, over 1796401.00 frames. 2023-06-18 04:50:50,126 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 04:51:04,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.57 vs. limit=22.5 2023-06-18 04:51:19,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=126060.0, ans=0.125 2023-06-18 04:51:43,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=12.0 2023-06-18 04:52:02,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.430e+02 4.586e+02 6.344e+02 1.913e+03, threshold=9.172e+02, percent-clipped=11.0 2023-06-18 04:52:03,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-18 04:52:30,642 INFO [train.py:996] (1/4) Epoch 1, batch 21050, loss[loss=0.3249, simple_loss=0.3635, pruned_loss=0.1431, over 21507.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.3737, pruned_loss=0.1457, over 4272920.76 frames. 
], batch size: 441, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:52:57,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=126360.0, ans=0.125 2023-06-18 04:53:03,724 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 04:53:03,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=126360.0, ans=0.0 2023-06-18 04:53:27,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=126420.0, ans=0.05 2023-06-18 04:53:35,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=126480.0, ans=0.2 2023-06-18 04:53:58,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-18 04:54:07,564 INFO [train.py:996] (1/4) Epoch 1, batch 21100, loss[loss=0.2915, simple_loss=0.3251, pruned_loss=0.1289, over 21297.00 frames. ], tot_loss[loss=0.3283, simple_loss=0.3685, pruned_loss=0.144, over 4271603.78 frames. ], batch size: 160, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:54:24,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=126660.0, ans=15.0 2023-06-18 04:54:34,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=126660.0, ans=0.0 2023-06-18 04:55:14,650 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.438e+02 4.271e+02 5.279e+02 9.041e+02, threshold=8.542e+02, percent-clipped=0.0 2023-06-18 04:55:20,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.34 vs. limit=6.0 2023-06-18 04:55:37,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=126840.0, ans=0.125 2023-06-18 04:55:43,336 INFO [train.py:996] (1/4) Epoch 1, batch 21150, loss[loss=0.2868, simple_loss=0.3224, pruned_loss=0.1256, over 21652.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3635, pruned_loss=0.1435, over 4262967.66 frames. ], batch size: 247, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:55:58,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-18 04:56:13,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-18 04:56:29,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. 
limit=6.0 2023-06-18 04:56:49,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=127080.0, ans=0.0 2023-06-18 04:56:51,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127080.0, ans=0.1 2023-06-18 04:57:07,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=127140.0, ans=0.2 2023-06-18 04:57:20,000 INFO [train.py:996] (1/4) Epoch 1, batch 21200, loss[loss=0.3182, simple_loss=0.3491, pruned_loss=0.1437, over 21738.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3603, pruned_loss=0.1425, over 4269030.42 frames. ], batch size: 124, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:57:53,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=127260.0, ans=0.125 2023-06-18 04:58:12,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-18 04:58:18,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=127320.0, ans=0.125 2023-06-18 04:58:33,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.663e+02 4.545e+02 5.734e+02 1.350e+03, threshold=9.091e+02, percent-clipped=8.0 2023-06-18 04:58:49,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127440.0, ans=0.1 2023-06-18 04:59:02,992 INFO [train.py:996] (1/4) Epoch 1, batch 21250, loss[loss=0.3667, simple_loss=0.4123, pruned_loss=0.1605, over 21679.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3601, pruned_loss=0.1433, over 4271264.46 frames. ], batch size: 298, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:59:08,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-18 04:59:29,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127560.0, ans=0.1 2023-06-18 04:59:59,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=127620.0, ans=0.125 2023-06-18 05:00:36,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=127740.0, ans=0.09899494936611666 2023-06-18 05:00:41,952 INFO [train.py:996] (1/4) Epoch 1, batch 21300, loss[loss=0.4199, simple_loss=0.4335, pruned_loss=0.2031, over 21728.00 frames. ], tot_loss[loss=0.3301, simple_loss=0.3677, pruned_loss=0.1462, over 4274390.78 frames. 
], batch size: 473, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:00:50,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=127800.0, ans=0.025 2023-06-18 05:01:04,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=127860.0, ans=0.125 2023-06-18 05:01:12,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=127860.0, ans=0.0 2023-06-18 05:01:14,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-18 05:01:22,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-18 05:01:54,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.515e+02 4.385e+02 5.674e+02 1.308e+03, threshold=8.770e+02, percent-clipped=8.0 2023-06-18 05:01:57,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=127980.0, ans=0.2 2023-06-18 05:02:04,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127980.0, ans=0.1 2023-06-18 05:02:20,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=128040.0, ans=0.1 2023-06-18 05:02:23,574 INFO [train.py:996] (1/4) Epoch 1, batch 21350, loss[loss=0.2771, simple_loss=0.3522, pruned_loss=0.101, over 21784.00 frames. ], tot_loss[loss=0.3316, simple_loss=0.3713, pruned_loss=0.1459, over 4283110.10 frames. ], batch size: 282, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:02:26,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-18 05:02:34,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=128100.0, ans=0.125 2023-06-18 05:03:45,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=128280.0, ans=0.125 2023-06-18 05:03:55,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=128340.0, ans=0.125 2023-06-18 05:04:06,116 INFO [train.py:996] (1/4) Epoch 1, batch 21400, loss[loss=0.2759, simple_loss=0.3324, pruned_loss=0.1096, over 21472.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.374, pruned_loss=0.1439, over 4274939.24 frames. 
], batch size: 194, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:04:23,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=128400.0, ans=0.2 2023-06-18 05:05:17,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=128580.0, ans=0.125 2023-06-18 05:05:18,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.244e+02 3.990e+02 4.956e+02 1.756e+03, threshold=7.981e+02, percent-clipped=8.0 2023-06-18 05:05:40,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=128640.0, ans=0.0 2023-06-18 05:05:47,921 INFO [train.py:996] (1/4) Epoch 1, batch 21450, loss[loss=0.3683, simple_loss=0.4014, pruned_loss=0.1676, over 21781.00 frames. ], tot_loss[loss=0.3346, simple_loss=0.3773, pruned_loss=0.146, over 4275620.05 frames. ], batch size: 441, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:06:06,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=128700.0, ans=0.2 2023-06-18 05:06:42,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=128820.0, ans=0.0 2023-06-18 05:06:57,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=128880.0, ans=0.125 2023-06-18 05:07:21,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-18 05:07:23,864 INFO [train.py:996] (1/4) Epoch 1, batch 21500, loss[loss=0.2967, simple_loss=0.3318, pruned_loss=0.1308, over 21606.00 frames. ], tot_loss[loss=0.3363, simple_loss=0.376, pruned_loss=0.1483, over 4282569.13 frames. ], batch size: 263, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:08:10,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=129060.0, ans=0.125 2023-06-18 05:08:28,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=129120.0, ans=0.125 2023-06-18 05:08:35,885 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.287e+02 4.067e+02 5.300e+02 1.405e+03, threshold=8.134e+02, percent-clipped=7.0 2023-06-18 05:08:36,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=129180.0, ans=0.125 2023-06-18 05:08:45,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129180.0, ans=0.1 2023-06-18 05:09:05,147 INFO [train.py:996] (1/4) Epoch 1, batch 21550, loss[loss=0.3621, simple_loss=0.3693, pruned_loss=0.1775, over 21387.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3678, pruned_loss=0.1447, over 4277747.93 frames. 
], batch size: 509, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:09:12,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=129300.0, ans=0.125 2023-06-18 05:09:39,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=129360.0, ans=0.1 2023-06-18 05:09:39,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=129360.0, ans=0.2 2023-06-18 05:10:15,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.55 vs. limit=6.0 2023-06-18 05:10:15,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.21 vs. limit=6.0 2023-06-18 05:10:35,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=129540.0, ans=0.0 2023-06-18 05:10:53,605 INFO [train.py:996] (1/4) Epoch 1, batch 21600, loss[loss=0.3691, simple_loss=0.4183, pruned_loss=0.16, over 21599.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3612, pruned_loss=0.1413, over 4282644.18 frames. ], batch size: 441, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:11:03,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=129600.0, ans=0.125 2023-06-18 05:11:21,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-18 05:11:22,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=129660.0, ans=0.0 2023-06-18 05:12:01,381 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.331e+02 4.167e+02 5.142e+02 1.133e+03, threshold=8.334e+02, percent-clipped=4.0 2023-06-18 05:12:08,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=129780.0, ans=0.2 2023-06-18 05:12:14,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-18 05:12:34,517 INFO [train.py:996] (1/4) Epoch 1, batch 21650, loss[loss=0.3476, simple_loss=0.4336, pruned_loss=0.1308, over 21228.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3682, pruned_loss=0.1398, over 4276276.98 frames. ], batch size: 548, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:13:02,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=129960.0, ans=0.09899494936611666 2023-06-18 05:13:34,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. 
limit=12.0 2023-06-18 05:13:40,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=130080.0, ans=0.125 2023-06-18 05:13:48,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=130080.0, ans=0.1 2023-06-18 05:14:15,273 INFO [train.py:996] (1/4) Epoch 1, batch 21700, loss[loss=0.2894, simple_loss=0.3545, pruned_loss=0.1121, over 21683.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3657, pruned_loss=0.1354, over 4269972.24 frames. ], batch size: 298, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:14:23,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=130200.0, ans=0.2 2023-06-18 05:15:09,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=130320.0, ans=0.125 2023-06-18 05:15:14,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=130380.0, ans=0.2 2023-06-18 05:15:16,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.479e+02 4.448e+02 5.687e+02 1.020e+03, threshold=8.895e+02, percent-clipped=10.0 2023-06-18 05:15:50,771 INFO [train.py:996] (1/4) Epoch 1, batch 21750, loss[loss=0.3336, simple_loss=0.3583, pruned_loss=0.1545, over 21828.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3625, pruned_loss=0.1351, over 4256318.91 frames. ], batch size: 352, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:15:51,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=130500.0, ans=0.125 2023-06-18 05:16:12,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=130560.0, ans=0.07 2023-06-18 05:16:58,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=130680.0, ans=0.125 2023-06-18 05:17:09,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=130680.0, ans=0.2 2023-06-18 05:17:34,236 INFO [train.py:996] (1/4) Epoch 1, batch 21800, loss[loss=0.2906, simple_loss=0.3287, pruned_loss=0.1262, over 21688.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3614, pruned_loss=0.1374, over 4257113.44 frames. ], batch size: 282, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:17:53,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130800.0, ans=0.125 2023-06-18 05:18:42,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.714e+02 4.449e+02 6.326e+02 1.060e+03, threshold=8.898e+02, percent-clipped=3.0 2023-06-18 05:19:16,630 INFO [train.py:996] (1/4) Epoch 1, batch 21850, loss[loss=0.4124, simple_loss=0.4578, pruned_loss=0.1835, over 21526.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.366, pruned_loss=0.1384, over 4259840.38 frames. ], batch size: 507, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:19:55,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. 
limit=15.0 2023-06-18 05:20:03,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=131220.0, ans=0.0 2023-06-18 05:20:10,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=131220.0, ans=0.125 2023-06-18 05:20:11,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=131220.0, ans=0.125 2023-06-18 05:20:44,183 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:20:55,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=131340.0, ans=0.125 2023-06-18 05:20:58,035 INFO [train.py:996] (1/4) Epoch 1, batch 21900, loss[loss=0.3541, simple_loss=0.3986, pruned_loss=0.1548, over 21618.00 frames. ], tot_loss[loss=0.3282, simple_loss=0.3711, pruned_loss=0.1427, over 4263000.32 frames. ], batch size: 471, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:22:04,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 3.404e+02 4.082e+02 5.077e+02 9.199e+02, threshold=8.164e+02, percent-clipped=1.0 2023-06-18 05:22:38,125 INFO [train.py:996] (1/4) Epoch 1, batch 21950, loss[loss=0.2071, simple_loss=0.2745, pruned_loss=0.06988, over 21240.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3657, pruned_loss=0.1413, over 4265255.54 frames. ], batch size: 159, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:22:55,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=131700.0, ans=0.2 2023-06-18 05:23:07,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=131760.0, ans=0.125 2023-06-18 05:23:11,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=131760.0, ans=0.2 2023-06-18 05:23:30,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=131820.0, ans=0.0 2023-06-18 05:23:50,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=131880.0, ans=0.2 2023-06-18 05:23:57,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=131940.0, ans=0.05 2023-06-18 05:24:19,969 INFO [train.py:996] (1/4) Epoch 1, batch 22000, loss[loss=0.2345, simple_loss=0.2893, pruned_loss=0.08978, over 21289.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3566, pruned_loss=0.1349, over 4273187.38 frames. ], batch size: 143, lr: 2.56e-02, grad_scale: 64.0 2023-06-18 05:25:11,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=132120.0, ans=0.0 2023-06-18 05:25:30,645 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.717e+02 4.714e+02 6.490e+02 1.072e+03, threshold=9.428e+02, percent-clipped=6.0 2023-06-18 05:25:32,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=132180.0, ans=0.0 2023-06-18 05:26:08,211 INFO [train.py:996] (1/4) Epoch 1, batch 22050, loss[loss=0.3041, simple_loss=0.347, pruned_loss=0.1306, over 21335.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3639, pruned_loss=0.138, over 4266707.34 frames. 
], batch size: 194, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:26:08,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132300.0, ans=0.0 2023-06-18 05:26:26,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=132300.0, ans=0.07 2023-06-18 05:26:30,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-18 05:26:52,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=132420.0, ans=0.125 2023-06-18 05:27:11,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=132480.0, ans=0.95 2023-06-18 05:27:19,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=132480.0, ans=0.125 2023-06-18 05:27:21,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=132540.0, ans=0.2 2023-06-18 05:27:48,446 INFO [train.py:996] (1/4) Epoch 1, batch 22100, loss[loss=0.3363, simple_loss=0.3748, pruned_loss=0.1489, over 21933.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3754, pruned_loss=0.1451, over 4260548.20 frames. ], batch size: 316, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:28:25,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132660.0, ans=0.1 2023-06-18 05:28:54,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 4.025e+02 4.912e+02 6.450e+02 1.246e+03, threshold=9.825e+02, percent-clipped=3.0 2023-06-18 05:29:18,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-18 05:29:32,049 INFO [train.py:996] (1/4) Epoch 1, batch 22150, loss[loss=0.3047, simple_loss=0.3625, pruned_loss=0.1235, over 21847.00 frames. ], tot_loss[loss=0.34, simple_loss=0.3812, pruned_loss=0.1494, over 4263000.82 frames. ], batch size: 332, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:29:44,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=132900.0, ans=0.0 2023-06-18 05:30:21,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=133020.0, ans=0.125 2023-06-18 05:30:30,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-18 05:31:13,343 INFO [train.py:996] (1/4) Epoch 1, batch 22200, loss[loss=0.3081, simple_loss=0.3763, pruned_loss=0.1199, over 21309.00 frames. ], tot_loss[loss=0.344, simple_loss=0.386, pruned_loss=0.151, over 4267453.41 frames. 
], batch size: 144, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:32:05,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=133320.0, ans=0.2 2023-06-18 05:32:16,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.029e+02 4.889e+02 6.211e+02 1.093e+03, threshold=9.779e+02, percent-clipped=2.0 2023-06-18 05:32:17,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=133380.0, ans=0.0 2023-06-18 05:32:25,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=133380.0, ans=0.0 2023-06-18 05:32:59,476 INFO [train.py:996] (1/4) Epoch 1, batch 22250, loss[loss=0.3748, simple_loss=0.4224, pruned_loss=0.1636, over 21753.00 frames. ], tot_loss[loss=0.3512, simple_loss=0.3941, pruned_loss=0.1541, over 4272194.06 frames. ], batch size: 298, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:33:05,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.04 vs. limit=6.0 2023-06-18 05:34:17,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.80 vs. limit=6.0 2023-06-18 05:34:39,734 INFO [train.py:996] (1/4) Epoch 1, batch 22300, loss[loss=0.3555, simple_loss=0.3827, pruned_loss=0.1642, over 21468.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.3968, pruned_loss=0.158, over 4279067.18 frames. ], batch size: 194, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:35:04,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=133860.0, ans=0.5 2023-06-18 05:35:37,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.602e+02 4.280e+02 5.421e+02 8.254e+02, threshold=8.559e+02, percent-clipped=0.0 2023-06-18 05:35:46,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=133980.0, ans=0.0 2023-06-18 05:36:20,522 INFO [train.py:996] (1/4) Epoch 1, batch 22350, loss[loss=0.3505, simple_loss=0.3855, pruned_loss=0.1577, over 21449.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.3939, pruned_loss=0.1584, over 4290723.64 frames. 
], batch size: 131, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:36:29,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=134100.0, ans=0.125 2023-06-18 05:36:37,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=134160.0, ans=0.09899494936611666 2023-06-18 05:36:57,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=134220.0, ans=0.5 2023-06-18 05:37:04,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134220.0, ans=0.1 2023-06-18 05:37:35,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=134340.0, ans=0.0 2023-06-18 05:37:57,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=134340.0, ans=0.0 2023-06-18 05:38:03,434 INFO [train.py:996] (1/4) Epoch 1, batch 22400, loss[loss=0.3084, simple_loss=0.3449, pruned_loss=0.1359, over 21776.00 frames. ], tot_loss[loss=0.3467, simple_loss=0.3884, pruned_loss=0.1525, over 4287856.66 frames. ], batch size: 124, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:38:08,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=134400.0, ans=0.125 2023-06-18 05:38:13,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134400.0, ans=0.1 2023-06-18 05:39:02,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=134580.0, ans=0.0 2023-06-18 05:39:07,172 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.472e+02 4.159e+02 5.652e+02 9.879e+02, threshold=8.318e+02, percent-clipped=2.0 2023-06-18 05:39:39,881 INFO [train.py:996] (1/4) Epoch 1, batch 22450, loss[loss=0.29, simple_loss=0.317, pruned_loss=0.1315, over 21449.00 frames. ], tot_loss[loss=0.3403, simple_loss=0.3802, pruned_loss=0.1502, over 4273269.79 frames. ], batch size: 554, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:40:05,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134760.0, ans=0.1 2023-06-18 05:41:00,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=134880.0, ans=0.125 2023-06-18 05:41:26,124 INFO [train.py:996] (1/4) Epoch 1, batch 22500, loss[loss=0.286, simple_loss=0.3157, pruned_loss=0.1281, over 20717.00 frames. ], tot_loss[loss=0.3354, simple_loss=0.3744, pruned_loss=0.1482, over 4264514.30 frames. 
], batch size: 607, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:41:41,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=135000.0, ans=0.125 2023-06-18 05:42:40,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.574e+02 4.487e+02 5.410e+02 9.033e+02, threshold=8.975e+02, percent-clipped=2.0 2023-06-18 05:42:48,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=135180.0, ans=0.0 2023-06-18 05:43:09,812 INFO [train.py:996] (1/4) Epoch 1, batch 22550, loss[loss=0.3277, simple_loss=0.3687, pruned_loss=0.1434, over 21651.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3804, pruned_loss=0.1493, over 4267534.30 frames. ], batch size: 263, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:43:34,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135360.0, ans=0.125 2023-06-18 05:43:44,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=135360.0, ans=0.0 2023-06-18 05:43:48,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-18 05:44:10,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135420.0, ans=0.1 2023-06-18 05:44:27,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135480.0, ans=0.1 2023-06-18 05:44:56,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.69 vs. limit=10.0 2023-06-18 05:44:59,031 INFO [train.py:996] (1/4) Epoch 1, batch 22600, loss[loss=0.2933, simple_loss=0.3321, pruned_loss=0.1272, over 21822.00 frames. ], tot_loss[loss=0.3439, simple_loss=0.3857, pruned_loss=0.1511, over 4270705.71 frames. ], batch size: 102, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:45:00,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=135600.0, ans=0.125 2023-06-18 05:45:22,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=135660.0, ans=10.0 2023-06-18 05:45:56,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=135720.0, ans=0.125 2023-06-18 05:45:56,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=135720.0, ans=0.2 2023-06-18 05:46:01,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=135780.0, ans=0.125 2023-06-18 05:46:07,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 4.199e+02 5.117e+02 6.564e+02 1.237e+03, threshold=1.023e+03, percent-clipped=4.0 2023-06-18 05:46:32,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=135840.0, ans=0.2 2023-06-18 05:46:38,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.60 vs. 
limit=10.0 2023-06-18 05:46:39,942 INFO [train.py:996] (1/4) Epoch 1, batch 22650, loss[loss=0.3899, simple_loss=0.4698, pruned_loss=0.1551, over 20807.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.3804, pruned_loss=0.1492, over 4266291.90 frames. ], batch size: 607, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:47:12,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=135960.0, ans=0.125 2023-06-18 05:47:45,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=136080.0, ans=0.125 2023-06-18 05:48:18,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=136200.0, ans=0.2 2023-06-18 05:48:18,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=136200.0, ans=0.1 2023-06-18 05:48:20,092 INFO [train.py:996] (1/4) Epoch 1, batch 22700, loss[loss=0.2723, simple_loss=0.3108, pruned_loss=0.1169, over 21555.00 frames. ], tot_loss[loss=0.3338, simple_loss=0.3728, pruned_loss=0.1474, over 4267915.16 frames. ], batch size: 247, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:48:45,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=136260.0, ans=0.125 2023-06-18 05:49:18,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.58 vs. limit=12.0 2023-06-18 05:49:24,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.785e+02 4.714e+02 6.670e+02 1.093e+03, threshold=9.427e+02, percent-clipped=5.0 2023-06-18 05:49:26,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=136380.0, ans=0.0 2023-06-18 05:49:57,043 INFO [train.py:996] (1/4) Epoch 1, batch 22750, loss[loss=0.393, simple_loss=0.4237, pruned_loss=0.1811, over 21478.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3758, pruned_loss=0.1506, over 4272243.01 frames. ], batch size: 131, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:50:35,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=15.0 2023-06-18 05:50:59,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=136680.0, ans=0.125 2023-06-18 05:51:20,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=136740.0, ans=0.2 2023-06-18 05:51:25,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=136740.0, ans=0.0 2023-06-18 05:51:39,022 INFO [train.py:996] (1/4) Epoch 1, batch 22800, loss[loss=0.3692, simple_loss=0.3965, pruned_loss=0.171, over 21588.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3815, pruned_loss=0.1549, over 4273332.37 frames. ], batch size: 230, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:51:54,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. 
limit=15.0 2023-06-18 05:51:58,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=136860.0, ans=0.1 2023-06-18 05:52:02,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=22.5 2023-06-18 05:52:31,348 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:52:47,245 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 4.256e+02 5.590e+02 8.268e+02 1.334e+03, threshold=1.118e+03, percent-clipped=16.0 2023-06-18 05:53:12,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=137040.0, ans=0.125 2023-06-18 05:53:20,658 INFO [train.py:996] (1/4) Epoch 1, batch 22850, loss[loss=0.2961, simple_loss=0.3371, pruned_loss=0.1276, over 21909.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.377, pruned_loss=0.1533, over 4282887.20 frames. ], batch size: 107, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:53:21,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137100.0, ans=0.1 2023-06-18 05:53:21,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-18 05:54:24,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=137280.0, ans=0.2 2023-06-18 05:54:40,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=137340.0, ans=0.125 2023-06-18 05:54:42,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-18 05:54:47,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=137340.0, ans=0.125 2023-06-18 05:55:09,105 INFO [train.py:996] (1/4) Epoch 1, batch 22900, loss[loss=0.3168, simple_loss=0.3925, pruned_loss=0.1205, over 21685.00 frames. ], tot_loss[loss=0.3414, simple_loss=0.3778, pruned_loss=0.1525, over 4267822.95 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:55:09,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=137400.0, ans=0.125 2023-06-18 05:55:21,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137400.0, ans=0.1 2023-06-18 05:55:23,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.83 vs. limit=6.0 2023-06-18 05:55:42,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. limit=10.0 2023-06-18 05:55:49,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. 
limit=12.0 2023-06-18 05:56:12,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-18 05:56:13,234 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.811e+02 3.596e+02 4.279e+02 5.382e+02 9.756e+02, threshold=8.557e+02, percent-clipped=0.0 2023-06-18 05:56:41,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=137640.0, ans=0.2 2023-06-18 05:56:42,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-18 05:56:52,901 INFO [train.py:996] (1/4) Epoch 1, batch 22950, loss[loss=0.2663, simple_loss=0.3503, pruned_loss=0.09113, over 21206.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3882, pruned_loss=0.1497, over 4260856.36 frames. ], batch size: 159, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:56:56,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=137700.0, ans=0.07 2023-06-18 05:57:15,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=19.22 vs. limit=22.5 2023-06-18 05:57:26,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=137760.0, ans=0.0 2023-06-18 05:57:42,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=137820.0, ans=0.125 2023-06-18 05:58:29,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=137940.0, ans=0.2 2023-06-18 05:58:29,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=137940.0, ans=0.125 2023-06-18 05:58:34,346 INFO [train.py:996] (1/4) Epoch 1, batch 23000, loss[loss=0.3048, simple_loss=0.3475, pruned_loss=0.1311, over 21493.00 frames. ], tot_loss[loss=0.3394, simple_loss=0.387, pruned_loss=0.1459, over 4268778.80 frames. ], batch size: 195, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:58:34,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=138000.0, ans=0.125 2023-06-18 05:59:12,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=138060.0, ans=0.0 2023-06-18 05:59:13,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=138060.0, ans=0.125 2023-06-18 05:59:42,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.447e+02 4.093e+02 5.344e+02 1.227e+03, threshold=8.186e+02, percent-clipped=4.0 2023-06-18 06:00:09,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=138240.0, ans=0.125 2023-06-18 06:00:15,821 INFO [train.py:996] (1/4) Epoch 1, batch 23050, loss[loss=0.3294, simple_loss=0.3431, pruned_loss=0.1578, over 20371.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.3895, pruned_loss=0.149, over 4271766.34 frames. 
], batch size: 703, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:00:17,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=138300.0, ans=0.125 2023-06-18 06:00:38,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=138300.0, ans=0.0 2023-06-18 06:01:42,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=138540.0, ans=0.125 2023-06-18 06:01:50,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-18 06:02:02,467 INFO [train.py:996] (1/4) Epoch 1, batch 23100, loss[loss=0.3481, simple_loss=0.3668, pruned_loss=0.1647, over 21533.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3847, pruned_loss=0.1496, over 4273661.48 frames. ], batch size: 441, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:02:23,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=138660.0, ans=0.125 2023-06-18 06:02:28,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-18 06:02:48,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=138720.0, ans=0.0 2023-06-18 06:03:11,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.616e+02 4.226e+02 5.778e+02 1.152e+03, threshold=8.452e+02, percent-clipped=7.0 2023-06-18 06:03:37,791 INFO [train.py:996] (1/4) Epoch 1, batch 23150, loss[loss=0.3113, simple_loss=0.3468, pruned_loss=0.1379, over 21407.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3785, pruned_loss=0.1481, over 4272287.53 frames. ], batch size: 176, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:03:59,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=138900.0, ans=0.125 2023-06-18 06:04:17,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=138960.0, ans=0.125 2023-06-18 06:05:23,898 INFO [train.py:996] (1/4) Epoch 1, batch 23200, loss[loss=0.3161, simple_loss=0.3538, pruned_loss=0.1392, over 21919.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3779, pruned_loss=0.149, over 4282279.65 frames. ], batch size: 283, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:06:05,828 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:06:26,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.602e+02 4.092e+02 5.263e+02 8.445e+02, threshold=8.184e+02, percent-clipped=0.0 2023-06-18 06:06:58,366 INFO [train.py:996] (1/4) Epoch 1, batch 23250, loss[loss=0.3247, simple_loss=0.3767, pruned_loss=0.1364, over 21978.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3786, pruned_loss=0.1503, over 4292954.85 frames. 
], batch size: 113, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:07:14,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=139500.0, ans=0.1 2023-06-18 06:07:26,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-18 06:08:52,722 INFO [train.py:996] (1/4) Epoch 1, batch 23300, loss[loss=0.3768, simple_loss=0.4468, pruned_loss=0.1534, over 21457.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3883, pruned_loss=0.1533, over 4291746.68 frames. ], batch size: 211, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:09:01,245 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:09:05,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=15.0 2023-06-18 06:09:57,999 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.912e+02 5.511e+02 7.628e+02 1.360e+03, threshold=1.102e+03, percent-clipped=20.0 2023-06-18 06:10:14,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=139980.0, ans=0.5 2023-06-18 06:10:31,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=140040.0, ans=15.0 2023-06-18 06:10:38,101 INFO [train.py:996] (1/4) Epoch 1, batch 23350, loss[loss=0.3454, simple_loss=0.3956, pruned_loss=0.1476, over 21613.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3943, pruned_loss=0.1531, over 4281388.77 frames. ], batch size: 389, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:10:39,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-18 06:11:24,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140220.0, ans=0.1 2023-06-18 06:11:30,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140220.0, ans=0.0 2023-06-18 06:11:39,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140280.0, ans=0.0 2023-06-18 06:11:47,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=140280.0, ans=0.125 2023-06-18 06:12:19,181 INFO [train.py:996] (1/4) Epoch 1, batch 23400, loss[loss=0.3265, simple_loss=0.3682, pruned_loss=0.1424, over 21153.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3835, pruned_loss=0.1454, over 4283915.22 frames. ], batch size: 608, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:12:29,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=140400.0, ans=0.0 2023-06-18 06:12:32,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=140400.0, ans=0.95 2023-06-18 06:13:08,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.34 vs. 
limit=6.0 2023-06-18 06:13:27,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.226e+02 4.219e+02 5.285e+02 8.873e+02, threshold=8.438e+02, percent-clipped=0.0 2023-06-18 06:13:37,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-18 06:13:52,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=140640.0, ans=0.0 2023-06-18 06:14:00,434 INFO [train.py:996] (1/4) Epoch 1, batch 23450, loss[loss=0.3813, simple_loss=0.4132, pruned_loss=0.1747, over 21571.00 frames. ], tot_loss[loss=0.3433, simple_loss=0.3865, pruned_loss=0.15, over 4279943.01 frames. ], batch size: 414, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:14:03,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-18 06:14:40,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140760.0, ans=0.1 2023-06-18 06:14:51,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=140820.0, ans=0.125 2023-06-18 06:14:58,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=140880.0, ans=0.0 2023-06-18 06:15:26,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=140940.0, ans=0.0 2023-06-18 06:15:28,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=140940.0, ans=0.1 2023-06-18 06:15:41,732 INFO [train.py:996] (1/4) Epoch 1, batch 23500, loss[loss=0.2989, simple_loss=0.3074, pruned_loss=0.1452, over 20078.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3864, pruned_loss=0.1521, over 4279985.88 frames. ], batch size: 704, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:16:13,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-18 06:16:23,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-18 06:16:35,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-18 06:16:43,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=141180.0, ans=0.125 2023-06-18 06:16:49,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.727e+02 4.969e+02 6.081e+02 9.256e+02, threshold=9.939e+02, percent-clipped=2.0 2023-06-18 06:17:13,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=15.0 2023-06-18 06:17:22,053 INFO [train.py:996] (1/4) Epoch 1, batch 23550, loss[loss=0.2922, simple_loss=0.3273, pruned_loss=0.1286, over 21543.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3811, pruned_loss=0.1514, over 4284493.54 frames. 
], batch size: 231, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:17:24,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=141300.0, ans=0.125 2023-06-18 06:17:56,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=141360.0, ans=0.125 2023-06-18 06:19:01,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=141540.0, ans=0.125 2023-06-18 06:19:05,107 INFO [train.py:996] (1/4) Epoch 1, batch 23600, loss[loss=0.3883, simple_loss=0.4198, pruned_loss=0.1784, over 21675.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.383, pruned_loss=0.1522, over 4281766.61 frames. ], batch size: 351, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:19:37,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=141660.0, ans=0.2 2023-06-18 06:20:01,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=141720.0, ans=0.0 2023-06-18 06:20:14,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-18 06:20:21,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.501e+02 3.688e+02 4.463e+02 5.931e+02 8.627e+02, threshold=8.927e+02, percent-clipped=0.0 2023-06-18 06:20:59,607 INFO [train.py:996] (1/4) Epoch 1, batch 23650, loss[loss=0.3665, simple_loss=0.4116, pruned_loss=0.1607, over 21744.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3806, pruned_loss=0.1484, over 4275697.47 frames. ], batch size: 441, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:22:33,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=142140.0, ans=0.125 2023-06-18 06:22:33,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=142140.0, ans=0.125 2023-06-18 06:22:42,918 INFO [train.py:996] (1/4) Epoch 1, batch 23700, loss[loss=0.3357, simple_loss=0.3832, pruned_loss=0.1441, over 21738.00 frames. ], tot_loss[loss=0.3387, simple_loss=0.3833, pruned_loss=0.1471, over 4274481.16 frames. ], batch size: 298, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:22:43,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=142200.0, ans=0.125 2023-06-18 06:23:53,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.745e+02 4.445e+02 5.198e+02 9.027e+02, threshold=8.891e+02, percent-clipped=1.0 2023-06-18 06:24:12,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=142440.0, ans=0.0 2023-06-18 06:24:32,856 INFO [train.py:996] (1/4) Epoch 1, batch 23750, loss[loss=0.3836, simple_loss=0.4332, pruned_loss=0.167, over 21408.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3868, pruned_loss=0.1485, over 4277306.93 frames. 
], batch size: 507, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:25:04,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142560.0, ans=0.1 2023-06-18 06:25:46,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142680.0, ans=0.1 2023-06-18 06:26:17,119 INFO [train.py:996] (1/4) Epoch 1, batch 23800, loss[loss=0.3761, simple_loss=0.433, pruned_loss=0.1596, over 21613.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3848, pruned_loss=0.145, over 4280271.66 frames. ], batch size: 230, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:27:13,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=142920.0, ans=0.0 2023-06-18 06:27:27,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.344e+02 4.883e+02 6.088e+02 1.077e+03, threshold=9.766e+02, percent-clipped=8.0 2023-06-18 06:28:06,737 INFO [train.py:996] (1/4) Epoch 1, batch 23850, loss[loss=0.3874, simple_loss=0.426, pruned_loss=0.1744, over 21466.00 frames. ], tot_loss[loss=0.3467, simple_loss=0.395, pruned_loss=0.1492, over 4278019.23 frames. ], batch size: 131, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:28:18,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=143100.0, ans=0.0 2023-06-18 06:28:23,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=143160.0, ans=0.07 2023-06-18 06:28:33,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=143160.0, ans=0.125 2023-06-18 06:28:51,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=143220.0, ans=0.2 2023-06-18 06:29:19,135 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.85 vs. limit=10.0 2023-06-18 06:29:19,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=143280.0, ans=0.02 2023-06-18 06:29:41,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=143340.0, ans=0.07 2023-06-18 06:29:45,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=143340.0, ans=0.125 2023-06-18 06:29:48,576 INFO [train.py:996] (1/4) Epoch 1, batch 23900, loss[loss=0.3761, simple_loss=0.4065, pruned_loss=0.1728, over 21601.00 frames. ], tot_loss[loss=0.3543, simple_loss=0.4033, pruned_loss=0.1527, over 4272185.80 frames. 
], batch size: 414, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:29:53,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=143400.0, ans=0.025 2023-06-18 06:30:03,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=143400.0, ans=0.125 2023-06-18 06:30:07,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=143400.0, ans=0.1 2023-06-18 06:30:54,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=143580.0, ans=0.0 2023-06-18 06:30:56,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.496e+02 3.761e+02 4.724e+02 6.134e+02 1.060e+03, threshold=9.448e+02, percent-clipped=2.0 2023-06-18 06:31:11,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=143640.0, ans=0.0 2023-06-18 06:31:30,058 INFO [train.py:996] (1/4) Epoch 1, batch 23950, loss[loss=0.3133, simple_loss=0.364, pruned_loss=0.1313, over 21811.00 frames. ], tot_loss[loss=0.3489, simple_loss=0.3948, pruned_loss=0.1515, over 4263723.95 frames. ], batch size: 118, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:32:49,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=143940.0, ans=0.0 2023-06-18 06:33:05,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-18 06:33:13,941 INFO [train.py:996] (1/4) Epoch 1, batch 24000, loss[loss=0.3674, simple_loss=0.413, pruned_loss=0.1609, over 21651.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.3979, pruned_loss=0.1564, over 4265504.37 frames. ], batch size: 351, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:33:13,942 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 06:33:36,582 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.32, simple_loss=0.4122, pruned_loss=0.1139, over 1796401.00 frames. 2023-06-18 06:33:36,583 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 06:34:00,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=144060.0, ans=0.0 2023-06-18 06:34:02,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=144060.0, ans=0.125 2023-06-18 06:34:17,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=144060.0, ans=0.0 2023-06-18 06:34:48,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.687e+02 4.611e+02 5.908e+02 1.149e+03, threshold=9.222e+02, percent-clipped=2.0 2023-06-18 06:34:55,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=144180.0, ans=0.125 2023-06-18 06:35:20,177 INFO [train.py:996] (1/4) Epoch 1, batch 24050, loss[loss=0.2579, simple_loss=0.3291, pruned_loss=0.09334, over 21619.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.3967, pruned_loss=0.1546, over 4269035.26 frames. 
], batch size: 230, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:36:25,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=144480.0, ans=0.125 2023-06-18 06:36:25,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=144480.0, ans=0.2 2023-06-18 06:36:26,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.70 vs. limit=15.0 2023-06-18 06:37:07,455 INFO [train.py:996] (1/4) Epoch 1, batch 24100, loss[loss=0.3665, simple_loss=0.4202, pruned_loss=0.1564, over 20721.00 frames. ], tot_loss[loss=0.3488, simple_loss=0.3958, pruned_loss=0.1509, over 4268674.22 frames. ], batch size: 607, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:37:42,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144660.0, ans=0.1 2023-06-18 06:38:12,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.321e+02 4.048e+02 5.410e+02 1.299e+03, threshold=8.096e+02, percent-clipped=1.0 2023-06-18 06:38:13,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=144780.0, ans=0.1 2023-06-18 06:38:49,121 INFO [train.py:996] (1/4) Epoch 1, batch 24150, loss[loss=0.2954, simple_loss=0.3362, pruned_loss=0.1272, over 21165.00 frames. ], tot_loss[loss=0.3522, simple_loss=0.3964, pruned_loss=0.1539, over 4271187.83 frames. ], batch size: 608, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:38:57,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=144900.0, ans=0.2 2023-06-18 06:39:03,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.96 vs. limit=6.0 2023-06-18 06:40:31,830 INFO [train.py:996] (1/4) Epoch 1, batch 24200, loss[loss=0.3601, simple_loss=0.4143, pruned_loss=0.153, over 21709.00 frames. ], tot_loss[loss=0.3561, simple_loss=0.3998, pruned_loss=0.1563, over 4276374.47 frames. ], batch size: 351, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:40:37,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.35 vs. limit=22.5 2023-06-18 06:40:43,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=145200.0, ans=0.125 2023-06-18 06:41:11,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.77 vs. 
limit=6.0 2023-06-18 06:41:40,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=145380.0, ans=0.2 2023-06-18 06:41:49,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 3.664e+02 4.494e+02 5.781e+02 1.168e+03, threshold=8.988e+02, percent-clipped=4.0 2023-06-18 06:41:53,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=145380.0, ans=0.125 2023-06-18 06:42:20,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=145500.0, ans=0.0 2023-06-18 06:42:21,370 INFO [train.py:996] (1/4) Epoch 1, batch 24250, loss[loss=0.2902, simple_loss=0.366, pruned_loss=0.1072, over 21404.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.3945, pruned_loss=0.1462, over 4273361.55 frames. ], batch size: 194, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:42:37,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=145500.0, ans=0.1 2023-06-18 06:43:07,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-18 06:43:32,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-06-18 06:44:01,976 INFO [train.py:996] (1/4) Epoch 1, batch 24300, loss[loss=0.1914, simple_loss=0.2655, pruned_loss=0.05863, over 21631.00 frames. ], tot_loss[loss=0.3317, simple_loss=0.3871, pruned_loss=0.1382, over 4267857.81 frames. ], batch size: 263, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:44:02,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=145800.0, ans=0.125 2023-06-18 06:45:08,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=145980.0, ans=0.125 2023-06-18 06:45:13,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 3.046e+02 3.863e+02 5.440e+02 1.504e+03, threshold=7.726e+02, percent-clipped=4.0 2023-06-18 06:45:17,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=145980.0, ans=0.125 2023-06-18 06:45:43,151 INFO [train.py:996] (1/4) Epoch 1, batch 24350, loss[loss=0.432, simple_loss=0.4451, pruned_loss=0.2094, over 21588.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.382, pruned_loss=0.1385, over 4280354.45 frames. ], batch size: 471, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:45:53,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=146100.0, ans=0.0 2023-06-18 06:45:57,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146100.0, ans=0.1 2023-06-18 06:46:07,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=146160.0, ans=0.0 2023-06-18 06:46:54,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. 
limit=22.5 2023-06-18 06:47:32,112 INFO [train.py:996] (1/4) Epoch 1, batch 24400, loss[loss=0.2923, simple_loss=0.3316, pruned_loss=0.1265, over 21781.00 frames. ], tot_loss[loss=0.3385, simple_loss=0.3881, pruned_loss=0.1444, over 4281603.42 frames. ], batch size: 102, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:47:34,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=146400.0, ans=0.2 2023-06-18 06:48:25,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=146520.0, ans=0.2 2023-06-18 06:48:30,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=146580.0, ans=0.0 2023-06-18 06:48:44,668 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 4.218e+02 5.437e+02 7.202e+02 1.402e+03, threshold=1.087e+03, percent-clipped=21.0 2023-06-18 06:48:48,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=146580.0, ans=0.1 2023-06-18 06:49:14,771 INFO [train.py:996] (1/4) Epoch 1, batch 24450, loss[loss=0.2487, simple_loss=0.3139, pruned_loss=0.09177, over 21313.00 frames. ], tot_loss[loss=0.3417, simple_loss=0.3911, pruned_loss=0.1462, over 4276355.01 frames. ], batch size: 160, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:49:46,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=146760.0, ans=0.125 2023-06-18 06:50:23,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=146880.0, ans=0.125 2023-06-18 06:50:25,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=146880.0, ans=0.125 2023-06-18 06:50:25,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=146880.0, ans=0.1 2023-06-18 06:50:27,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-18 06:50:42,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-18 06:50:44,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-18 06:50:46,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=146940.0, ans=0.125 2023-06-18 06:50:56,343 INFO [train.py:996] (1/4) Epoch 1, batch 24500, loss[loss=0.361, simple_loss=0.4005, pruned_loss=0.1607, over 21881.00 frames. ], tot_loss[loss=0.3384, simple_loss=0.3883, pruned_loss=0.1442, over 4284794.72 frames. 
], batch size: 371, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:51:09,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=147000.0, ans=0.0 2023-06-18 06:51:54,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=147120.0, ans=0.0 2023-06-18 06:52:14,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.850e+02 5.028e+02 6.051e+02 9.604e+02, threshold=1.006e+03, percent-clipped=0.0 2023-06-18 06:52:25,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-06-18 06:52:44,322 INFO [train.py:996] (1/4) Epoch 1, batch 24550, loss[loss=0.4438, simple_loss=0.4624, pruned_loss=0.2126, over 21457.00 frames. ], tot_loss[loss=0.3437, simple_loss=0.3915, pruned_loss=0.1479, over 4285405.42 frames. ], batch size: 471, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:53:42,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=147480.0, ans=0.125 2023-06-18 06:53:59,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=22.5 2023-06-18 06:54:16,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=147540.0, ans=0.2 2023-06-18 06:54:26,335 INFO [train.py:996] (1/4) Epoch 1, batch 24600, loss[loss=0.3609, simple_loss=0.3796, pruned_loss=0.1711, over 21540.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.3873, pruned_loss=0.1491, over 4285787.89 frames. ], batch size: 441, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:54:41,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147600.0, ans=0.125 2023-06-18 06:55:32,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=147780.0, ans=0.125 2023-06-18 06:55:34,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147780.0, ans=0.1 2023-06-18 06:55:38,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.603e+02 4.230e+02 5.450e+02 1.074e+03, threshold=8.460e+02, percent-clipped=1.0 2023-06-18 06:56:08,477 INFO [train.py:996] (1/4) Epoch 1, batch 24650, loss[loss=0.3294, simple_loss=0.3582, pruned_loss=0.1503, over 21588.00 frames. ], tot_loss[loss=0.3369, simple_loss=0.3791, pruned_loss=0.1473, over 4278459.41 frames. ], batch size: 298, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:56:21,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-18 06:56:31,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=147960.0, ans=0.0 2023-06-18 06:56:50,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=148020.0, ans=0.0 2023-06-18 06:57:50,923 INFO [train.py:996] (1/4) Epoch 1, batch 24700, loss[loss=0.295, simple_loss=0.3388, pruned_loss=0.1256, over 21218.00 frames. 
], tot_loss[loss=0.3315, simple_loss=0.3753, pruned_loss=0.1439, over 4273465.20 frames. ], batch size: 159, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:58:20,670 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-18 06:58:23,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=148260.0, ans=0.025 2023-06-18 06:58:38,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148320.0, ans=0.125 2023-06-18 06:59:03,007 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.227e+02 3.816e+02 4.904e+02 7.765e+02, threshold=7.633e+02, percent-clipped=0.0 2023-06-18 06:59:12,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.34 vs. limit=10.0 2023-06-18 06:59:32,553 INFO [train.py:996] (1/4) Epoch 1, batch 24750, loss[loss=0.2533, simple_loss=0.2971, pruned_loss=0.1047, over 21434.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3682, pruned_loss=0.1398, over 4273094.06 frames. ], batch size: 195, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 06:59:32,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=148500.0, ans=0.0 2023-06-18 07:00:02,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=148560.0, ans=0.125 2023-06-18 07:00:23,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-18 07:00:25,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=148620.0, ans=0.035 2023-06-18 07:00:25,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=148620.0, ans=0.05 2023-06-18 07:00:59,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-18 07:01:14,210 INFO [train.py:996] (1/4) Epoch 1, batch 24800, loss[loss=0.3801, simple_loss=0.3849, pruned_loss=0.1877, over 21567.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3644, pruned_loss=0.1397, over 4270708.66 frames. ], batch size: 508, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:01:26,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-18 07:01:30,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.69 vs. 
limit=22.5 2023-06-18 07:01:42,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=148860.0, ans=0.125 2023-06-18 07:02:27,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.589e+02 4.591e+02 5.888e+02 8.855e+02, threshold=9.183e+02, percent-clipped=11.0 2023-06-18 07:02:27,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=148980.0, ans=0.2 2023-06-18 07:02:56,544 INFO [train.py:996] (1/4) Epoch 1, batch 24850, loss[loss=0.2659, simple_loss=0.3031, pruned_loss=0.1144, over 21233.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3659, pruned_loss=0.142, over 4276781.75 frames. ], batch size: 159, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:03:41,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=149220.0, ans=0.0 2023-06-18 07:04:39,611 INFO [train.py:996] (1/4) Epoch 1, batch 24900, loss[loss=0.3818, simple_loss=0.4167, pruned_loss=0.1735, over 21836.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3695, pruned_loss=0.1427, over 4275208.08 frames. ], batch size: 282, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:04:40,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=149400.0, ans=0.0 2023-06-18 07:04:45,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=149400.0, ans=0.0 2023-06-18 07:04:47,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=149400.0, ans=0.0 2023-06-18 07:05:01,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=149460.0, ans=0.0 2023-06-18 07:05:10,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-18 07:05:41,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=149580.0, ans=0.125 2023-06-18 07:05:53,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.811e+02 4.758e+02 6.118e+02 1.056e+03, threshold=9.515e+02, percent-clipped=2.0 2023-06-18 07:06:15,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149640.0, ans=0.1 2023-06-18 07:06:23,733 INFO [train.py:996] (1/4) Epoch 1, batch 24950, loss[loss=0.4021, simple_loss=0.4332, pruned_loss=0.1855, over 21370.00 frames. ], tot_loss[loss=0.3406, simple_loss=0.3803, pruned_loss=0.1504, over 4279484.64 frames. ], batch size: 549, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:06:35,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=149700.0, ans=0.125 2023-06-18 07:07:12,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=149820.0, ans=0.125 2023-06-18 07:07:16,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=149820.0, ans=0.2 2023-06-18 07:08:08,487 INFO [train.py:996] (1/4) Epoch 1, batch 25000, loss[loss=0.3189, simple_loss=0.359, pruned_loss=0.1394, over 21637.00 frames. 
], tot_loss[loss=0.3437, simple_loss=0.3846, pruned_loss=0.1514, over 4280178.28 frames. ], batch size: 298, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:08:27,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150000.0, ans=0.125 2023-06-18 07:08:30,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=150060.0, ans=0.125 2023-06-18 07:08:50,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=150120.0, ans=0.125 2023-06-18 07:08:51,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=150120.0, ans=0.0 2023-06-18 07:09:07,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=150120.0, ans=0.0 2023-06-18 07:09:27,596 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.657e+02 3.412e+02 4.030e+02 5.230e+02 1.013e+03, threshold=8.059e+02, percent-clipped=2.0 2023-06-18 07:09:47,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=150240.0, ans=0.0 2023-06-18 07:09:57,188 INFO [train.py:996] (1/4) Epoch 1, batch 25050, loss[loss=0.315, simple_loss=0.3448, pruned_loss=0.1426, over 21380.00 frames. ], tot_loss[loss=0.3389, simple_loss=0.378, pruned_loss=0.1499, over 4271259.55 frames. ], batch size: 160, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:10:26,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=150360.0, ans=0.025 2023-06-18 07:10:39,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=150420.0, ans=0.0 2023-06-18 07:11:15,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=150480.0, ans=0.125 2023-06-18 07:11:34,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-18 07:11:40,506 INFO [train.py:996] (1/4) Epoch 1, batch 25100, loss[loss=0.3814, simple_loss=0.4258, pruned_loss=0.1685, over 21459.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3737, pruned_loss=0.1482, over 4270047.57 frames. ], batch size: 508, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:12:05,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=150660.0, ans=0.125 2023-06-18 07:12:52,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.624e+02 4.936e+02 6.636e+02 1.221e+03, threshold=9.872e+02, percent-clipped=16.0 2023-06-18 07:13:16,003 INFO [train.py:996] (1/4) Epoch 1, batch 25150, loss[loss=0.2717, simple_loss=0.3386, pruned_loss=0.1024, over 21895.00 frames. ], tot_loss[loss=0.3312, simple_loss=0.3743, pruned_loss=0.144, over 4262356.98 frames. 
], batch size: 98, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:14:03,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=151020.0, ans=0.2 2023-06-18 07:14:14,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=151020.0, ans=0.07 2023-06-18 07:14:21,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=151080.0, ans=0.125 2023-06-18 07:14:45,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=151140.0, ans=0.1 2023-06-18 07:14:56,920 INFO [train.py:996] (1/4) Epoch 1, batch 25200, loss[loss=0.3153, simple_loss=0.3817, pruned_loss=0.1245, over 21819.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3719, pruned_loss=0.1399, over 4246259.50 frames. ], batch size: 316, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:15:07,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-18 07:16:14,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 3.238e+02 4.132e+02 5.215e+02 8.390e+02, threshold=8.263e+02, percent-clipped=0.0 2023-06-18 07:16:38,470 INFO [train.py:996] (1/4) Epoch 1, batch 25250, loss[loss=0.3345, simple_loss=0.3674, pruned_loss=0.1508, over 21551.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3699, pruned_loss=0.1379, over 4242917.62 frames. ], batch size: 414, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:17:02,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=151560.0, ans=0.125 2023-06-18 07:17:25,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151620.0, ans=0.1 2023-06-18 07:18:21,293 INFO [train.py:996] (1/4) Epoch 1, batch 25300, loss[loss=0.3769, simple_loss=0.4191, pruned_loss=0.1673, over 21214.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3677, pruned_loss=0.1382, over 4241896.24 frames. ], batch size: 143, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:18:44,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=151860.0, ans=0.125 2023-06-18 07:18:46,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=151860.0, ans=0.125 2023-06-18 07:18:46,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151860.0, ans=0.1 2023-06-18 07:18:58,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151860.0, ans=0.1 2023-06-18 07:19:02,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-18 07:19:12,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. 
limit=12.0 2023-06-18 07:19:28,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=151980.0, ans=0.125 2023-06-18 07:19:39,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.569e+02 4.461e+02 5.778e+02 9.355e+02, threshold=8.922e+02, percent-clipped=5.0 2023-06-18 07:19:49,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=152040.0, ans=0.2 2023-06-18 07:20:01,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=152040.0, ans=0.0 2023-06-18 07:20:03,966 INFO [train.py:996] (1/4) Epoch 1, batch 25350, loss[loss=0.3085, simple_loss=0.3599, pruned_loss=0.1285, over 21755.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3709, pruned_loss=0.1382, over 4250771.16 frames. ], batch size: 124, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:20:06,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=152100.0, ans=0.04949747468305833 2023-06-18 07:20:19,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152100.0, ans=0.0 2023-06-18 07:20:56,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=152220.0, ans=0.2 2023-06-18 07:21:17,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-18 07:21:39,281 INFO [train.py:996] (1/4) Epoch 1, batch 25400, loss[loss=0.3099, simple_loss=0.3494, pruned_loss=0.1353, over 21691.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3672, pruned_loss=0.1371, over 4248848.71 frames. ], batch size: 282, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:22:56,311 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 3.535e+02 4.232e+02 5.710e+02 1.225e+03, threshold=8.465e+02, percent-clipped=5.0 2023-06-18 07:23:20,781 INFO [train.py:996] (1/4) Epoch 1, batch 25450, loss[loss=0.3188, simple_loss=0.3655, pruned_loss=0.136, over 21908.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3688, pruned_loss=0.1395, over 4253102.64 frames. ], batch size: 333, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:23:24,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=152700.0, ans=0.125 2023-06-18 07:23:44,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=152760.0, ans=0.0 2023-06-18 07:23:46,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=152760.0, ans=0.1 2023-06-18 07:24:03,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=152820.0, ans=0.0 2023-06-18 07:24:13,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=152820.0, ans=0.125 2023-06-18 07:24:52,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. 
limit=15.0 2023-06-18 07:25:04,150 INFO [train.py:996] (1/4) Epoch 1, batch 25500, loss[loss=0.3046, simple_loss=0.3809, pruned_loss=0.1141, over 21454.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3683, pruned_loss=0.1345, over 4261405.80 frames. ], batch size: 471, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:25:47,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153120.0, ans=0.1 2023-06-18 07:26:15,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-18 07:26:22,533 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.487e+02 4.551e+02 5.429e+02 1.003e+03, threshold=9.102e+02, percent-clipped=2.0 2023-06-18 07:26:52,227 INFO [train.py:996] (1/4) Epoch 1, batch 25550, loss[loss=0.2719, simple_loss=0.3483, pruned_loss=0.09771, over 21632.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3748, pruned_loss=0.134, over 4267156.32 frames. ], batch size: 263, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:26:54,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=153300.0, ans=0.125 2023-06-18 07:27:06,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=153300.0, ans=0.0 2023-06-18 07:27:53,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=153480.0, ans=0.125 2023-06-18 07:27:57,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=153480.0, ans=0.125 2023-06-18 07:28:34,439 INFO [train.py:996] (1/4) Epoch 1, batch 25600, loss[loss=0.401, simple_loss=0.4382, pruned_loss=0.1819, over 21619.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3801, pruned_loss=0.136, over 4276188.60 frames. ], batch size: 389, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:29:12,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153720.0, ans=0.1 2023-06-18 07:29:36,689 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.502e+02 4.172e+02 4.983e+02 8.051e+02, threshold=8.344e+02, percent-clipped=0.0 2023-06-18 07:30:02,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=153840.0, ans=0.125 2023-06-18 07:30:10,881 INFO [train.py:996] (1/4) Epoch 1, batch 25650, loss[loss=0.2911, simple_loss=0.3286, pruned_loss=0.1267, over 21217.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3827, pruned_loss=0.1398, over 4275306.06 frames. ], batch size: 159, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:31:00,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.12 vs. limit=22.5 2023-06-18 07:31:20,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=154080.0, ans=0.2 2023-06-18 07:31:26,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.87 vs. 
limit=15.0 2023-06-18 07:31:46,140 INFO [train.py:996] (1/4) Epoch 1, batch 25700, loss[loss=0.2352, simple_loss=0.2855, pruned_loss=0.0924, over 15731.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3797, pruned_loss=0.1413, over 4260034.91 frames. ], batch size: 60, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:31:55,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-18 07:32:17,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-18 07:33:00,070 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.737e+02 4.412e+02 5.511e+02 6.649e+02 1.111e+03, threshold=1.102e+03, percent-clipped=12.0 2023-06-18 07:33:29,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=154500.0, ans=0.125 2023-06-18 07:33:30,715 INFO [train.py:996] (1/4) Epoch 1, batch 25750, loss[loss=0.3753, simple_loss=0.4156, pruned_loss=0.1675, over 21521.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.3875, pruned_loss=0.1467, over 4264978.48 frames. ], batch size: 194, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:33:47,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=154500.0, ans=0.0 2023-06-18 07:33:47,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=154500.0, ans=0.125 2023-06-18 07:34:00,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=154560.0, ans=0.1 2023-06-18 07:34:27,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=154620.0, ans=0.125 2023-06-18 07:34:44,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=154680.0, ans=0.0 2023-06-18 07:35:12,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=154740.0, ans=0.015 2023-06-18 07:35:16,170 INFO [train.py:996] (1/4) Epoch 1, batch 25800, loss[loss=0.3874, simple_loss=0.4433, pruned_loss=0.1657, over 20714.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.3988, pruned_loss=0.1525, over 4260937.52 frames. 
], batch size: 607, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:35:16,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=154800.0, ans=0.2 2023-06-18 07:35:21,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=154800.0, ans=0.2 2023-06-18 07:35:26,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=154800.0, ans=0.125 2023-06-18 07:36:19,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=154980.0, ans=0.0 2023-06-18 07:36:28,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.746e+02 4.301e+02 5.401e+02 1.441e+03, threshold=8.601e+02, percent-clipped=2.0 2023-06-18 07:36:28,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=154980.0, ans=0.05 2023-06-18 07:36:53,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-18 07:36:57,492 INFO [train.py:996] (1/4) Epoch 1, batch 25850, loss[loss=0.3567, simple_loss=0.4093, pruned_loss=0.152, over 19991.00 frames. ], tot_loss[loss=0.3519, simple_loss=0.4009, pruned_loss=0.1514, over 4269247.26 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:37:27,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=155160.0, ans=0.0 2023-06-18 07:37:46,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=155220.0, ans=0.0 2023-06-18 07:37:48,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-18 07:38:38,178 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:38:46,014 INFO [train.py:996] (1/4) Epoch 1, batch 25900, loss[loss=0.3266, simple_loss=0.3935, pruned_loss=0.1299, over 21378.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.4034, pruned_loss=0.1528, over 4274846.27 frames. ], batch size: 211, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:39:10,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155460.0, ans=0.1 2023-06-18 07:39:16,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=155460.0, ans=0.0 2023-06-18 07:39:43,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=155520.0, ans=0.07 2023-06-18 07:39:55,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=15.0 2023-06-18 07:39:58,792 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.678e+02 3.705e+02 4.419e+02 5.739e+02 1.257e+03, threshold=8.839e+02, percent-clipped=5.0 2023-06-18 07:40:07,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=155640.0, ans=0.125 2023-06-18 07:40:13,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=155640.0, ans=0.125 2023-06-18 07:40:28,218 INFO [train.py:996] (1/4) Epoch 1, batch 25950, loss[loss=0.4225, simple_loss=0.4556, pruned_loss=0.1947, over 21697.00 frames. ], tot_loss[loss=0.3591, simple_loss=0.4078, pruned_loss=0.1551, over 4278029.92 frames. ], batch size: 441, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:40:48,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=22.5 2023-06-18 07:42:01,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=155940.0, ans=0.1 2023-06-18 07:42:10,840 INFO [train.py:996] (1/4) Epoch 1, batch 26000, loss[loss=0.4256, simple_loss=0.4593, pruned_loss=0.196, over 21554.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.4062, pruned_loss=0.1522, over 4267941.16 frames. ], batch size: 414, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:42:11,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=156000.0, ans=0.125 2023-06-18 07:42:18,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.52 vs. limit=15.0 2023-06-18 07:42:24,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=156000.0, ans=0.0 2023-06-18 07:42:37,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=156060.0, ans=0.1 2023-06-18 07:43:19,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=156180.0, ans=0.05 2023-06-18 07:43:27,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.511e+02 4.125e+02 5.678e+02 8.372e+02, threshold=8.249e+02, percent-clipped=0.0 2023-06-18 07:43:40,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:51,419 INFO [train.py:996] (1/4) Epoch 1, batch 26050, loss[loss=0.2825, simple_loss=0.4076, pruned_loss=0.07875, over 19773.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.4069, pruned_loss=0.1541, over 4266532.28 frames. 
], batch size: 702, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:43:58,097 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:44:23,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=156360.0, ans=0.125 2023-06-18 07:44:34,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=156360.0, ans=0.125 2023-06-18 07:44:37,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=156420.0, ans=0.2 2023-06-18 07:45:08,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=156480.0, ans=0.0 2023-06-18 07:45:19,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-18 07:45:31,160 INFO [train.py:996] (1/4) Epoch 1, batch 26100, loss[loss=0.3646, simple_loss=0.3975, pruned_loss=0.1658, over 21908.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.4021, pruned_loss=0.1541, over 4268582.70 frames. ], batch size: 332, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:46:08,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-18 07:46:35,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-18 07:46:45,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=156780.0, ans=0.125 2023-06-18 07:46:48,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.762e+02 4.665e+02 5.349e+02 1.153e+03, threshold=9.330e+02, percent-clipped=6.0 2023-06-18 07:46:50,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=156780.0, ans=0.0 2023-06-18 07:47:12,610 INFO [train.py:996] (1/4) Epoch 1, batch 26150, loss[loss=0.3797, simple_loss=0.429, pruned_loss=0.1652, over 21810.00 frames. ], tot_loss[loss=0.3539, simple_loss=0.398, pruned_loss=0.155, over 4275117.32 frames. ], batch size: 118, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:48:02,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157020.0, ans=0.1 2023-06-18 07:48:39,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=157140.0, ans=0.125 2023-06-18 07:48:56,346 INFO [train.py:996] (1/4) Epoch 1, batch 26200, loss[loss=0.3038, simple_loss=0.3854, pruned_loss=0.1111, over 20727.00 frames. ], tot_loss[loss=0.3525, simple_loss=0.3987, pruned_loss=0.1532, over 4276532.87 frames. 
], batch size: 608, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:50:09,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.383e+02 4.279e+02 5.483e+02 1.348e+03, threshold=8.558e+02, percent-clipped=4.0 2023-06-18 07:50:12,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=157380.0, ans=0.125 2023-06-18 07:50:50,261 INFO [train.py:996] (1/4) Epoch 1, batch 26250, loss[loss=0.2638, simple_loss=0.3179, pruned_loss=0.1048, over 16474.00 frames. ], tot_loss[loss=0.3498, simple_loss=0.4003, pruned_loss=0.1497, over 4273357.03 frames. ], batch size: 60, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:50:56,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=157500.0, ans=0.2 2023-06-18 07:51:20,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.22 vs. limit=6.0 2023-06-18 07:51:27,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=157620.0, ans=0.0 2023-06-18 07:51:43,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=157680.0, ans=0.125 2023-06-18 07:51:48,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=157680.0, ans=0.0 2023-06-18 07:51:58,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=157740.0, ans=0.0 2023-06-18 07:52:05,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=157740.0, ans=0.125 2023-06-18 07:52:31,374 INFO [train.py:996] (1/4) Epoch 1, batch 26300, loss[loss=0.3351, simple_loss=0.3759, pruned_loss=0.1471, over 21882.00 frames. ], tot_loss[loss=0.3482, simple_loss=0.3959, pruned_loss=0.1502, over 4285427.52 frames. ], batch size: 124, lr: 2.36e-02, grad_scale: 64.0 2023-06-18 07:52:33,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=157800.0, ans=0.0 2023-06-18 07:52:43,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-18 07:52:57,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=157860.0, ans=0.125 2023-06-18 07:53:38,864 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.639e+02 4.284e+02 5.347e+02 9.355e+02, threshold=8.568e+02, percent-clipped=1.0 2023-06-18 07:54:09,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.05 vs. limit=6.0 2023-06-18 07:54:13,133 INFO [train.py:996] (1/4) Epoch 1, batch 26350, loss[loss=0.388, simple_loss=0.4327, pruned_loss=0.1717, over 21378.00 frames. ], tot_loss[loss=0.3502, simple_loss=0.3958, pruned_loss=0.1523, over 4286578.24 frames. 
], batch size: 131, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:54:29,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=158100.0, ans=0.0 2023-06-18 07:54:29,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=158100.0, ans=0.2 2023-06-18 07:54:31,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-18 07:54:39,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-18 07:54:51,864 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:55:29,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=158280.0, ans=0.0 2023-06-18 07:55:55,489 INFO [train.py:996] (1/4) Epoch 1, batch 26400, loss[loss=0.2993, simple_loss=0.3351, pruned_loss=0.1318, over 21627.00 frames. ], tot_loss[loss=0.3449, simple_loss=0.388, pruned_loss=0.1509, over 4282994.99 frames. ], batch size: 298, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:55:56,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=158400.0, ans=0.0 2023-06-18 07:56:12,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=158400.0, ans=0.125 2023-06-18 07:56:39,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=158520.0, ans=0.1 2023-06-18 07:56:44,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=158520.0, ans=0.0 2023-06-18 07:56:55,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-18 07:57:16,072 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.627e+02 4.358e+02 5.298e+02 1.261e+03, threshold=8.716e+02, percent-clipped=4.0 2023-06-18 07:57:44,267 INFO [train.py:996] (1/4) Epoch 1, batch 26450, loss[loss=0.3743, simple_loss=0.475, pruned_loss=0.1368, over 21195.00 frames. ], tot_loss[loss=0.3432, simple_loss=0.3867, pruned_loss=0.1499, over 4281335.04 frames. ], batch size: 549, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 07:58:16,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=158820.0, ans=0.125 2023-06-18 07:58:28,223 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:58:57,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.12 vs. limit=5.0 2023-06-18 07:59:01,965 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:59:28,068 INFO [train.py:996] (1/4) Epoch 1, batch 26500, loss[loss=0.2346, simple_loss=0.2727, pruned_loss=0.09822, over 21742.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.389, pruned_loss=0.1473, over 4270085.51 frames. 
], batch size: 124, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 08:00:22,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=159120.0, ans=0.125 2023-06-18 08:00:49,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.803e+02 4.749e+02 6.034e+02 1.314e+03, threshold=9.498e+02, percent-clipped=6.0 2023-06-18 08:01:05,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=159240.0, ans=0.0 2023-06-18 08:01:13,483 INFO [train.py:996] (1/4) Epoch 1, batch 26550, loss[loss=0.264, simple_loss=0.3309, pruned_loss=0.09856, over 21711.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3842, pruned_loss=0.1423, over 4262843.55 frames. ], batch size: 247, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 08:01:44,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159360.0, ans=0.1 2023-06-18 08:02:32,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=159480.0, ans=0.0 2023-06-18 08:02:33,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=159480.0, ans=0.125 2023-06-18 08:02:38,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=159540.0, ans=0.125 2023-06-18 08:02:49,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-18 08:03:00,174 INFO [train.py:996] (1/4) Epoch 1, batch 26600, loss[loss=0.3454, simple_loss=0.3859, pruned_loss=0.1524, over 21798.00 frames. ], tot_loss[loss=0.3276, simple_loss=0.3806, pruned_loss=0.1372, over 4259819.68 frames. ], batch size: 371, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:03:57,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=159780.0, ans=0.125 2023-06-18 08:03:59,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.24 vs. limit=15.0 2023-06-18 08:04:08,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.521e+02 4.224e+02 5.242e+02 1.118e+03, threshold=8.449e+02, percent-clipped=1.0 2023-06-18 08:04:30,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=159840.0, ans=0.0 2023-06-18 08:04:36,025 INFO [train.py:996] (1/4) Epoch 1, batch 26650, loss[loss=0.2315, simple_loss=0.3056, pruned_loss=0.07871, over 21732.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3739, pruned_loss=0.1361, over 4256059.81 frames. 
], batch size: 282, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:04:42,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=159900.0, ans=0.125 2023-06-18 08:04:49,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=159900.0, ans=0.04949747468305833 2023-06-18 08:04:57,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=159960.0, ans=0.125 2023-06-18 08:06:13,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=160140.0, ans=0.125 2023-06-18 08:06:16,319 INFO [train.py:996] (1/4) Epoch 1, batch 26700, loss[loss=0.2936, simple_loss=0.3414, pruned_loss=0.1229, over 21204.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3646, pruned_loss=0.1304, over 4261486.76 frames. ], batch size: 159, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:06:53,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=160260.0, ans=0.0 2023-06-18 08:07:12,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=160320.0, ans=0.2 2023-06-18 08:07:19,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=160380.0, ans=0.125 2023-06-18 08:07:22,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=160380.0, ans=0.125 2023-06-18 08:07:23,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.935e+02 3.510e+02 4.681e+02 9.206e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-18 08:08:03,378 INFO [train.py:996] (1/4) Epoch 1, batch 26750, loss[loss=0.3168, simple_loss=0.3772, pruned_loss=0.1282, over 21790.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3658, pruned_loss=0.1301, over 4270877.58 frames. ], batch size: 282, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:08:22,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-18 08:08:49,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=160620.0, ans=0.025 2023-06-18 08:09:37,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=160740.0, ans=0.125 2023-06-18 08:09:52,190 INFO [train.py:996] (1/4) Epoch 1, batch 26800, loss[loss=0.3363, simple_loss=0.3823, pruned_loss=0.1452, over 21639.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3778, pruned_loss=0.1389, over 4274147.28 frames. 
], batch size: 230, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:09:57,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=160800.0, ans=0.0 2023-06-18 08:10:18,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=160860.0, ans=0.0 2023-06-18 08:10:21,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=160860.0, ans=0.0 2023-06-18 08:10:33,089 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:10:59,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=160980.0, ans=0.125 2023-06-18 08:11:01,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.549e+02 4.364e+02 5.200e+02 1.402e+03, threshold=8.728e+02, percent-clipped=9.0 2023-06-18 08:11:09,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-18 08:11:27,756 INFO [train.py:996] (1/4) Epoch 1, batch 26850, loss[loss=0.2832, simple_loss=0.3263, pruned_loss=0.1201, over 21400.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3807, pruned_loss=0.1436, over 4270242.78 frames. ], batch size: 131, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:11:50,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=161160.0, ans=0.0 2023-06-18 08:12:51,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-18 08:13:02,116 INFO [train.py:996] (1/4) Epoch 1, batch 26900, loss[loss=0.2687, simple_loss=0.3178, pruned_loss=0.1098, over 21367.00 frames. ], tot_loss[loss=0.328, simple_loss=0.3713, pruned_loss=0.1424, over 4268870.08 frames. ], batch size: 131, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:14:06,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 3.385e+02 4.142e+02 4.911e+02 9.199e+02, threshold=8.284e+02, percent-clipped=1.0 2023-06-18 08:14:25,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=161640.0, ans=0.2 2023-06-18 08:14:37,077 INFO [train.py:996] (1/4) Epoch 1, batch 26950, loss[loss=0.3148, simple_loss=0.3856, pruned_loss=0.122, over 21618.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3704, pruned_loss=0.1427, over 4265790.92 frames. ], batch size: 230, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:16:10,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=161940.0, ans=0.0 2023-06-18 08:16:13,288 INFO [train.py:996] (1/4) Epoch 1, batch 27000, loss[loss=0.2926, simple_loss=0.3661, pruned_loss=0.1095, over 19781.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3711, pruned_loss=0.1385, over 4271482.27 frames. ], batch size: 702, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:16:13,289 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 08:16:29,104 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.2828, simple_loss=0.3784, pruned_loss=0.09358, over 1796401.00 frames. 
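A note on the ScheduledFloat entries in this log: each one reports the current value ("ans") of a hyperparameter that is scheduled as a function of batch_count. The Python sketch below is only a minimal illustration of such a piecewise-linear schedule; it is not the scaling.py implementation, and the function name and breakpoint values are hypothetical, chosen solely to mimic the shape of the logged values.

# Illustrative sketch of a piecewise-linear hyperparameter schedule keyed on
# batch_count, in the spirit of the "ScheduledFloat ... batch_count=..., ans=..."
# entries above. Not the scaling.py implementation; breakpoints are hypothetical.
def scheduled_value(batch_count, points):
    """points: sorted (batch_count, value) breakpoints defining the schedule."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            # Linear interpolation between neighbouring breakpoints.
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# Hypothetical example: a skip-rate that starts at 0.5 and decays to 0.0 by
# batch 4000; far into training (e.g. batch_count=158280.0) it stays at 0.0,
# consistent with the "ff3_skip_rate ... ans=0.0" entries logged above.
ff3_skip_rate = scheduled_value(158280.0, [(0.0, 0.5), (4000.0, 0.0)])
print(ff3_skip_rate)  # 0.0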
2023-06-18 08:16:29,104 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 08:16:42,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162000.0, ans=0.1 2023-06-18 08:16:42,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=162000.0, ans=0.1 2023-06-18 08:16:42,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=162000.0, ans=0.2 2023-06-18 08:16:57,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=162060.0, ans=0.125 2023-06-18 08:17:39,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.208e+02 3.737e+02 4.814e+02 7.556e+02, threshold=7.473e+02, percent-clipped=0.0 2023-06-18 08:18:01,096 INFO [train.py:996] (1/4) Epoch 1, batch 27050, loss[loss=0.3121, simple_loss=0.3627, pruned_loss=0.1308, over 21234.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3737, pruned_loss=0.1347, over 4275244.69 frames. ], batch size: 159, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:19:29,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=162540.0, ans=0.0 2023-06-18 08:19:37,654 INFO [train.py:996] (1/4) Epoch 1, batch 27100, loss[loss=0.3232, simple_loss=0.402, pruned_loss=0.1222, over 21806.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3758, pruned_loss=0.137, over 4280991.96 frames. ], batch size: 332, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:20:47,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=162780.0, ans=0.125 2023-06-18 08:20:52,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.734e+02 4.835e+02 6.632e+02 1.398e+03, threshold=9.671e+02, percent-clipped=18.0 2023-06-18 08:21:10,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-18 08:21:14,208 INFO [train.py:996] (1/4) Epoch 1, batch 27150, loss[loss=0.3912, simple_loss=0.4525, pruned_loss=0.165, over 21727.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3879, pruned_loss=0.14, over 4284095.99 frames. ], batch size: 414, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:21:46,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=162960.0, ans=0.0 2023-06-18 08:21:52,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=162960.0, ans=0.04949747468305833 2023-06-18 08:22:19,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=23.14 vs. limit=22.5 2023-06-18 08:22:55,442 INFO [train.py:996] (1/4) Epoch 1, batch 27200, loss[loss=0.3524, simple_loss=0.4041, pruned_loss=0.1503, over 21750.00 frames. ], tot_loss[loss=0.3396, simple_loss=0.3947, pruned_loss=0.1423, over 4280751.46 frames. 
], batch size: 247, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:23:14,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=163200.0, ans=0.125 2023-06-18 08:23:27,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=163260.0, ans=0.125 2023-06-18 08:24:05,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.595e+02 4.705e+02 6.129e+02 1.080e+03, threshold=9.409e+02, percent-clipped=7.0 2023-06-18 08:24:28,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=163440.0, ans=0.0 2023-06-18 08:24:39,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=163440.0, ans=0.0 2023-06-18 08:24:41,922 INFO [train.py:996] (1/4) Epoch 1, batch 27250, loss[loss=0.3072, simple_loss=0.3496, pruned_loss=0.1324, over 21872.00 frames. ], tot_loss[loss=0.3498, simple_loss=0.4004, pruned_loss=0.1496, over 4280733.76 frames. ], batch size: 98, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:25:08,015 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:25:09,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=163560.0, ans=0.125 2023-06-18 08:26:20,974 INFO [train.py:996] (1/4) Epoch 1, batch 27300, loss[loss=0.3189, simple_loss=0.3915, pruned_loss=0.1231, over 21821.00 frames. ], tot_loss[loss=0.3509, simple_loss=0.4016, pruned_loss=0.1501, over 4278462.58 frames. ], batch size: 282, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:26:53,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163860.0, ans=0.125 2023-06-18 08:27:07,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=163920.0, ans=0.1 2023-06-18 08:27:15,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=163920.0, ans=0.125 2023-06-18 08:27:28,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=163980.0, ans=0.125 2023-06-18 08:27:36,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.630e+02 3.615e+02 4.138e+02 5.244e+02 1.044e+03, threshold=8.277e+02, percent-clipped=1.0 2023-06-18 08:27:42,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-18 08:27:57,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-18 08:28:02,465 INFO [train.py:996] (1/4) Epoch 1, batch 27350, loss[loss=0.3736, simple_loss=0.419, pruned_loss=0.164, over 21698.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.4071, pruned_loss=0.1528, over 4279450.33 frames. ], batch size: 389, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:28:14,242 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. 
limit=15.0 2023-06-18 08:28:27,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=164160.0, ans=0.125 2023-06-18 08:28:56,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.30 vs. limit=6.0 2023-06-18 08:28:57,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=164220.0, ans=0.04949747468305833 2023-06-18 08:28:59,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-18 08:29:14,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=164280.0, ans=10.0 2023-06-18 08:29:35,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.64 vs. limit=22.5 2023-06-18 08:29:37,991 INFO [train.py:996] (1/4) Epoch 1, batch 27400, loss[loss=0.3107, simple_loss=0.3506, pruned_loss=0.1354, over 21803.00 frames. ], tot_loss[loss=0.3533, simple_loss=0.4019, pruned_loss=0.1524, over 4277128.13 frames. ], batch size: 112, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:29:43,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=164400.0, ans=0.125 2023-06-18 08:29:54,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=164460.0, ans=0.0 2023-06-18 08:30:15,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.30 vs. limit=10.0 2023-06-18 08:30:34,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=164520.0, ans=0.1 2023-06-18 08:30:47,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.591e+02 4.552e+02 5.428e+02 9.216e+02, threshold=9.104e+02, percent-clipped=2.0 2023-06-18 08:31:13,686 INFO [train.py:996] (1/4) Epoch 1, batch 27450, loss[loss=0.3224, simple_loss=0.3817, pruned_loss=0.1315, over 21253.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.3946, pruned_loss=0.1501, over 4273305.71 frames. ], batch size: 548, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:31:25,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.74 vs. limit=5.0 2023-06-18 08:31:57,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=164820.0, ans=0.125 2023-06-18 08:32:08,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-18 08:32:40,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-18 08:32:49,996 INFO [train.py:996] (1/4) Epoch 1, batch 27500, loss[loss=0.2906, simple_loss=0.3212, pruned_loss=0.13, over 20304.00 frames. ], tot_loss[loss=0.3451, simple_loss=0.3915, pruned_loss=0.1493, over 4279728.19 frames. 
], batch size: 702, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:33:03,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-18 08:33:39,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=165120.0, ans=0.0 2023-06-18 08:33:43,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=165120.0, ans=0.125 2023-06-18 08:34:03,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.309e+02 3.875e+02 5.024e+02 1.518e+03, threshold=7.749e+02, percent-clipped=3.0 2023-06-18 08:34:10,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-18 08:34:24,888 INFO [train.py:996] (1/4) Epoch 1, batch 27550, loss[loss=0.2757, simple_loss=0.3433, pruned_loss=0.104, over 21712.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3861, pruned_loss=0.1458, over 4285025.89 frames. ], batch size: 332, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:35:11,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=165420.0, ans=0.09899494936611666 2023-06-18 08:35:22,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=165420.0, ans=0.0 2023-06-18 08:35:44,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. limit=10.0 2023-06-18 08:35:59,087 INFO [train.py:996] (1/4) Epoch 1, batch 27600, loss[loss=0.3008, simple_loss=0.3462, pruned_loss=0.1277, over 21613.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3776, pruned_loss=0.1434, over 4272302.90 frames. ], batch size: 332, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:36:49,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=165720.0, ans=0.125 2023-06-18 08:37:06,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.205e+02 4.118e+02 5.523e+02 1.130e+03, threshold=8.236e+02, percent-clipped=6.0 2023-06-18 08:37:30,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=165840.0, ans=0.0 2023-06-18 08:37:32,568 INFO [train.py:996] (1/4) Epoch 1, batch 27650, loss[loss=0.2957, simple_loss=0.332, pruned_loss=0.1297, over 21396.00 frames. ], tot_loss[loss=0.3263, simple_loss=0.3698, pruned_loss=0.1414, over 4267624.34 frames. ], batch size: 177, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:37:36,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. 
limit=15.0 2023-06-18 08:37:46,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=165960.0, ans=0.1 2023-06-18 08:38:27,290 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:38:38,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=166080.0, ans=0.125 2023-06-18 08:39:06,503 INFO [train.py:996] (1/4) Epoch 1, batch 27700, loss[loss=0.2511, simple_loss=0.3132, pruned_loss=0.09446, over 21332.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3681, pruned_loss=0.1377, over 4270539.54 frames. ], batch size: 176, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:39:20,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=166260.0, ans=0.125 2023-06-18 08:40:20,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.610e+02 4.452e+02 5.999e+02 1.124e+03, threshold=8.903e+02, percent-clipped=7.0 2023-06-18 08:40:41,456 INFO [train.py:996] (1/4) Epoch 1, batch 27750, loss[loss=0.2448, simple_loss=0.3123, pruned_loss=0.08868, over 21304.00 frames. ], tot_loss[loss=0.3257, simple_loss=0.3739, pruned_loss=0.1387, over 4271749.14 frames. ], batch size: 159, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:40:59,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=166560.0, ans=0.2 2023-06-18 08:41:09,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166560.0, ans=0.1 2023-06-18 08:41:29,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=166620.0, ans=0.125 2023-06-18 08:41:43,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=166680.0, ans=0.125 2023-06-18 08:42:16,136 INFO [train.py:996] (1/4) Epoch 1, batch 27800, loss[loss=0.3449, simple_loss=0.3843, pruned_loss=0.1528, over 21749.00 frames. ], tot_loss[loss=0.3263, simple_loss=0.373, pruned_loss=0.1399, over 4279211.41 frames. ], batch size: 389, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:42:35,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:43:11,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=166980.0, ans=0.2 2023-06-18 08:43:12,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=166980.0, ans=0.2 2023-06-18 08:43:20,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.396e+02 4.185e+02 5.590e+02 8.815e+02, threshold=8.371e+02, percent-clipped=0.0 2023-06-18 08:43:46,587 INFO [train.py:996] (1/4) Epoch 1, batch 27850, loss[loss=0.3045, simple_loss=0.3416, pruned_loss=0.1337, over 21253.00 frames. ], tot_loss[loss=0.3285, simple_loss=0.3729, pruned_loss=0.1421, over 4291073.95 frames. 
], batch size: 608, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:43:58,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167100.0, ans=0.125 2023-06-18 08:44:32,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=167220.0, ans=0.0 2023-06-18 08:45:06,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-18 08:45:09,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=167340.0, ans=0.1 2023-06-18 08:45:15,241 INFO [train.py:996] (1/4) Epoch 1, batch 27900, loss[loss=0.314, simple_loss=0.3911, pruned_loss=0.1184, over 21714.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3852, pruned_loss=0.1453, over 4290237.52 frames. ], batch size: 247, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:45:38,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=167400.0, ans=0.2 2023-06-18 08:45:39,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=167460.0, ans=0.125 2023-06-18 08:45:43,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=167460.0, ans=0.2 2023-06-18 08:46:20,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=167580.0, ans=0.125 2023-06-18 08:46:26,447 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.916e+02 4.626e+02 5.836e+02 1.013e+03, threshold=9.252e+02, percent-clipped=5.0 2023-06-18 08:46:58,158 INFO [train.py:996] (1/4) Epoch 1, batch 27950, loss[loss=0.3407, simple_loss=0.409, pruned_loss=0.1362, over 21280.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3823, pruned_loss=0.1387, over 4280130.24 frames. ], batch size: 549, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:47:08,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167700.0, ans=0.125 2023-06-18 08:47:15,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=167700.0, ans=0.0 2023-06-18 08:47:19,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=167760.0, ans=0.2 2023-06-18 08:48:02,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=167880.0, ans=0.025 2023-06-18 08:48:35,919 INFO [train.py:996] (1/4) Epoch 1, batch 28000, loss[loss=0.3761, simple_loss=0.411, pruned_loss=0.1706, over 21588.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3782, pruned_loss=0.1344, over 4274934.50 frames. 
], batch size: 471, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:48:43,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168000.0, ans=0.125 2023-06-18 08:48:45,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=168000.0, ans=0.125 2023-06-18 08:48:49,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-18 08:49:15,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=168120.0, ans=0.04949747468305833 2023-06-18 08:49:35,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.441e+02 4.640e+02 5.582e+02 1.043e+03, threshold=9.281e+02, percent-clipped=2.0 2023-06-18 08:50:11,360 INFO [train.py:996] (1/4) Epoch 1, batch 28050, loss[loss=0.2871, simple_loss=0.3448, pruned_loss=0.1147, over 21844.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3779, pruned_loss=0.1388, over 4279698.70 frames. ], batch size: 316, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:50:38,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=168360.0, ans=0.0 2023-06-18 08:50:50,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=168420.0, ans=0.125 2023-06-18 08:50:56,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=168420.0, ans=0.125 2023-06-18 08:51:19,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=168480.0, ans=0.125 2023-06-18 08:51:30,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=168540.0, ans=0.0 2023-06-18 08:51:41,513 INFO [train.py:996] (1/4) Epoch 1, batch 28100, loss[loss=0.3837, simple_loss=0.3936, pruned_loss=0.1869, over 21367.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3762, pruned_loss=0.1396, over 4271728.29 frames. ], batch size: 508, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:52:33,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168780.0, ans=0.1 2023-06-18 08:52:51,340 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.651e+02 4.541e+02 5.753e+02 9.912e+02, threshold=9.083e+02, percent-clipped=1.0 2023-06-18 08:53:01,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=168840.0, ans=0.0 2023-06-18 08:53:08,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=168840.0, ans=0.125 2023-06-18 08:53:10,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168840.0, ans=0.1 2023-06-18 08:53:16,358 INFO [train.py:996] (1/4) Epoch 1, batch 28150, loss[loss=0.3526, simple_loss=0.3756, pruned_loss=0.1648, over 21865.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3699, pruned_loss=0.1401, over 4264457.31 frames. 
], batch size: 107, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:53:57,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=169020.0, ans=0.125 2023-06-18 08:54:35,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=169140.0, ans=0.125 2023-06-18 08:54:53,285 INFO [train.py:996] (1/4) Epoch 1, batch 28200, loss[loss=0.3048, simple_loss=0.3479, pruned_loss=0.1309, over 21850.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3671, pruned_loss=0.1418, over 4269834.71 frames. ], batch size: 317, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:55:14,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=169260.0, ans=0.125 2023-06-18 08:56:04,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.823e+02 5.073e+02 6.497e+02 1.031e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-18 08:56:28,461 INFO [train.py:996] (1/4) Epoch 1, batch 28250, loss[loss=0.3794, simple_loss=0.4647, pruned_loss=0.147, over 19674.00 frames. ], tot_loss[loss=0.3325, simple_loss=0.3731, pruned_loss=0.1459, over 4268056.29 frames. ], batch size: 702, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:56:33,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=169500.0, ans=0.2 2023-06-18 08:56:54,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=169560.0, ans=0.0 2023-06-18 08:56:58,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=169620.0, ans=0.1 2023-06-18 08:57:06,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=169620.0, ans=0.125 2023-06-18 08:57:22,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=169620.0, ans=0.2 2023-06-18 08:57:32,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=169680.0, ans=0.125 2023-06-18 08:57:41,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-18 08:57:51,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=169740.0, ans=10.0 2023-06-18 08:58:00,555 INFO [train.py:996] (1/4) Epoch 1, batch 28300, loss[loss=0.2616, simple_loss=0.325, pruned_loss=0.09912, over 21260.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3699, pruned_loss=0.1417, over 4256454.07 frames. 
], batch size: 159, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:58:54,215 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:59:00,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=169980.0, ans=0.125 2023-06-18 08:59:11,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.420e+02 4.323e+02 5.538e+02 1.121e+03, threshold=8.647e+02, percent-clipped=1.0 2023-06-18 08:59:13,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=169980.0, ans=0.125 2023-06-18 08:59:25,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=170040.0, ans=0.0 2023-06-18 08:59:31,049 INFO [train.py:996] (1/4) Epoch 1, batch 28350, loss[loss=0.2933, simple_loss=0.3405, pruned_loss=0.123, over 21789.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3632, pruned_loss=0.1326, over 4252733.93 frames. ], batch size: 317, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:59:51,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-18 09:00:02,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170160.0, ans=0.1 2023-06-18 09:00:45,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=170280.0, ans=0.125 2023-06-18 09:01:07,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=170400.0, ans=0.05 2023-06-18 09:01:09,033 INFO [train.py:996] (1/4) Epoch 1, batch 28400, loss[loss=0.3738, simple_loss=0.3986, pruned_loss=0.1745, over 21605.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3604, pruned_loss=0.1329, over 4255596.95 frames. ], batch size: 415, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:01:11,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=170400.0, ans=0.125 2023-06-18 09:01:13,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=170400.0, ans=0.2 2023-06-18 09:01:24,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=170460.0, ans=0.0 2023-06-18 09:02:05,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=170520.0, ans=0.125 2023-06-18 09:02:24,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.627e+02 4.521e+02 5.478e+02 1.024e+03, threshold=9.042e+02, percent-clipped=4.0 2023-06-18 09:02:44,590 INFO [train.py:996] (1/4) Epoch 1, batch 28450, loss[loss=0.3766, simple_loss=0.4052, pruned_loss=0.174, over 21778.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3664, pruned_loss=0.1374, over 4262599.24 frames. 
], batch size: 441, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:03:29,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=170820.0, ans=0.0 2023-06-18 09:04:04,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-18 09:04:11,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-18 09:04:20,711 INFO [train.py:996] (1/4) Epoch 1, batch 28500, loss[loss=0.364, simple_loss=0.4153, pruned_loss=0.1564, over 21858.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3708, pruned_loss=0.1418, over 4272595.67 frames. ], batch size: 118, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:04:21,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=171000.0, ans=0.1 2023-06-18 09:04:25,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=171000.0, ans=0.125 2023-06-18 09:05:27,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=171180.0, ans=0.125 2023-06-18 09:05:37,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.668e+02 4.799e+02 6.213e+02 1.260e+03, threshold=9.598e+02, percent-clipped=4.0 2023-06-18 09:05:52,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171240.0, ans=0.125 2023-06-18 09:06:07,519 INFO [train.py:996] (1/4) Epoch 1, batch 28550, loss[loss=0.3823, simple_loss=0.4451, pruned_loss=0.1597, over 21775.00 frames. ], tot_loss[loss=0.3347, simple_loss=0.3798, pruned_loss=0.1448, over 4269428.98 frames. ], batch size: 282, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:06:13,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-18 09:06:21,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=171300.0, ans=0.125 2023-06-18 09:06:28,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-18 09:06:58,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-18 09:07:21,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=171540.0, ans=0.125 2023-06-18 09:07:47,164 INFO [train.py:996] (1/4) Epoch 1, batch 28600, loss[loss=0.3151, simple_loss=0.3667, pruned_loss=0.1317, over 21122.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3877, pruned_loss=0.148, over 4270315.38 frames. 
], batch size: 143, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:07:53,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=171600.0, ans=0.125 2023-06-18 09:08:18,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=171660.0, ans=0.125 2023-06-18 09:08:18,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=171660.0, ans=0.0 2023-06-18 09:08:47,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.320e+02 4.208e+02 5.237e+02 8.981e+02, threshold=8.415e+02, percent-clipped=0.0 2023-06-18 09:09:04,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-18 09:09:22,051 INFO [train.py:996] (1/4) Epoch 1, batch 28650, loss[loss=0.3179, simple_loss=0.3507, pruned_loss=0.1425, over 21597.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3809, pruned_loss=0.1468, over 4275104.01 frames. ], batch size: 247, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:09:34,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=171900.0, ans=0.0 2023-06-18 09:09:41,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=171960.0, ans=0.125 2023-06-18 09:10:13,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=172080.0, ans=0.1 2023-06-18 09:10:25,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=172080.0, ans=0.0 2023-06-18 09:10:58,006 INFO [train.py:996] (1/4) Epoch 1, batch 28700, loss[loss=0.3839, simple_loss=0.4275, pruned_loss=0.1702, over 21820.00 frames. ], tot_loss[loss=0.3401, simple_loss=0.3822, pruned_loss=0.149, over 4271222.78 frames. ], batch size: 124, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:11:07,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=172200.0, ans=0.125 2023-06-18 09:11:22,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172260.0, ans=0.0 2023-06-18 09:11:24,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=172260.0, ans=0.125 2023-06-18 09:12:09,229 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.359e+02 4.224e+02 5.589e+02 9.530e+02, threshold=8.447e+02, percent-clipped=4.0 2023-06-18 09:12:13,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0 2023-06-18 09:12:18,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=172440.0, ans=0.125 2023-06-18 09:12:37,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=172500.0, ans=0.125 2023-06-18 09:12:38,133 INFO [train.py:996] (1/4) Epoch 1, batch 28750, loss[loss=0.2836, simple_loss=0.3427, pruned_loss=0.1122, over 21442.00 frames. 
], tot_loss[loss=0.3405, simple_loss=0.3817, pruned_loss=0.1496, over 4269186.27 frames. ], batch size: 194, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:12:55,723 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:13:15,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-18 09:14:11,452 INFO [train.py:996] (1/4) Epoch 1, batch 28800, loss[loss=0.3935, simple_loss=0.4343, pruned_loss=0.1763, over 21303.00 frames. ], tot_loss[loss=0.3435, simple_loss=0.3862, pruned_loss=0.1504, over 4275140.79 frames. ], batch size: 159, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:14:20,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=172800.0, ans=0.07 2023-06-18 09:14:27,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=172860.0, ans=0.125 2023-06-18 09:14:58,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=172920.0, ans=0.2 2023-06-18 09:15:19,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-18 09:15:25,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.326e+02 4.027e+02 5.437e+02 1.151e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-18 09:15:37,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173040.0, ans=0.1 2023-06-18 09:15:49,259 INFO [train.py:996] (1/4) Epoch 1, batch 28850, loss[loss=0.3351, simple_loss=0.3782, pruned_loss=0.146, over 21907.00 frames. ], tot_loss[loss=0.3449, simple_loss=0.3875, pruned_loss=0.1511, over 4280028.22 frames. ], batch size: 107, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:17:21,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=173340.0, ans=0.0 2023-06-18 09:17:25,740 INFO [train.py:996] (1/4) Epoch 1, batch 28900, loss[loss=0.3614, simple_loss=0.4078, pruned_loss=0.1575, over 21824.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3899, pruned_loss=0.1527, over 4284286.05 frames. 
], batch size: 118, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:17:35,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=173400.0, ans=0.125 2023-06-18 09:17:43,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=173460.0, ans=0.0 2023-06-18 09:17:47,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173460.0, ans=0.1 2023-06-18 09:17:50,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=173460.0, ans=0.015 2023-06-18 09:17:50,407 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:17:52,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=173460.0, ans=10.0 2023-06-18 09:17:54,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=173460.0, ans=0.025 2023-06-18 09:17:54,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=173460.0, ans=0.125 2023-06-18 09:18:17,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.34 vs. limit=6.0 2023-06-18 09:18:38,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.786e+02 4.531e+02 6.034e+02 1.219e+03, threshold=9.062e+02, percent-clipped=7.0 2023-06-18 09:18:38,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=173580.0, ans=0.125 2023-06-18 09:18:58,992 INFO [train.py:996] (1/4) Epoch 1, batch 28950, loss[loss=0.3663, simple_loss=0.4504, pruned_loss=0.1411, over 21188.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3889, pruned_loss=0.1508, over 4282965.37 frames. ], batch size: 548, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:19:44,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173820.0, ans=0.1 2023-06-18 09:20:02,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-18 09:20:06,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-18 09:20:17,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-18 09:20:19,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-18 09:20:30,542 INFO [train.py:996] (1/4) Epoch 1, batch 29000, loss[loss=0.3391, simple_loss=0.3895, pruned_loss=0.1443, over 21384.00 frames. ], tot_loss[loss=0.3461, simple_loss=0.393, pruned_loss=0.1496, over 4278503.18 frames. 
], batch size: 176, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:20:34,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=174000.0, ans=0.125 2023-06-18 09:20:34,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=174000.0, ans=0.1 2023-06-18 09:21:18,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=174120.0, ans=0.035 2023-06-18 09:21:45,737 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.309e+02 4.233e+02 5.463e+02 9.741e+02, threshold=8.465e+02, percent-clipped=3.0 2023-06-18 09:22:05,670 INFO [train.py:996] (1/4) Epoch 1, batch 29050, loss[loss=0.332, simple_loss=0.372, pruned_loss=0.146, over 21852.00 frames. ], tot_loss[loss=0.3458, simple_loss=0.3911, pruned_loss=0.1503, over 4282587.07 frames. ], batch size: 298, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:22:45,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-18 09:22:50,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=174360.0, ans=0.0 2023-06-18 09:23:40,527 INFO [train.py:996] (1/4) Epoch 1, batch 29100, loss[loss=0.2885, simple_loss=0.3312, pruned_loss=0.1229, over 21842.00 frames. ], tot_loss[loss=0.3365, simple_loss=0.3805, pruned_loss=0.1462, over 4285842.95 frames. ], batch size: 107, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:23:43,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=174600.0, ans=0.125 2023-06-18 09:24:21,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174720.0, ans=0.1 2023-06-18 09:24:23,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=174720.0, ans=0.125 2023-06-18 09:24:44,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174780.0, ans=0.125 2023-06-18 09:24:48,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.440e+02 4.060e+02 5.417e+02 8.880e+02, threshold=8.120e+02, percent-clipped=2.0 2023-06-18 09:24:56,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174840.0, ans=0.125 2023-06-18 09:25:17,842 INFO [train.py:996] (1/4) Epoch 1, batch 29150, loss[loss=0.3914, simple_loss=0.4136, pruned_loss=0.1846, over 21415.00 frames. ], tot_loss[loss=0.335, simple_loss=0.3804, pruned_loss=0.1448, over 4282059.74 frames. ], batch size: 508, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:25:33,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=174900.0, ans=0.2 2023-06-18 09:25:37,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-18 09:25:59,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. 
limit=15.0 2023-06-18 09:26:04,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.56 vs. limit=15.0 2023-06-18 09:26:17,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=175080.0, ans=0.0 2023-06-18 09:26:40,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=175140.0, ans=0.125 2023-06-18 09:26:40,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-18 09:26:48,390 INFO [train.py:996] (1/4) Epoch 1, batch 29200, loss[loss=0.3016, simple_loss=0.3322, pruned_loss=0.1355, over 21476.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.376, pruned_loss=0.1431, over 4284949.97 frames. ], batch size: 195, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:27:00,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=15.0 2023-06-18 09:27:50,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.297e+02 4.275e+02 5.517e+02 1.101e+03, threshold=8.550e+02, percent-clipped=8.0 2023-06-18 09:28:15,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=175440.0, ans=0.125 2023-06-18 09:28:24,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=175500.0, ans=0.125 2023-06-18 09:28:25,595 INFO [train.py:996] (1/4) Epoch 1, batch 29250, loss[loss=0.326, simple_loss=0.3905, pruned_loss=0.1307, over 21590.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3748, pruned_loss=0.1405, over 4282826.66 frames. ], batch size: 263, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:28:50,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=175560.0, ans=0.125 2023-06-18 09:29:43,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=175740.0, ans=0.125 2023-06-18 09:30:05,162 INFO [train.py:996] (1/4) Epoch 1, batch 29300, loss[loss=0.3331, simple_loss=0.3912, pruned_loss=0.1375, over 21690.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3764, pruned_loss=0.1395, over 4278648.45 frames. ], batch size: 247, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:30:32,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=175860.0, ans=0.0 2023-06-18 09:31:02,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-18 09:31:07,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.858e+02 5.248e+02 6.398e+02 1.119e+03, threshold=1.050e+03, percent-clipped=2.0 2023-06-18 09:31:37,302 INFO [train.py:996] (1/4) Epoch 1, batch 29350, loss[loss=0.299, simple_loss=0.3288, pruned_loss=0.1346, over 16243.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.372, pruned_loss=0.1385, over 4270989.47 frames. 
], batch size: 64, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:31:39,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=176100.0, ans=0.07 2023-06-18 09:32:57,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=176340.0, ans=0.0 2023-06-18 09:33:05,634 INFO [train.py:996] (1/4) Epoch 1, batch 29400, loss[loss=0.2078, simple_loss=0.2498, pruned_loss=0.08288, over 21323.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3711, pruned_loss=0.1349, over 4273126.05 frames. ], batch size: 131, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:34:22,279 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.521e+02 4.187e+02 5.229e+02 1.148e+03, threshold=8.373e+02, percent-clipped=2.0 2023-06-18 09:34:42,248 INFO [train.py:996] (1/4) Epoch 1, batch 29450, loss[loss=0.3724, simple_loss=0.4132, pruned_loss=0.1658, over 21746.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.37, pruned_loss=0.1338, over 4271891.38 frames. ], batch size: 441, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:34:43,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-18 09:35:11,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=176820.0, ans=0.125 2023-06-18 09:35:26,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-06-18 09:35:35,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=176820.0, ans=10.0 2023-06-18 09:36:03,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=176940.0, ans=0.04949747468305833 2023-06-18 09:36:05,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=176940.0, ans=0.1 2023-06-18 09:36:18,585 INFO [train.py:996] (1/4) Epoch 1, batch 29500, loss[loss=0.407, simple_loss=0.4166, pruned_loss=0.1987, over 21745.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3752, pruned_loss=0.1392, over 4274471.51 frames. ], batch size: 508, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:36:34,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=177060.0, ans=0.0 2023-06-18 09:36:44,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-18 09:37:10,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=177120.0, ans=0.0 2023-06-18 09:37:20,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=177180.0, ans=0.125 2023-06-18 09:37:29,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.303e+02 3.922e+02 4.990e+02 9.245e+02, threshold=7.844e+02, percent-clipped=1.0 2023-06-18 09:37:48,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=177240.0, ans=0.125 2023-06-18 09:37:54,373 INFO [train.py:996] (1/4) Epoch 1, batch 29550, loss[loss=0.3258, simple_loss=0.3622, pruned_loss=0.1447, over 21273.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3738, pruned_loss=0.1419, over 4284099.24 frames. ], batch size: 159, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:38:08,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=177360.0, ans=0.125 2023-06-18 09:38:13,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177360.0, ans=0.1 2023-06-18 09:38:25,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=177420.0, ans=0.125 2023-06-18 09:38:52,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=177420.0, ans=0.0 2023-06-18 09:39:23,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=177540.0, ans=0.0 2023-06-18 09:39:25,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-18 09:39:30,927 INFO [train.py:996] (1/4) Epoch 1, batch 29600, loss[loss=0.4256, simple_loss=0.484, pruned_loss=0.1836, over 21229.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3811, pruned_loss=0.1451, over 4287348.13 frames. 
], batch size: 548, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:39:38,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177600.0, ans=0.1 2023-06-18 09:40:23,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=177720.0, ans=0.07 2023-06-18 09:40:35,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=177780.0, ans=0.2 2023-06-18 09:40:40,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=177780.0, ans=0.125 2023-06-18 09:40:41,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.582e+02 4.234e+02 5.701e+02 1.045e+03, threshold=8.469e+02, percent-clipped=5.0 2023-06-18 09:40:55,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=177840.0, ans=0.125 2023-06-18 09:40:57,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177840.0, ans=0.1 2023-06-18 09:41:01,396 INFO [train.py:996] (1/4) Epoch 1, batch 29650, loss[loss=0.2999, simple_loss=0.3422, pruned_loss=0.1288, over 21476.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3809, pruned_loss=0.1424, over 4283395.05 frames. ], batch size: 548, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:41:09,811 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:41:11,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=177900.0, ans=0.125 2023-06-18 09:42:07,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178080.0, ans=0.1 2023-06-18 09:42:37,428 INFO [train.py:996] (1/4) Epoch 1, batch 29700, loss[loss=0.3271, simple_loss=0.4078, pruned_loss=0.1232, over 21651.00 frames. ], tot_loss[loss=0.3333, simple_loss=0.3826, pruned_loss=0.1419, over 4278683.29 frames. ], batch size: 263, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:42:51,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-18 09:43:51,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.547e+02 3.946e+02 4.847e+02 6.771e+02 1.201e+03, threshold=9.693e+02, percent-clipped=9.0 2023-06-18 09:43:59,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=178440.0, ans=0.125 2023-06-18 09:44:11,025 INFO [train.py:996] (1/4) Epoch 1, batch 29750, loss[loss=0.3249, simple_loss=0.383, pruned_loss=0.1334, over 21258.00 frames. ], tot_loss[loss=0.334, simple_loss=0.3865, pruned_loss=0.1407, over 4279089.63 frames. 
], batch size: 176, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:44:14,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=178500.0, ans=0.125 2023-06-18 09:44:55,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=178620.0, ans=0.125 2023-06-18 09:44:58,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178620.0, ans=0.1 2023-06-18 09:45:41,173 INFO [train.py:996] (1/4) Epoch 1, batch 29800, loss[loss=0.3131, simple_loss=0.3738, pruned_loss=0.1262, over 21888.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3879, pruned_loss=0.1417, over 4281559.53 frames. ], batch size: 118, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:45:48,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=178800.0, ans=0.0 2023-06-18 09:46:51,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.53 vs. limit=22.5 2023-06-18 09:46:57,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.287e+02 3.786e+02 4.602e+02 9.209e+02, threshold=7.572e+02, percent-clipped=0.0 2023-06-18 09:47:17,280 INFO [train.py:996] (1/4) Epoch 1, batch 29850, loss[loss=0.3563, simple_loss=0.4314, pruned_loss=0.1406, over 19831.00 frames. ], tot_loss[loss=0.3284, simple_loss=0.3815, pruned_loss=0.1376, over 4278279.68 frames. ], batch size: 703, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:47:29,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=179100.0, ans=0.1 2023-06-18 09:47:35,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.47 vs. limit=22.5 2023-06-18 09:48:47,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=179340.0, ans=0.2 2023-06-18 09:48:52,472 INFO [train.py:996] (1/4) Epoch 1, batch 29900, loss[loss=0.3159, simple_loss=0.3235, pruned_loss=0.1542, over 20104.00 frames. ], tot_loss[loss=0.3303, simple_loss=0.3799, pruned_loss=0.1403, over 4283944.75 frames. ], batch size: 703, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:49:50,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=179520.0, ans=0.0 2023-06-18 09:50:09,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.414e+02 4.153e+02 5.110e+02 1.047e+03, threshold=8.306e+02, percent-clipped=3.0 2023-06-18 09:50:34,861 INFO [train.py:996] (1/4) Epoch 1, batch 29950, loss[loss=0.3709, simple_loss=0.4056, pruned_loss=0.1681, over 21747.00 frames. ], tot_loss[loss=0.3391, simple_loss=0.3853, pruned_loss=0.1464, over 4288049.07 frames. ], batch size: 332, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:51:08,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=179760.0, ans=0.125 2023-06-18 09:52:08,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.54 vs. 
limit=15.0 2023-06-18 09:52:17,249 INFO [train.py:996] (1/4) Epoch 1, batch 30000, loss[loss=0.3584, simple_loss=0.4243, pruned_loss=0.1462, over 21486.00 frames. ], tot_loss[loss=0.3402, simple_loss=0.3868, pruned_loss=0.1468, over 4285110.52 frames. ], batch size: 471, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:52:17,249 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 09:52:35,092 INFO [train.py:1028] (1/4) Epoch 1, validation: loss=0.2819, simple_loss=0.3813, pruned_loss=0.09129, over 1796401.00 frames. 2023-06-18 09:52:35,093 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 09:52:36,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.50 vs. limit=15.0 2023-06-18 09:52:42,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=180000.0, ans=0.07 2023-06-18 09:52:42,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.41 vs. limit=6.0 2023-06-18 09:52:51,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=180000.0, ans=0.2 2023-06-18 09:53:12,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=180120.0, ans=0.0 2023-06-18 09:53:54,183 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.360e+02 4.099e+02 5.189e+02 8.987e+02, threshold=8.197e+02, percent-clipped=1.0 2023-06-18 09:54:11,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-18 09:54:13,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=180240.0, ans=0.125 2023-06-18 09:54:19,747 INFO [train.py:996] (1/4) Epoch 1, batch 30050, loss[loss=0.2938, simple_loss=0.3489, pruned_loss=0.1193, over 21256.00 frames. ], tot_loss[loss=0.337, simple_loss=0.39, pruned_loss=0.142, over 4280809.35 frames. ], batch size: 176, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:54:22,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-18 09:54:32,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=180300.0, ans=0.0 2023-06-18 09:54:48,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=8.0 2023-06-18 09:55:39,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=180540.0, ans=0.2 2023-06-18 09:55:56,200 INFO [train.py:996] (1/4) Epoch 1, batch 30100, loss[loss=0.3066, simple_loss=0.3405, pruned_loss=0.1364, over 21376.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3883, pruned_loss=0.1421, over 4276389.05 frames. 
], batch size: 131, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:56:05,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=180600.0, ans=0.125 2023-06-18 09:56:14,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=180660.0, ans=0.0 2023-06-18 09:56:17,639 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:57:11,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.434e+02 4.118e+02 5.111e+02 9.252e+02, threshold=8.235e+02, percent-clipped=1.0 2023-06-18 09:57:31,604 INFO [train.py:996] (1/4) Epoch 1, batch 30150, loss[loss=0.3337, simple_loss=0.3751, pruned_loss=0.1462, over 21287.00 frames. ], tot_loss[loss=0.3367, simple_loss=0.3844, pruned_loss=0.1445, over 4276103.32 frames. ], batch size: 176, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:57:51,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=180960.0, ans=0.0 2023-06-18 09:58:41,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=181080.0, ans=0.0 2023-06-18 09:59:01,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181140.0, ans=0.1 2023-06-18 09:59:05,009 INFO [train.py:996] (1/4) Epoch 1, batch 30200, loss[loss=0.3432, simple_loss=0.3653, pruned_loss=0.1606, over 20097.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3865, pruned_loss=0.1428, over 4280676.95 frames. ], batch size: 703, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:59:33,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=181260.0, ans=0.125 2023-06-18 09:59:50,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=181320.0, ans=0.125 2023-06-18 09:59:50,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=181320.0, ans=0.0 2023-06-18 10:00:20,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=181380.0, ans=0.125 2023-06-18 10:00:22,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=181380.0, ans=0.09899494936611666 2023-06-18 10:00:23,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.548e+02 4.753e+02 6.643e+02 1.324e+03, threshold=9.506e+02, percent-clipped=12.0 2023-06-18 10:00:51,967 INFO [train.py:996] (1/4) Epoch 1, batch 30250, loss[loss=0.3647, simple_loss=0.4312, pruned_loss=0.1491, over 21232.00 frames. ], tot_loss[loss=0.3438, simple_loss=0.3956, pruned_loss=0.146, over 4275058.55 frames. 
], batch size: 159, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:01:00,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=181500.0, ans=0.2 2023-06-18 10:01:27,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181560.0, ans=0.1 2023-06-18 10:02:14,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=181740.0, ans=0.125 2023-06-18 10:02:28,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=181740.0, ans=0.0 2023-06-18 10:02:32,619 INFO [train.py:996] (1/4) Epoch 1, batch 30300, loss[loss=0.4124, simple_loss=0.4403, pruned_loss=0.1923, over 21387.00 frames. ], tot_loss[loss=0.3422, simple_loss=0.3923, pruned_loss=0.1461, over 4266723.94 frames. ], batch size: 549, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:02:38,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=181800.0, ans=15.0 2023-06-18 10:02:45,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-18 10:02:50,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=181860.0, ans=0.0 2023-06-18 10:02:59,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=181860.0, ans=0.5 2023-06-18 10:03:02,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=181860.0, ans=0.0 2023-06-18 10:03:09,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=181920.0, ans=0.2 2023-06-18 10:03:37,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-18 10:03:47,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.818e+02 4.498e+02 5.757e+02 1.296e+03, threshold=8.996e+02, percent-clipped=5.0 2023-06-18 10:03:51,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=182040.0, ans=0.125 2023-06-18 10:03:51,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=182040.0, ans=0.125 2023-06-18 10:04:05,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=182040.0, ans=0.125 2023-06-18 10:04:11,452 INFO [train.py:996] (1/4) Epoch 1, batch 30350, loss[loss=0.3675, simple_loss=0.4481, pruned_loss=0.1435, over 21253.00 frames. ], tot_loss[loss=0.346, simple_loss=0.3959, pruned_loss=0.1481, over 4272257.75 frames. 
], batch size: 549, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:04:24,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=182100.0, ans=0.05 2023-06-18 10:05:21,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=182340.0, ans=0.125 2023-06-18 10:05:22,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=182340.0, ans=0.0 2023-06-18 10:05:23,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=182340.0, ans=0.125 2023-06-18 10:05:26,611 INFO [train.py:996] (1/4) Epoch 1, batch 30400, loss[loss=0.3146, simple_loss=0.3309, pruned_loss=0.1492, over 20480.00 frames. ], tot_loss[loss=0.3374, simple_loss=0.3862, pruned_loss=0.1443, over 4258073.39 frames. ], batch size: 703, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:06:31,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 4.537e+02 5.674e+02 8.183e+02 2.727e+03, threshold=1.135e+03, percent-clipped=13.0 2023-06-18 10:06:47,824 INFO [train.py:996] (1/4) Epoch 1, batch 30450, loss[loss=0.4077, simple_loss=0.4958, pruned_loss=0.1598, over 19845.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3904, pruned_loss=0.1466, over 4199429.27 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:06:50,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-18 10:06:57,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=182700.0, ans=0.04949747468305833 2023-06-18 10:07:20,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=182820.0, ans=0.0 2023-06-18 10:07:23,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=182820.0, ans=0.0 2023-06-18 10:07:30,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=182820.0, ans=0.0 2023-06-18 10:09:26,532 INFO [train.py:996] (1/4) Epoch 2, batch 0, loss[loss=0.3871, simple_loss=0.3953, pruned_loss=0.1895, over 21452.00 frames. ], tot_loss[loss=0.3871, simple_loss=0.3953, pruned_loss=0.1895, over 21452.00 frames. ], batch size: 195, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 10:09:26,532 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 10:09:43,667 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.3124, simple_loss=0.4068, pruned_loss=0.109, over 1796401.00 frames. 
2023-06-18 10:09:43,668 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 10:09:44,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=182970.0, ans=0.1 2023-06-18 10:10:32,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=183150.0, ans=0.125 2023-06-18 10:10:55,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=183150.0, ans=0.0 2023-06-18 10:11:04,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 4.740e+02 6.519e+02 1.031e+03 2.172e+03, threshold=1.304e+03, percent-clipped=18.0 2023-06-18 10:11:13,454 INFO [train.py:996] (1/4) Epoch 2, batch 50, loss[loss=0.2818, simple_loss=0.362, pruned_loss=0.1008, over 21644.00 frames. ], tot_loss[loss=0.3393, simple_loss=0.3904, pruned_loss=0.1442, over 947750.91 frames. ], batch size: 263, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:12:00,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=183390.0, ans=0.125 2023-06-18 10:12:00,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=183390.0, ans=0.125 2023-06-18 10:12:04,458 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:12:21,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=183450.0, ans=0.1 2023-06-18 10:12:49,799 INFO [train.py:996] (1/4) Epoch 2, batch 100, loss[loss=0.356, simple_loss=0.4325, pruned_loss=0.1398, over 21747.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.4098, pruned_loss=0.1485, over 1692470.64 frames. ], batch size: 332, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:13:05,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=183570.0, ans=0.0 2023-06-18 10:14:16,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.494e+02 4.383e+02 5.480e+02 8.773e+02, threshold=8.766e+02, percent-clipped=0.0 2023-06-18 10:14:29,955 INFO [train.py:996] (1/4) Epoch 2, batch 150, loss[loss=0.4222, simple_loss=0.4665, pruned_loss=0.1889, over 21479.00 frames. ], tot_loss[loss=0.3484, simple_loss=0.4077, pruned_loss=0.1446, over 2260536.34 frames. ], batch size: 508, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:15:18,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.28 vs. limit=15.0 2023-06-18 10:15:29,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=184050.0, ans=0.2 2023-06-18 10:15:59,063 INFO [train.py:996] (1/4) Epoch 2, batch 200, loss[loss=0.3245, simple_loss=0.3837, pruned_loss=0.1326, over 21676.00 frames. ], tot_loss[loss=0.3392, simple_loss=0.3987, pruned_loss=0.1399, over 2705906.09 frames. 
], batch size: 351, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:16:14,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=184170.0, ans=0.125 2023-06-18 10:16:16,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-18 10:16:23,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=184230.0, ans=0.125 2023-06-18 10:16:29,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184230.0, ans=0.1 2023-06-18 10:16:41,900 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:16:46,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-18 10:17:24,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.721e+02 4.278e+02 5.715e+02 9.625e+02, threshold=8.556e+02, percent-clipped=2.0 2023-06-18 10:17:33,211 INFO [train.py:996] (1/4) Epoch 2, batch 250, loss[loss=0.3782, simple_loss=0.4233, pruned_loss=0.1665, over 21444.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3962, pruned_loss=0.143, over 3057053.38 frames. ], batch size: 131, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:18:52,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=184650.0, ans=0.0 2023-06-18 10:19:08,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-18 10:19:15,563 INFO [train.py:996] (1/4) Epoch 2, batch 300, loss[loss=0.306, simple_loss=0.3493, pruned_loss=0.1314, over 21873.00 frames. ], tot_loss[loss=0.3405, simple_loss=0.3933, pruned_loss=0.1438, over 3328201.89 frames. ], batch size: 373, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:19:58,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=184890.0, ans=0.125 2023-06-18 10:20:39,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.502e+02 4.825e+02 6.381e+02 1.072e+03, threshold=9.650e+02, percent-clipped=6.0 2023-06-18 10:20:48,537 INFO [train.py:996] (1/4) Epoch 2, batch 350, loss[loss=0.384, simple_loss=0.4392, pruned_loss=0.1644, over 21401.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3874, pruned_loss=0.1427, over 3533353.68 frames. ], batch size: 548, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:21:11,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-18 10:21:40,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185190.0, ans=0.125 2023-06-18 10:22:25,625 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:22:26,772 INFO [train.py:996] (1/4) Epoch 2, batch 400, loss[loss=0.3424, simple_loss=0.357, pruned_loss=0.1639, over 21391.00 frames. 
], tot_loss[loss=0.3274, simple_loss=0.3762, pruned_loss=0.1394, over 3702013.03 frames. ], batch size: 509, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:22:27,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185370.0, ans=0.1 2023-06-18 10:22:36,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185370.0, ans=0.1 2023-06-18 10:22:42,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=185370.0, ans=0.2 2023-06-18 10:22:48,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=185430.0, ans=0.0 2023-06-18 10:23:13,390 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.363e-02 2023-06-18 10:23:37,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=185550.0, ans=0.125 2023-06-18 10:23:41,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-18 10:23:45,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-18 10:23:49,219 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.316e+02 4.407e+02 6.179e+02 1.311e+03, threshold=8.814e+02, percent-clipped=2.0 2023-06-18 10:23:50,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=185610.0, ans=0.125 2023-06-18 10:23:58,108 INFO [train.py:996] (1/4) Epoch 2, batch 450, loss[loss=0.3296, simple_loss=0.4067, pruned_loss=0.1262, over 21567.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3719, pruned_loss=0.1364, over 3834963.03 frames. ], batch size: 230, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:23:59,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=15.0 2023-06-18 10:24:15,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=185670.0, ans=0.125 2023-06-18 10:24:49,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=185790.0, ans=0.125 2023-06-18 10:25:21,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=185910.0, ans=0.0 2023-06-18 10:25:28,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 10:25:33,374 INFO [train.py:996] (1/4) Epoch 2, batch 500, loss[loss=0.2697, simple_loss=0.3171, pruned_loss=0.1112, over 22006.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3753, pruned_loss=0.135, over 3930467.77 frames. 
], batch size: 119, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:25:48,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=186030.0, ans=0.125 2023-06-18 10:26:17,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-18 10:26:48,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=186210.0, ans=0.125 2023-06-18 10:26:55,378 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.837e+02 4.927e+02 6.704e+02 1.422e+03, threshold=9.853e+02, percent-clipped=11.0 2023-06-18 10:27:04,393 INFO [train.py:996] (1/4) Epoch 2, batch 550, loss[loss=0.3083, simple_loss=0.3777, pruned_loss=0.1195, over 19908.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3756, pruned_loss=0.1336, over 4007020.32 frames. ], batch size: 704, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:27:18,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186270.0, ans=0.125 2023-06-18 10:27:25,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-18 10:27:44,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=12.0 2023-06-18 10:27:48,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=186390.0, ans=0.125 2023-06-18 10:27:49,573 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:27:49,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=186390.0, ans=10.0 2023-06-18 10:27:52,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=186390.0, ans=0.04949747468305833 2023-06-18 10:28:21,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-18 10:28:30,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=186510.0, ans=0.2 2023-06-18 10:28:46,179 INFO [train.py:996] (1/4) Epoch 2, batch 600, loss[loss=0.393, simple_loss=0.4694, pruned_loss=0.1583, over 21711.00 frames. ], tot_loss[loss=0.3218, simple_loss=0.3768, pruned_loss=0.1334, over 4065808.89 frames. ], batch size: 414, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:28:46,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=186570.0, ans=0.125 2023-06-18 10:28:47,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-18 10:28:58,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=186570.0, ans=0.04949747468305833 2023-06-18 10:29:18,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=10.0 2023-06-18 10:29:57,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-18 10:30:02,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-18 10:30:03,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=186810.0, ans=0.0 2023-06-18 10:30:07,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.420e+02 4.275e+02 5.622e+02 1.549e+03, threshold=8.550e+02, percent-clipped=4.0 2023-06-18 10:30:08,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.86 vs. limit=22.5 2023-06-18 10:30:21,292 INFO [train.py:996] (1/4) Epoch 2, batch 650, loss[loss=0.3175, simple_loss=0.3607, pruned_loss=0.1372, over 21820.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3784, pruned_loss=0.1341, over 4110451.90 frames. ], batch size: 371, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:30:32,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-18 10:30:33,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=186870.0, ans=0.2 2023-06-18 10:30:39,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=186930.0, ans=0.125 2023-06-18 10:30:58,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-18 10:31:12,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=187050.0, ans=0.0 2023-06-18 10:31:56,863 INFO [train.py:996] (1/4) Epoch 2, batch 700, loss[loss=0.2758, simple_loss=0.3842, pruned_loss=0.08365, over 20816.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3786, pruned_loss=0.1343, over 4149964.33 frames. ], batch size: 608, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:32:11,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=15.0 2023-06-18 10:32:12,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=187230.0, ans=0.125 2023-06-18 10:32:15,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=187230.0, ans=0.1 2023-06-18 10:33:12,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=187410.0, ans=0.1 2023-06-18 10:33:12,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=187410.0, ans=0.125 2023-06-18 10:33:18,457 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.924e+02 4.620e+02 5.995e+02 1.020e+03, threshold=9.239e+02, percent-clipped=3.0 2023-06-18 10:33:31,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=187470.0, ans=0.125 2023-06-18 10:33:32,496 INFO [train.py:996] (1/4) Epoch 2, batch 750, loss[loss=0.3162, simple_loss=0.3534, pruned_loss=0.1395, over 21815.00 frames. ], tot_loss[loss=0.3255, simple_loss=0.3793, pruned_loss=0.1358, over 4176192.29 frames. ], batch size: 118, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:34:03,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=187530.0, ans=0.125 2023-06-18 10:34:16,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=187590.0, ans=0.0 2023-06-18 10:34:48,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-18 10:34:54,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=187710.0, ans=0.0 2023-06-18 10:35:07,274 INFO [train.py:996] (1/4) Epoch 2, batch 800, loss[loss=0.2682, simple_loss=0.3167, pruned_loss=0.1098, over 21950.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3768, pruned_loss=0.137, over 4208586.80 frames. 
], batch size: 113, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:35:19,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=187770.0, ans=0.2 2023-06-18 10:35:47,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=187890.0, ans=0.0 2023-06-18 10:36:01,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=187950.0, ans=0.02 2023-06-18 10:36:02,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187950.0, ans=0.125 2023-06-18 10:36:06,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=187950.0, ans=0.5 2023-06-18 10:36:09,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=187950.0, ans=0.125 2023-06-18 10:36:19,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=187950.0, ans=0.0 2023-06-18 10:36:28,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.568e+02 4.374e+02 5.699e+02 1.207e+03, threshold=8.749e+02, percent-clipped=3.0 2023-06-18 10:36:41,764 INFO [train.py:996] (1/4) Epoch 2, batch 850, loss[loss=0.305, simple_loss=0.3479, pruned_loss=0.131, over 21792.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3734, pruned_loss=0.1368, over 4226834.87 frames. ], batch size: 351, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:36:43,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=188070.0, ans=0.125 2023-06-18 10:37:13,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-18 10:37:16,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=188190.0, ans=0.125 2023-06-18 10:38:17,628 INFO [train.py:996] (1/4) Epoch 2, batch 900, loss[loss=0.3468, simple_loss=0.3863, pruned_loss=0.1537, over 21751.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3722, pruned_loss=0.1376, over 4243189.73 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:38:56,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=188490.0, ans=0.125 2023-06-18 10:39:31,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=188550.0, ans=0.125 2023-06-18 10:39:44,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=188610.0, ans=0.0 2023-06-18 10:39:45,693 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.152e+02 3.796e+02 5.190e+02 9.493e+02, threshold=7.592e+02, percent-clipped=1.0 2023-06-18 10:39:55,093 INFO [train.py:996] (1/4) Epoch 2, batch 950, loss[loss=0.3268, simple_loss=0.4254, pruned_loss=0.114, over 19782.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3699, pruned_loss=0.136, over 4254304.31 frames. 
], batch size: 702, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:39:58,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=188670.0, ans=0.125 2023-06-18 10:40:03,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188670.0, ans=0.1 2023-06-18 10:40:23,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=188730.0, ans=0.125 2023-06-18 10:40:37,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=188790.0, ans=0.125 2023-06-18 10:41:30,565 INFO [train.py:996] (1/4) Epoch 2, batch 1000, loss[loss=0.3714, simple_loss=0.4201, pruned_loss=0.1613, over 21507.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3698, pruned_loss=0.1356, over 4265015.59 frames. ], batch size: 194, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:42:25,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=189090.0, ans=0.0 2023-06-18 10:43:00,415 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.236e+02 4.022e+02 4.696e+02 7.726e+02, threshold=8.043e+02, percent-clipped=1.0 2023-06-18 10:43:00,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=189210.0, ans=0.125 2023-06-18 10:43:06,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=22.5 2023-06-18 10:43:09,839 INFO [train.py:996] (1/4) Epoch 2, batch 1050, loss[loss=0.3209, simple_loss=0.3782, pruned_loss=0.1318, over 21801.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.37, pruned_loss=0.1352, over 4269250.00 frames. ], batch size: 282, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:43:26,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=189270.0, ans=0.125 2023-06-18 10:43:58,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=189390.0, ans=0.2 2023-06-18 10:44:18,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=189450.0, ans=0.07 2023-06-18 10:44:27,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=189450.0, ans=0.0 2023-06-18 10:44:42,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=189510.0, ans=0.125 2023-06-18 10:44:45,535 INFO [train.py:996] (1/4) Epoch 2, batch 1100, loss[loss=0.3729, simple_loss=0.4187, pruned_loss=0.1636, over 21636.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3708, pruned_loss=0.1356, over 4267957.77 frames. ], batch size: 389, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:46:04,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.38 vs. 
limit=15.0 2023-06-18 10:46:10,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=189810.0, ans=0.125 2023-06-18 10:46:13,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 4.116e+02 5.178e+02 8.115e+02 1.294e+03, threshold=1.036e+03, percent-clipped=24.0 2023-06-18 10:46:27,135 INFO [train.py:996] (1/4) Epoch 2, batch 1150, loss[loss=0.3766, simple_loss=0.4206, pruned_loss=0.1664, over 21647.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3701, pruned_loss=0.1343, over 4275836.57 frames. ], batch size: 414, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:46:27,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=189870.0, ans=0.0 2023-06-18 10:47:00,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189930.0, ans=0.1 2023-06-18 10:47:12,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=189990.0, ans=0.2 2023-06-18 10:47:15,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=189990.0, ans=0.1 2023-06-18 10:47:20,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=189990.0, ans=0.125 2023-06-18 10:47:20,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.06 vs. limit=10.0 2023-06-18 10:47:22,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=189990.0, ans=0.07 2023-06-18 10:47:25,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-18 10:47:59,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190110.0, ans=0.1 2023-06-18 10:48:04,097 INFO [train.py:996] (1/4) Epoch 2, batch 1200, loss[loss=0.3852, simple_loss=0.4381, pruned_loss=0.1662, over 21790.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3703, pruned_loss=0.1337, over 4274141.31 frames. ], batch size: 124, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:48:25,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=190170.0, ans=0.125 2023-06-18 10:48:38,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=190230.0, ans=0.2 2023-06-18 10:48:52,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=22.5 2023-06-18 10:49:02,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=190290.0, ans=0.1 2023-06-18 10:49:32,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.499e+02 6.126e+02 1.054e+03, threshold=8.999e+02, percent-clipped=1.0 2023-06-18 10:49:41,265 INFO [train.py:996] (1/4) Epoch 2, batch 1250, loss[loss=0.294, simple_loss=0.3462, pruned_loss=0.1209, over 21296.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3748, pruned_loss=0.1363, over 4283813.93 frames. 
], batch size: 159, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:51:14,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=190710.0, ans=0.2 2023-06-18 10:51:23,585 INFO [train.py:996] (1/4) Epoch 2, batch 1300, loss[loss=0.366, simple_loss=0.4146, pruned_loss=0.1587, over 21761.00 frames. ], tot_loss[loss=0.324, simple_loss=0.3753, pruned_loss=0.1363, over 4280711.03 frames. ], batch size: 414, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:52:20,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=190950.0, ans=0.125 2023-06-18 10:52:45,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.558e+02 4.499e+02 5.832e+02 1.027e+03, threshold=8.998e+02, percent-clipped=2.0 2023-06-18 10:52:58,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-18 10:53:00,057 INFO [train.py:996] (1/4) Epoch 2, batch 1350, loss[loss=0.2818, simple_loss=0.2998, pruned_loss=0.1319, over 20164.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3758, pruned_loss=0.1364, over 4284192.53 frames. ], batch size: 703, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:53:16,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-18 10:53:33,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-18 10:53:51,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=191190.0, ans=0.0 2023-06-18 10:54:24,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191310.0, ans=0.1 2023-06-18 10:54:41,369 INFO [train.py:996] (1/4) Epoch 2, batch 1400, loss[loss=0.4962, simple_loss=0.5081, pruned_loss=0.2421, over 21467.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3744, pruned_loss=0.1347, over 4274658.65 frames. ], batch size: 471, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:54:44,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=191370.0, ans=0.1 2023-06-18 10:55:18,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=191490.0, ans=0.125 2023-06-18 10:55:47,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.01 vs. limit=15.0 2023-06-18 10:56:04,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.616e+02 4.167e+02 4.895e+02 9.301e+02, threshold=8.333e+02, percent-clipped=3.0 2023-06-18 10:56:07,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191610.0, ans=0.1 2023-06-18 10:56:18,027 INFO [train.py:996] (1/4) Epoch 2, batch 1450, loss[loss=0.3301, simple_loss=0.3813, pruned_loss=0.1394, over 21511.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3752, pruned_loss=0.1355, over 4279610.70 frames. 
], batch size: 194, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:56:45,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=191730.0, ans=0.0 2023-06-18 10:57:54,113 INFO [train.py:996] (1/4) Epoch 2, batch 1500, loss[loss=0.3315, simple_loss=0.384, pruned_loss=0.1395, over 21607.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.378, pruned_loss=0.1383, over 4278536.30 frames. ], batch size: 263, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:59:22,599 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.361e+02 4.007e+02 4.888e+02 8.078e+02, threshold=8.013e+02, percent-clipped=0.0 2023-06-18 10:59:32,391 INFO [train.py:996] (1/4) Epoch 2, batch 1550, loss[loss=0.3543, simple_loss=0.3903, pruned_loss=0.1592, over 21353.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.3752, pruned_loss=0.1368, over 4278624.86 frames. ], batch size: 143, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:00:22,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=192390.0, ans=0.0 2023-06-18 11:00:43,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=192450.0, ans=0.0 2023-06-18 11:00:59,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192510.0, ans=0.125 2023-06-18 11:01:16,372 INFO [train.py:996] (1/4) Epoch 2, batch 1600, loss[loss=0.3192, simple_loss=0.3644, pruned_loss=0.137, over 21714.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.3756, pruned_loss=0.137, over 4285974.46 frames. ], batch size: 351, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:02:12,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=192750.0, ans=0.1 2023-06-18 11:02:23,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=192750.0, ans=0.125 2023-06-18 11:02:34,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=192810.0, ans=0.2 2023-06-18 11:02:44,868 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.757e+02 4.591e+02 6.473e+02 1.240e+03, threshold=9.183e+02, percent-clipped=13.0 2023-06-18 11:02:54,024 INFO [train.py:996] (1/4) Epoch 2, batch 1650, loss[loss=0.3526, simple_loss=0.3928, pruned_loss=0.1562, over 21318.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3739, pruned_loss=0.1354, over 4286420.96 frames. ], batch size: 143, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:03:15,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-18 11:03:17,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=192930.0, ans=0.125 2023-06-18 11:04:21,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-18 11:04:31,330 INFO [train.py:996] (1/4) Epoch 2, batch 1700, loss[loss=0.3642, simple_loss=0.3972, pruned_loss=0.1657, over 21789.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3771, pruned_loss=0.1364, over 4286636.26 frames. 
], batch size: 247, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:04:42,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=193170.0, ans=0.2 2023-06-18 11:04:54,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193230.0, ans=0.1 2023-06-18 11:05:03,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=193290.0, ans=0.04949747468305833 2023-06-18 11:05:56,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.866e+02 3.867e+02 4.593e+02 5.670e+02 8.844e+02, threshold=9.185e+02, percent-clipped=0.0 2023-06-18 11:06:05,791 INFO [train.py:996] (1/4) Epoch 2, batch 1750, loss[loss=0.3129, simple_loss=0.3839, pruned_loss=0.1209, over 21498.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3774, pruned_loss=0.1354, over 4280903.29 frames. ], batch size: 471, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:06:12,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193470.0, ans=0.1 2023-06-18 11:06:20,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-18 11:06:52,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=193590.0, ans=0.1 2023-06-18 11:06:54,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-18 11:07:06,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=193650.0, ans=0.125 2023-06-18 11:07:07,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-18 11:07:27,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.32 vs. limit=6.0 2023-06-18 11:07:31,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=193710.0, ans=0.125 2023-06-18 11:07:34,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=193710.0, ans=0.125 2023-06-18 11:07:39,023 INFO [train.py:996] (1/4) Epoch 2, batch 1800, loss[loss=0.2819, simple_loss=0.3732, pruned_loss=0.09535, over 21706.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3714, pruned_loss=0.1301, over 4270812.53 frames. ], batch size: 298, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:07:43,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=193770.0, ans=0.125 2023-06-18 11:07:55,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=193770.0, ans=0.125 2023-06-18 11:08:11,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.92 vs. 
limit=5.0 2023-06-18 11:08:49,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-18 11:08:52,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=193950.0, ans=0.035 2023-06-18 11:09:07,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.185e+02 3.782e+02 4.585e+02 7.556e+02, threshold=7.564e+02, percent-clipped=0.0 2023-06-18 11:09:17,312 INFO [train.py:996] (1/4) Epoch 2, batch 1850, loss[loss=0.3195, simple_loss=0.3755, pruned_loss=0.1318, over 21828.00 frames. ], tot_loss[loss=0.3099, simple_loss=0.3688, pruned_loss=0.1255, over 4273662.79 frames. ], batch size: 247, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:09:36,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=194130.0, ans=0.0 2023-06-18 11:10:11,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=194190.0, ans=0.0 2023-06-18 11:10:32,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-18 11:10:48,407 INFO [train.py:996] (1/4) Epoch 2, batch 1900, loss[loss=0.2839, simple_loss=0.3451, pruned_loss=0.1113, over 21787.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3719, pruned_loss=0.1281, over 4270440.15 frames. ], batch size: 282, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:11:45,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-18 11:11:46,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=194490.0, ans=0.0 2023-06-18 11:11:57,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=194550.0, ans=0.0 2023-06-18 11:12:06,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-18 11:12:08,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=194610.0, ans=0.1 2023-06-18 11:12:15,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.697e+02 4.739e+02 6.641e+02 1.232e+03, threshold=9.479e+02, percent-clipped=18.0 2023-06-18 11:12:16,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=194610.0, ans=0.1 2023-06-18 11:12:24,851 INFO [train.py:996] (1/4) Epoch 2, batch 1950, loss[loss=0.2774, simple_loss=0.325, pruned_loss=0.1149, over 21795.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3671, pruned_loss=0.1287, over 4279515.56 frames. 
], batch size: 317, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:13:03,011 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:13:03,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=194730.0, ans=0.0 2023-06-18 11:13:05,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-18 11:13:18,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=194790.0, ans=0.1 2023-06-18 11:13:47,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=194910.0, ans=0.125 2023-06-18 11:13:48,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=194910.0, ans=0.0 2023-06-18 11:14:12,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=194910.0, ans=0.0 2023-06-18 11:14:12,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.14 vs. limit=12.0 2023-06-18 11:14:14,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-06-18 11:14:14,860 INFO [train.py:996] (1/4) Epoch 2, batch 2000, loss[loss=0.2739, simple_loss=0.3481, pruned_loss=0.09983, over 21796.00 frames. ], tot_loss[loss=0.3071, simple_loss=0.3613, pruned_loss=0.1264, over 4280846.73 frames. ], batch size: 282, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:14:46,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=195030.0, ans=0.09899494936611666 2023-06-18 11:14:57,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=195090.0, ans=0.125 2023-06-18 11:15:04,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=195090.0, ans=0.0 2023-06-18 11:15:12,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-18 11:15:31,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.372e+02 4.347e+02 5.379e+02 1.010e+03, threshold=8.694e+02, percent-clipped=3.0 2023-06-18 11:15:36,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=195210.0, ans=0.2 2023-06-18 11:15:38,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=195210.0, ans=0.125 2023-06-18 11:15:39,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=195270.0, ans=0.0 2023-06-18 11:15:45,805 INFO [train.py:996] (1/4) Epoch 2, batch 2050, loss[loss=0.3196, simple_loss=0.3698, pruned_loss=0.1347, over 21855.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3618, pruned_loss=0.1256, over 4277602.77 frames. 
], batch size: 124, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:15:58,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=195270.0, ans=0.125 2023-06-18 11:16:02,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=195270.0, ans=0.05 2023-06-18 11:16:45,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=195450.0, ans=0.125 2023-06-18 11:16:57,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=195510.0, ans=0.2 2023-06-18 11:17:22,769 INFO [train.py:996] (1/4) Epoch 2, batch 2100, loss[loss=0.3303, simple_loss=0.4307, pruned_loss=0.1149, over 21237.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3679, pruned_loss=0.1294, over 4282445.93 frames. ], batch size: 548, lr: 1.94e-02, grad_scale: 64.0 2023-06-18 11:17:52,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=195630.0, ans=0.125 2023-06-18 11:17:55,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195630.0, ans=0.1 2023-06-18 11:18:16,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-18 11:18:19,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=195750.0, ans=0.2 2023-06-18 11:18:32,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-18 11:18:52,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.854e+02 4.642e+02 6.317e+02 1.235e+03, threshold=9.284e+02, percent-clipped=5.0 2023-06-18 11:18:59,861 INFO [train.py:996] (1/4) Epoch 2, batch 2150, loss[loss=0.3431, simple_loss=0.3659, pruned_loss=0.1602, over 21198.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3706, pruned_loss=0.1319, over 4277487.95 frames. ], batch size: 471, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:19:04,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-18 11:19:34,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=195930.0, ans=0.125 2023-06-18 11:19:54,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=196050.0, ans=0.0 2023-06-18 11:20:32,185 INFO [train.py:996] (1/4) Epoch 2, batch 2200, loss[loss=0.2948, simple_loss=0.3829, pruned_loss=0.1033, over 20814.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3736, pruned_loss=0.1331, over 4268246.57 frames. 
], batch size: 607, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:20:32,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=196170.0, ans=0.125 2023-06-18 11:20:34,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=196170.0, ans=0.0 2023-06-18 11:21:00,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.71 vs. limit=22.5 2023-06-18 11:21:00,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=196230.0, ans=0.0 2023-06-18 11:21:51,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.406e+02 4.174e+02 5.326e+02 1.037e+03, threshold=8.349e+02, percent-clipped=3.0 2023-06-18 11:21:59,029 INFO [train.py:996] (1/4) Epoch 2, batch 2250, loss[loss=0.2781, simple_loss=0.3231, pruned_loss=0.1165, over 21245.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3689, pruned_loss=0.129, over 4268781.96 frames. ], batch size: 176, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:22:24,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=196530.0, ans=0.07 2023-06-18 11:23:34,867 INFO [train.py:996] (1/4) Epoch 2, batch 2300, loss[loss=0.2936, simple_loss=0.3264, pruned_loss=0.1305, over 21671.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3622, pruned_loss=0.1284, over 4253667.46 frames. ], batch size: 248, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:24:41,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.17 vs. limit=10.0 2023-06-18 11:24:58,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.491e+02 4.298e+02 5.253e+02 1.181e+03, threshold=8.597e+02, percent-clipped=4.0 2023-06-18 11:25:06,159 INFO [train.py:996] (1/4) Epoch 2, batch 2350, loss[loss=0.367, simple_loss=0.3699, pruned_loss=0.182, over 21378.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3581, pruned_loss=0.1286, over 4249871.82 frames. ], batch size: 508, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:25:29,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197070.0, ans=0.1 2023-06-18 11:26:06,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=197250.0, ans=0.0 2023-06-18 11:26:43,905 INFO [train.py:996] (1/4) Epoch 2, batch 2400, loss[loss=0.2707, simple_loss=0.3169, pruned_loss=0.1123, over 20105.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3617, pruned_loss=0.1318, over 4247169.37 frames. ], batch size: 702, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:26:51,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=15.0 2023-06-18 11:26:51,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=197370.0, ans=15.0 2023-06-18 11:27:01,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=197370.0, ans=0.125 2023-06-18 11:28:04,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=197610.0, ans=0.05 2023-06-18 11:28:18,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 3.750e+02 4.331e+02 6.076e+02 1.202e+03, threshold=8.663e+02, percent-clipped=8.0 2023-06-18 11:28:30,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=197670.0, ans=0.125 2023-06-18 11:28:31,360 INFO [train.py:996] (1/4) Epoch 2, batch 2450, loss[loss=0.2568, simple_loss=0.3118, pruned_loss=0.1009, over 21629.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3708, pruned_loss=0.1361, over 4251074.64 frames. ], batch size: 263, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:28:53,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=197730.0, ans=0.125 2023-06-18 11:28:54,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=197730.0, ans=0.125 2023-06-18 11:29:13,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=197790.0, ans=0.125 2023-06-18 11:29:39,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=197850.0, ans=0.95 2023-06-18 11:29:42,623 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:30:02,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=197970.0, ans=0.0 2023-06-18 11:30:04,079 INFO [train.py:996] (1/4) Epoch 2, batch 2500, loss[loss=0.2892, simple_loss=0.34, pruned_loss=0.1193, over 15382.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3678, pruned_loss=0.1345, over 4255839.97 frames. ], batch size: 61, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:30:16,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=197970.0, ans=0.125 2023-06-18 11:31:13,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=198150.0, ans=0.035 2023-06-18 11:31:34,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.317e+02 4.381e+02 5.204e+02 7.754e+02, threshold=8.763e+02, percent-clipped=1.0 2023-06-18 11:31:36,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=198210.0, ans=0.0 2023-06-18 11:31:46,959 INFO [train.py:996] (1/4) Epoch 2, batch 2550, loss[loss=0.283, simple_loss=0.3503, pruned_loss=0.1079, over 20788.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3685, pruned_loss=0.1342, over 4254140.58 frames. 
], batch size: 608, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:31:52,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=198270.0, ans=0.125 2023-06-18 11:32:38,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=198390.0, ans=0.125 2023-06-18 11:32:43,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=198450.0, ans=0.125 2023-06-18 11:33:08,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=198510.0, ans=0.125 2023-06-18 11:33:18,173 INFO [train.py:996] (1/4) Epoch 2, batch 2600, loss[loss=0.355, simple_loss=0.3889, pruned_loss=0.1605, over 21708.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3713, pruned_loss=0.1352, over 4250296.18 frames. ], batch size: 351, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:33:29,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=198570.0, ans=0.5 2023-06-18 11:33:54,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=198690.0, ans=0.05 2023-06-18 11:34:27,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=198750.0, ans=0.125 2023-06-18 11:34:47,548 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.436e+02 4.244e+02 5.240e+02 1.197e+03, threshold=8.488e+02, percent-clipped=2.0 2023-06-18 11:34:52,773 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:34:55,302 INFO [train.py:996] (1/4) Epoch 2, batch 2650, loss[loss=0.3141, simple_loss=0.3547, pruned_loss=0.1367, over 21500.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3726, pruned_loss=0.1372, over 4258479.80 frames. ], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:34:57,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=198870.0, ans=0.0 2023-06-18 11:35:06,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=198870.0, ans=0.1 2023-06-18 11:36:03,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199050.0, ans=0.125 2023-06-18 11:36:20,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-18 11:36:34,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199110.0, ans=0.1 2023-06-18 11:36:37,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=199170.0, ans=0.0 2023-06-18 11:36:39,073 INFO [train.py:996] (1/4) Epoch 2, batch 2700, loss[loss=0.2669, simple_loss=0.3157, pruned_loss=0.1091, over 21913.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3705, pruned_loss=0.1351, over 4266215.68 frames. 
], batch size: 107, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:36:50,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.91 vs. limit=22.5 2023-06-18 11:37:01,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=199230.0, ans=0.125 2023-06-18 11:37:05,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=199230.0, ans=0.125 2023-06-18 11:37:35,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-18 11:37:41,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-18 11:37:52,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=199350.0, ans=0.125 2023-06-18 11:38:02,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 4.256e+02 5.020e+02 6.245e+02 1.096e+03, threshold=1.004e+03, percent-clipped=9.0 2023-06-18 11:38:14,781 INFO [train.py:996] (1/4) Epoch 2, batch 2750, loss[loss=0.3158, simple_loss=0.3723, pruned_loss=0.1296, over 21842.00 frames. ], tot_loss[loss=0.3167, simple_loss=0.3666, pruned_loss=0.1334, over 4274260.75 frames. ], batch size: 118, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:38:19,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.10 vs. limit=22.5 2023-06-18 11:38:22,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-18 11:38:23,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0 2023-06-18 11:38:45,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=199590.0, ans=0.0 2023-06-18 11:39:09,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-18 11:39:25,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=199650.0, ans=0.125 2023-06-18 11:39:33,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=199710.0, ans=0.125 2023-06-18 11:39:55,066 INFO [train.py:996] (1/4) Epoch 2, batch 2800, loss[loss=0.305, simple_loss=0.3993, pruned_loss=0.1054, over 19766.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.3736, pruned_loss=0.1349, over 4275042.92 frames. 
], batch size: 703, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:40:06,598 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:40:09,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=199830.0, ans=0.0 2023-06-18 11:40:14,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199830.0, ans=0.1 2023-06-18 11:40:48,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=199890.0, ans=0.2 2023-06-18 11:41:06,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=199950.0, ans=0.0 2023-06-18 11:41:12,398 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:41:16,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200010.0, ans=0.1 2023-06-18 11:41:25,816 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 3.620e+02 4.325e+02 5.387e+02 9.118e+02, threshold=8.651e+02, percent-clipped=0.0 2023-06-18 11:41:27,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=200010.0, ans=0.125 2023-06-18 11:41:32,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=200070.0, ans=0.0 2023-06-18 11:41:33,772 INFO [train.py:996] (1/4) Epoch 2, batch 2850, loss[loss=0.2471, simple_loss=0.3009, pruned_loss=0.09666, over 21542.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3734, pruned_loss=0.1352, over 4273944.50 frames. ], batch size: 230, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:41:41,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-18 11:43:07,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=200310.0, ans=0.04949747468305833 2023-06-18 11:43:10,717 INFO [train.py:996] (1/4) Epoch 2, batch 2900, loss[loss=0.3164, simple_loss=0.3679, pruned_loss=0.1325, over 21289.00 frames. ], tot_loss[loss=0.3213, simple_loss=0.3724, pruned_loss=0.1351, over 4279630.56 frames. ], batch size: 549, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:44:18,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-18 11:44:39,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.023e+02 4.917e+02 6.862e+02 1.107e+03, threshold=9.834e+02, percent-clipped=8.0 2023-06-18 11:44:47,197 INFO [train.py:996] (1/4) Epoch 2, batch 2950, loss[loss=0.4196, simple_loss=0.4639, pruned_loss=0.1876, over 21606.00 frames. ], tot_loss[loss=0.3225, simple_loss=0.374, pruned_loss=0.1355, over 4288658.26 frames. 
], batch size: 508, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:45:15,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=200730.0, ans=0.125 2023-06-18 11:45:37,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=200790.0, ans=0.125 2023-06-18 11:46:06,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=200910.0, ans=15.0 2023-06-18 11:46:13,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=200910.0, ans=0.125 2023-06-18 11:46:20,631 INFO [train.py:996] (1/4) Epoch 2, batch 3000, loss[loss=0.3585, simple_loss=0.407, pruned_loss=0.155, over 21849.00 frames. ], tot_loss[loss=0.3254, simple_loss=0.3778, pruned_loss=0.1365, over 4294032.70 frames. ], batch size: 118, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:46:20,631 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 11:46:31,162 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.1904, 3.7212, 1.9158, 1.4379], device='cuda:1') 2023-06-18 11:46:36,275 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2851, simple_loss=0.377, pruned_loss=0.09657, over 1796401.00 frames. 2023-06-18 11:46:36,276 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 11:47:01,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=201030.0, ans=0.125 2023-06-18 11:48:07,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.149e+02 4.031e+02 5.261e+02 8.201e+02, threshold=8.061e+02, percent-clipped=0.0 2023-06-18 11:48:15,241 INFO [train.py:996] (1/4) Epoch 2, batch 3050, loss[loss=0.3695, simple_loss=0.4057, pruned_loss=0.1667, over 21277.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3785, pruned_loss=0.136, over 4286730.31 frames. ], batch size: 159, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:48:56,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=201330.0, ans=0.025 2023-06-18 11:49:18,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=201390.0, ans=0.125 2023-06-18 11:49:28,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201450.0, ans=0.1 2023-06-18 11:50:02,708 INFO [train.py:996] (1/4) Epoch 2, batch 3100, loss[loss=0.3413, simple_loss=0.4066, pruned_loss=0.138, over 21761.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3774, pruned_loss=0.1335, over 4281633.59 frames. ], batch size: 332, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:50:26,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=201630.0, ans=0.2 2023-06-18 11:51:06,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-18 11:51:23,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. 
limit=15.0 2023-06-18 11:51:31,172 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.312e+02 4.167e+02 4.991e+02 9.720e+02, threshold=8.334e+02, percent-clipped=2.0 2023-06-18 11:51:34,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=201810.0, ans=0.1 2023-06-18 11:51:37,935 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:51:39,060 INFO [train.py:996] (1/4) Epoch 2, batch 3150, loss[loss=0.3407, simple_loss=0.3901, pruned_loss=0.1456, over 21232.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3779, pruned_loss=0.1337, over 4274763.36 frames. ], batch size: 143, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:51:49,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=201870.0, ans=0.0 2023-06-18 11:51:50,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=201870.0, ans=0.04949747468305833 2023-06-18 11:51:59,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-18 11:52:15,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=201930.0, ans=0.125 2023-06-18 11:52:15,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=201930.0, ans=0.95 2023-06-18 11:52:25,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=201990.0, ans=0.125 2023-06-18 11:52:30,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=201990.0, ans=0.2 2023-06-18 11:53:11,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=202110.0, ans=0.125 2023-06-18 11:53:22,363 INFO [train.py:996] (1/4) Epoch 2, batch 3200, loss[loss=0.314, simple_loss=0.3902, pruned_loss=0.119, over 20848.00 frames. ], tot_loss[loss=0.3229, simple_loss=0.3789, pruned_loss=0.1335, over 4278841.15 frames. ], batch size: 607, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:53:50,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=202230.0, ans=0.0 2023-06-18 11:54:03,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202290.0, ans=0.1 2023-06-18 11:54:45,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=22.5 2023-06-18 11:54:51,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.651e+02 4.650e+02 5.913e+02 1.032e+03, threshold=9.300e+02, percent-clipped=10.0 2023-06-18 11:55:04,314 INFO [train.py:996] (1/4) Epoch 2, batch 3250, loss[loss=0.308, simple_loss=0.3552, pruned_loss=0.1304, over 21889.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3811, pruned_loss=0.1362, over 4273831.77 frames. 
], batch size: 372, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:55:16,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-06-18 11:55:43,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=202590.0, ans=0.125 2023-06-18 11:56:43,221 INFO [train.py:996] (1/4) Epoch 2, batch 3300, loss[loss=0.2777, simple_loss=0.3372, pruned_loss=0.1092, over 21297.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3773, pruned_loss=0.136, over 4271811.92 frames. ], batch size: 211, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:56:50,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=202770.0, ans=0.125 2023-06-18 11:56:51,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=202770.0, ans=0.125 2023-06-18 11:57:31,639 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:57:44,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-18 11:57:47,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=202950.0, ans=0.125 2023-06-18 11:58:08,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.845e+02 4.643e+02 5.721e+02 1.092e+03, threshold=9.285e+02, percent-clipped=5.0 2023-06-18 11:58:16,390 INFO [train.py:996] (1/4) Epoch 2, batch 3350, loss[loss=0.3081, simple_loss=0.3586, pruned_loss=0.1288, over 21258.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3778, pruned_loss=0.135, over 4276559.02 frames. ], batch size: 159, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:58:21,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=203070.0, ans=15.0 2023-06-18 11:59:05,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-18 11:59:16,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=203250.0, ans=0.04949747468305833 2023-06-18 11:59:31,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=203250.0, ans=15.0 2023-06-18 11:59:35,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=203250.0, ans=0.04949747468305833 2023-06-18 11:59:39,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=15.0 2023-06-18 11:59:53,092 INFO [train.py:996] (1/4) Epoch 2, batch 3400, loss[loss=0.3014, simple_loss=0.3688, pruned_loss=0.117, over 21410.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3776, pruned_loss=0.1365, over 4281026.42 frames. 
], batch size: 211, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:59:58,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=203370.0, ans=0.2 2023-06-18 12:00:04,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=203370.0, ans=0.125 2023-06-18 12:00:33,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=203490.0, ans=0.0 2023-06-18 12:01:18,181 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.307e+02 4.139e+02 5.241e+02 1.031e+03, threshold=8.278e+02, percent-clipped=1.0 2023-06-18 12:01:26,572 INFO [train.py:996] (1/4) Epoch 2, batch 3450, loss[loss=0.3433, simple_loss=0.3737, pruned_loss=0.1565, over 21426.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3718, pruned_loss=0.1347, over 4270691.95 frames. ], batch size: 389, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:01:34,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=203670.0, ans=0.0 2023-06-18 12:01:42,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=203730.0, ans=0.125 2023-06-18 12:02:18,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-18 12:02:26,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=203790.0, ans=0.0 2023-06-18 12:02:46,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-18 12:03:00,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=203910.0, ans=0.2 2023-06-18 12:03:02,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=203910.0, ans=0.07 2023-06-18 12:03:06,549 INFO [train.py:996] (1/4) Epoch 2, batch 3500, loss[loss=0.465, simple_loss=0.4963, pruned_loss=0.2168, over 21392.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3822, pruned_loss=0.14, over 4271617.13 frames. ], batch size: 507, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:04:24,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-18 12:04:35,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.628e+02 4.522e+02 5.964e+02 1.068e+03, threshold=9.044e+02, percent-clipped=7.0 2023-06-18 12:04:47,804 INFO [train.py:996] (1/4) Epoch 2, batch 3550, loss[loss=0.2698, simple_loss=0.3085, pruned_loss=0.1156, over 19971.00 frames. ], tot_loss[loss=0.3341, simple_loss=0.3851, pruned_loss=0.1415, over 4270961.42 frames. ], batch size: 702, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:04:55,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-18 12:05:42,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=15.0 2023-06-18 12:05:59,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204450.0, ans=0.1 2023-06-18 12:06:21,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-18 12:06:26,608 INFO [train.py:996] (1/4) Epoch 2, batch 3600, loss[loss=0.3192, simple_loss=0.3757, pruned_loss=0.1314, over 20688.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3775, pruned_loss=0.1391, over 4273707.57 frames. ], batch size: 607, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:06:54,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=204630.0, ans=0.1 2023-06-18 12:07:10,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=22.5 2023-06-18 12:07:33,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=204750.0, ans=0.0 2023-06-18 12:07:57,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.619e+02 4.242e+02 5.545e+02 1.042e+03, threshold=8.484e+02, percent-clipped=1.0 2023-06-18 12:08:03,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-18 12:08:05,224 INFO [train.py:996] (1/4) Epoch 2, batch 3650, loss[loss=0.3552, simple_loss=0.4008, pruned_loss=0.1548, over 21240.00 frames. ], tot_loss[loss=0.3287, simple_loss=0.3783, pruned_loss=0.1396, over 4276806.96 frames. ], batch size: 143, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:09:22,982 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:09:41,417 INFO [train.py:996] (1/4) Epoch 2, batch 3700, loss[loss=0.3159, simple_loss=0.364, pruned_loss=0.1339, over 21795.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.376, pruned_loss=0.1381, over 4276739.62 frames. ], batch size: 441, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:09:59,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=205170.0, ans=0.125 2023-06-18 12:10:05,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205230.0, ans=0.125 2023-06-18 12:11:01,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=205410.0, ans=0.125 2023-06-18 12:11:09,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.219e+02 3.762e+02 4.567e+02 1.013e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-18 12:11:22,717 INFO [train.py:996] (1/4) Epoch 2, batch 3750, loss[loss=0.2778, simple_loss=0.3296, pruned_loss=0.113, over 21596.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3733, pruned_loss=0.1372, over 4279955.40 frames. 
], batch size: 230, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:11:23,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=205470.0, ans=0.125 2023-06-18 12:11:23,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=205470.0, ans=0.2 2023-06-18 12:11:29,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-18 12:11:30,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=205470.0, ans=0.125 2023-06-18 12:11:30,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=205470.0, ans=0.125 2023-06-18 12:12:15,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=205590.0, ans=0.125 2023-06-18 12:12:23,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=205650.0, ans=0.125 2023-06-18 12:12:59,778 INFO [train.py:996] (1/4) Epoch 2, batch 3800, loss[loss=0.4467, simple_loss=0.4633, pruned_loss=0.2151, over 21441.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3702, pruned_loss=0.1345, over 4270823.82 frames. ], batch size: 471, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:13:38,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205890.0, ans=0.1 2023-06-18 12:13:41,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=205890.0, ans=0.0 2023-06-18 12:13:52,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=205890.0, ans=0.0 2023-06-18 12:13:54,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=205950.0, ans=0.125 2023-06-18 12:14:05,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=205950.0, ans=0.0 2023-06-18 12:14:28,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.435e+02 4.330e+02 5.504e+02 8.212e+02, threshold=8.659e+02, percent-clipped=4.0 2023-06-18 12:14:36,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-18 12:14:37,044 INFO [train.py:996] (1/4) Epoch 2, batch 3850, loss[loss=0.3103, simple_loss=0.345, pruned_loss=0.1378, over 21797.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3711, pruned_loss=0.1371, over 4276699.07 frames. ], batch size: 317, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:15:01,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-06-18 12:15:35,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206250.0, ans=0.125 2023-06-18 12:16:13,512 INFO [train.py:996] (1/4) Epoch 2, batch 3900, loss[loss=0.3475, simple_loss=0.3768, pruned_loss=0.1591, over 21790.00 frames. 
], tot_loss[loss=0.3209, simple_loss=0.3671, pruned_loss=0.1373, over 4277991.58 frames. ], batch size: 441, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:16:51,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=206490.0, ans=0.0 2023-06-18 12:16:59,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206490.0, ans=0.125 2023-06-18 12:17:42,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.558e+02 3.735e+02 4.662e+02 6.230e+02 1.205e+03, threshold=9.323e+02, percent-clipped=9.0 2023-06-18 12:17:50,610 INFO [train.py:996] (1/4) Epoch 2, batch 3950, loss[loss=0.3234, simple_loss=0.3939, pruned_loss=0.1264, over 21610.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3681, pruned_loss=0.1354, over 4273011.98 frames. ], batch size: 441, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:18:28,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=206730.0, ans=0.0 2023-06-18 12:18:52,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=206850.0, ans=0.125 2023-06-18 12:19:16,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=15.0 2023-06-18 12:19:27,480 INFO [train.py:996] (1/4) Epoch 2, batch 4000, loss[loss=0.2793, simple_loss=0.3245, pruned_loss=0.117, over 21836.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3609, pruned_loss=0.1313, over 4270515.83 frames. ], batch size: 352, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:19:44,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=206970.0, ans=0.0 2023-06-18 12:19:49,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207030.0, ans=0.125 2023-06-18 12:20:25,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=207090.0, ans=0.5 2023-06-18 12:20:42,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207210.0, ans=0.1 2023-06-18 12:20:50,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.512e+02 4.093e+02 5.285e+02 8.562e+02, threshold=8.187e+02, percent-clipped=0.0 2023-06-18 12:20:59,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-06-18 12:21:02,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-18 12:21:03,154 INFO [train.py:996] (1/4) Epoch 2, batch 4050, loss[loss=0.2875, simple_loss=0.3617, pruned_loss=0.1066, over 21681.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3611, pruned_loss=0.1288, over 4267227.16 frames. ], batch size: 414, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:21:47,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=19.57 vs. 
limit=22.5 2023-06-18 12:22:12,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=207450.0, ans=0.125 2023-06-18 12:22:44,793 INFO [train.py:996] (1/4) Epoch 2, batch 4100, loss[loss=0.2811, simple_loss=0.3475, pruned_loss=0.1074, over 21606.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3622, pruned_loss=0.1294, over 4279734.85 frames. ], batch size: 230, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:22:48,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-18 12:22:50,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=22.5 2023-06-18 12:23:00,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=207630.0, ans=0.125 2023-06-18 12:23:38,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=207690.0, ans=0.0 2023-06-18 12:23:45,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-06-18 12:23:55,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=207810.0, ans=0.0 2023-06-18 12:23:57,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-18 12:24:03,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=207810.0, ans=0.5 2023-06-18 12:24:08,746 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.979e+02 3.498e+02 4.033e+02 7.129e+02, threshold=6.997e+02, percent-clipped=0.0 2023-06-18 12:24:15,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=207810.0, ans=0.125 2023-06-18 12:24:20,657 INFO [train.py:996] (1/4) Epoch 2, batch 4150, loss[loss=0.28, simple_loss=0.3524, pruned_loss=0.1038, over 21690.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3621, pruned_loss=0.125, over 4282518.72 frames. ], batch size: 332, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:24:39,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=207930.0, ans=0.125 2023-06-18 12:25:20,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-18 12:25:53,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=208110.0, ans=0.0 2023-06-18 12:25:58,994 INFO [train.py:996] (1/4) Epoch 2, batch 4200, loss[loss=0.3859, simple_loss=0.4269, pruned_loss=0.1724, over 21558.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.362, pruned_loss=0.1245, over 4284264.73 frames. 
], batch size: 441, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:26:43,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=208290.0, ans=0.0 2023-06-18 12:27:10,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=208350.0, ans=0.125 2023-06-18 12:27:29,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.93 vs. limit=6.0 2023-06-18 12:27:31,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.591e+02 4.541e+02 5.629e+02 1.049e+03, threshold=9.081e+02, percent-clipped=10.0 2023-06-18 12:27:38,033 INFO [train.py:996] (1/4) Epoch 2, batch 4250, loss[loss=0.3297, simple_loss=0.3834, pruned_loss=0.138, over 21363.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3677, pruned_loss=0.1273, over 4274479.06 frames. ], batch size: 159, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:28:06,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-18 12:28:41,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=208650.0, ans=0.2 2023-06-18 12:29:25,629 INFO [train.py:996] (1/4) Epoch 2, batch 4300, loss[loss=0.3412, simple_loss=0.4161, pruned_loss=0.1332, over 21593.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3746, pruned_loss=0.1298, over 4269038.46 frames. ], batch size: 389, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:29:44,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=208830.0, ans=0.0 2023-06-18 12:30:26,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=208950.0, ans=10.0 2023-06-18 12:31:04,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.305e+02 3.923e+02 4.922e+02 1.064e+03, threshold=7.846e+02, percent-clipped=2.0 2023-06-18 12:31:10,102 INFO [train.py:996] (1/4) Epoch 2, batch 4350, loss[loss=0.3335, simple_loss=0.3764, pruned_loss=0.1454, over 21847.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3738, pruned_loss=0.1294, over 4268898.58 frames. ], batch size: 98, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:31:13,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=209070.0, ans=0.125 2023-06-18 12:32:05,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=209250.0, ans=0.125 2023-06-18 12:32:27,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209310.0, ans=0.1 2023-06-18 12:32:43,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=209310.0, ans=0.125 2023-06-18 12:32:47,834 INFO [train.py:996] (1/4) Epoch 2, batch 4400, loss[loss=0.2868, simple_loss=0.3468, pruned_loss=0.1134, over 21776.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.368, pruned_loss=0.1278, over 4273521.44 frames. 
], batch size: 124, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:32:54,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=209370.0, ans=0.0 2023-06-18 12:33:05,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=209370.0, ans=0.07 2023-06-18 12:33:44,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=209550.0, ans=0.125 2023-06-18 12:34:19,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.653e+02 4.609e+02 5.842e+02 1.096e+03, threshold=9.217e+02, percent-clipped=5.0 2023-06-18 12:34:30,761 INFO [train.py:996] (1/4) Epoch 2, batch 4450, loss[loss=0.3869, simple_loss=0.438, pruned_loss=0.1679, over 21697.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3757, pruned_loss=0.1297, over 4275298.41 frames. ], batch size: 441, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:34:33,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 12:35:12,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=209790.0, ans=0.1 2023-06-18 12:35:41,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=209850.0, ans=0.09899494936611666 2023-06-18 12:36:06,126 INFO [train.py:996] (1/4) Epoch 2, batch 4500, loss[loss=0.2909, simple_loss=0.361, pruned_loss=0.1104, over 21488.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3772, pruned_loss=0.1324, over 4279313.03 frames. ], batch size: 194, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:36:11,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=209970.0, ans=0.2 2023-06-18 12:36:56,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=210090.0, ans=0.0 2023-06-18 12:37:22,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=210210.0, ans=0.125 2023-06-18 12:37:24,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=210210.0, ans=0.125 2023-06-18 12:37:32,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=210210.0, ans=10.0 2023-06-18 12:37:37,910 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.555e+02 4.065e+02 4.925e+02 8.814e+02, threshold=8.131e+02, percent-clipped=0.0 2023-06-18 12:37:48,819 INFO [train.py:996] (1/4) Epoch 2, batch 4550, loss[loss=0.2788, simple_loss=0.3134, pruned_loss=0.1221, over 19996.00 frames. ], tot_loss[loss=0.3222, simple_loss=0.3801, pruned_loss=0.1321, over 4281636.39 frames. ], batch size: 703, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:37:59,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-18 12:38:03,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.35 vs. 
limit=22.5 2023-06-18 12:38:12,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210330.0, ans=0.1 2023-06-18 12:38:33,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=210390.0, ans=0.125 2023-06-18 12:38:54,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=210450.0, ans=0.1 2023-06-18 12:39:06,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=210510.0, ans=0.125 2023-06-18 12:39:21,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=210510.0, ans=0.0 2023-06-18 12:39:25,103 INFO [train.py:996] (1/4) Epoch 2, batch 4600, loss[loss=0.2702, simple_loss=0.3345, pruned_loss=0.1029, over 16761.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3846, pruned_loss=0.1354, over 4277703.44 frames. ], batch size: 61, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:39:27,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-18 12:39:30,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=210570.0, ans=0.0 2023-06-18 12:39:35,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=210570.0, ans=0.07 2023-06-18 12:39:43,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=210630.0, ans=0.125 2023-06-18 12:40:56,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 3.566e+02 4.300e+02 5.335e+02 1.700e+03, threshold=8.600e+02, percent-clipped=8.0 2023-06-18 12:41:02,493 INFO [train.py:996] (1/4) Epoch 2, batch 4650, loss[loss=0.2424, simple_loss=0.2976, pruned_loss=0.09359, over 21380.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3798, pruned_loss=0.1333, over 4284036.41 frames. ], batch size: 194, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:41:29,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210930.0, ans=0.125 2023-06-18 12:42:00,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=211050.0, ans=0.0 2023-06-18 12:42:32,547 INFO [train.py:996] (1/4) Epoch 2, batch 4700, loss[loss=0.2798, simple_loss=0.3234, pruned_loss=0.1181, over 21741.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3689, pruned_loss=0.1299, over 4277067.55 frames. ], batch size: 300, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:42:34,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=211170.0, ans=0.125 2023-06-18 12:42:40,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211170.0, ans=0.1 2023-06-18 12:43:12,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. 
limit=10.0 2023-06-18 12:43:24,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=211290.0, ans=0.2 2023-06-18 12:43:32,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=211350.0, ans=0.0 2023-06-18 12:43:50,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=211410.0, ans=0.125 2023-06-18 12:43:56,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=211410.0, ans=0.0 2023-06-18 12:44:00,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.242e+02 4.193e+02 5.636e+02 1.011e+03, threshold=8.385e+02, percent-clipped=2.0 2023-06-18 12:44:03,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-18 12:44:06,697 INFO [train.py:996] (1/4) Epoch 2, batch 4750, loss[loss=0.2853, simple_loss=0.3339, pruned_loss=0.1184, over 21798.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3631, pruned_loss=0.129, over 4276892.73 frames. ], batch size: 282, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:44:21,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-18 12:45:23,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-18 12:45:34,280 INFO [train.py:996] (1/4) Epoch 2, batch 4800, loss[loss=0.3019, simple_loss=0.3795, pruned_loss=0.1122, over 21857.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3648, pruned_loss=0.13, over 4285506.54 frames. ], batch size: 372, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:45:45,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=211770.0, ans=0.04949747468305833 2023-06-18 12:46:01,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=211830.0, ans=0.125 2023-06-18 12:47:01,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.590e+02 4.523e+02 5.544e+02 1.095e+03, threshold=9.046e+02, percent-clipped=1.0 2023-06-18 12:47:07,535 INFO [train.py:996] (1/4) Epoch 2, batch 4850, loss[loss=0.3069, simple_loss=0.3759, pruned_loss=0.119, over 21648.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3634, pruned_loss=0.1294, over 4289098.69 frames. ], batch size: 263, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:47:51,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=212190.0, ans=0.125 2023-06-18 12:48:10,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=212250.0, ans=0.125 2023-06-18 12:48:20,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=212310.0, ans=0.125 2023-06-18 12:48:35,058 INFO [train.py:996] (1/4) Epoch 2, batch 4900, loss[loss=0.4348, simple_loss=0.4623, pruned_loss=0.2037, over 21524.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3659, pruned_loss=0.1315, over 4295897.38 frames. 
], batch size: 471, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:50:00,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=212610.0, ans=0.125 2023-06-18 12:50:06,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.496e+02 4.489e+02 5.539e+02 1.137e+03, threshold=8.978e+02, percent-clipped=3.0 2023-06-18 12:50:13,025 INFO [train.py:996] (1/4) Epoch 2, batch 4950, loss[loss=0.245, simple_loss=0.3245, pruned_loss=0.08275, over 21664.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3674, pruned_loss=0.1287, over 4282239.90 frames. ], batch size: 247, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:50:39,235 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:50:54,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=212790.0, ans=0.0 2023-06-18 12:51:20,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212850.0, ans=0.1 2023-06-18 12:51:46,441 INFO [train.py:996] (1/4) Epoch 2, batch 5000, loss[loss=0.3165, simple_loss=0.3765, pruned_loss=0.1282, over 21790.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3659, pruned_loss=0.1249, over 4281767.96 frames. ], batch size: 298, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:51:46,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=212970.0, ans=0.125 2023-06-18 12:52:07,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=213030.0, ans=0.0 2023-06-18 12:52:52,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-18 12:53:06,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.969e+02 3.666e+02 4.897e+02 8.510e+02, threshold=7.332e+02, percent-clipped=0.0 2023-06-18 12:53:12,737 INFO [train.py:996] (1/4) Epoch 2, batch 5050, loss[loss=0.3584, simple_loss=0.3944, pruned_loss=0.1612, over 21842.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3659, pruned_loss=0.1279, over 4289600.12 frames. ], batch size: 107, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:54:05,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-18 12:54:43,407 INFO [train.py:996] (1/4) Epoch 2, batch 5100, loss[loss=0.2907, simple_loss=0.3651, pruned_loss=0.1082, over 20700.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3639, pruned_loss=0.1287, over 4290935.41 frames. ], batch size: 607, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:56:13,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.389e+02 4.046e+02 5.054e+02 9.083e+02, threshold=8.093e+02, percent-clipped=6.0 2023-06-18 12:56:19,859 INFO [train.py:996] (1/4) Epoch 2, batch 5150, loss[loss=0.3241, simple_loss=0.3739, pruned_loss=0.1371, over 21746.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3638, pruned_loss=0.1293, over 4282922.77 frames. 
], batch size: 389, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:56:20,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=213870.0, ans=0.0 2023-06-18 12:57:02,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=213990.0, ans=0.0 2023-06-18 12:57:05,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=213990.0, ans=0.0 2023-06-18 12:57:07,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=213990.0, ans=0.2 2023-06-18 12:57:52,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=22.5 2023-06-18 12:57:55,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-18 12:57:56,077 INFO [train.py:996] (1/4) Epoch 2, batch 5200, loss[loss=0.2926, simple_loss=0.3734, pruned_loss=0.1059, over 21647.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3678, pruned_loss=0.131, over 4289988.60 frames. ], batch size: 230, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:58:23,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=214230.0, ans=0.0 2023-06-18 12:58:28,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-18 12:59:00,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214350.0, ans=0.1 2023-06-18 12:59:25,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.610e+02 4.791e+02 6.505e+02 1.223e+03, threshold=9.582e+02, percent-clipped=11.0 2023-06-18 12:59:32,051 INFO [train.py:996] (1/4) Epoch 2, batch 5250, loss[loss=0.3595, simple_loss=0.4282, pruned_loss=0.1454, over 21516.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3686, pruned_loss=0.1265, over 4277350.66 frames. ], batch size: 471, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:59:41,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=214470.0, ans=0.1 2023-06-18 12:59:42,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-18 12:59:55,018 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.710e-02 2023-06-18 13:00:30,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-18 13:00:39,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=214650.0, ans=0.125 2023-06-18 13:01:11,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=214770.0, ans=0.125 2023-06-18 13:01:12,301 INFO [train.py:996] (1/4) Epoch 2, batch 5300, loss[loss=0.3241, simple_loss=0.3693, pruned_loss=0.1394, over 21781.00 frames. 
], tot_loss[loss=0.3117, simple_loss=0.3676, pruned_loss=0.1279, over 4284757.47 frames. ], batch size: 389, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:01:27,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=214770.0, ans=0.0 2023-06-18 13:02:22,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214950.0, ans=0.1 2023-06-18 13:02:24,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=215010.0, ans=0.0 2023-06-18 13:02:27,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-18 13:02:35,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 3.046e+02 3.546e+02 4.539e+02 8.571e+02, threshold=7.092e+02, percent-clipped=0.0 2023-06-18 13:02:41,301 INFO [train.py:996] (1/4) Epoch 2, batch 5350, loss[loss=0.2943, simple_loss=0.3474, pruned_loss=0.1206, over 21871.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.368, pruned_loss=0.1299, over 4288371.72 frames. ], batch size: 118, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:02:46,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=215070.0, ans=0.035 2023-06-18 13:03:13,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=215130.0, ans=0.125 2023-06-18 13:03:58,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=215250.0, ans=0.125 2023-06-18 13:04:16,997 INFO [train.py:996] (1/4) Epoch 2, batch 5400, loss[loss=0.2875, simple_loss=0.342, pruned_loss=0.1165, over 21648.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3669, pruned_loss=0.1315, over 4291475.34 frames. ], batch size: 263, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:04:44,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215430.0, ans=0.1 2023-06-18 13:05:03,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=215490.0, ans=0.2 2023-06-18 13:05:03,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215490.0, ans=0.1 2023-06-18 13:05:06,122 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:05:50,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=215610.0, ans=0.0 2023-06-18 13:05:58,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.179e+02 4.117e+02 5.254e+02 8.433e+02, threshold=8.234e+02, percent-clipped=2.0 2023-06-18 13:06:04,227 INFO [train.py:996] (1/4) Epoch 2, batch 5450, loss[loss=0.2787, simple_loss=0.3644, pruned_loss=0.09646, over 21799.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3691, pruned_loss=0.1285, over 4283614.49 frames. 
], batch size: 332, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:06:32,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=215730.0, ans=0.0 2023-06-18 13:07:36,798 INFO [train.py:996] (1/4) Epoch 2, batch 5500, loss[loss=0.2611, simple_loss=0.3459, pruned_loss=0.08812, over 21644.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3715, pruned_loss=0.1239, over 4278848.35 frames. ], batch size: 263, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:07:44,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2023-06-18 13:07:53,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=215970.0, ans=0.0 2023-06-18 13:08:18,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=216090.0, ans=0.2 2023-06-18 13:08:22,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=216090.0, ans=0.125 2023-06-18 13:08:28,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216090.0, ans=0.1 2023-06-18 13:08:49,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=216150.0, ans=0.125 2023-06-18 13:08:53,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.31 vs. limit=6.0 2023-06-18 13:08:57,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=216150.0, ans=0.0 2023-06-18 13:09:13,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 3.111e+02 3.765e+02 4.593e+02 1.085e+03, threshold=7.530e+02, percent-clipped=3.0 2023-06-18 13:09:17,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=216210.0, ans=0.05 2023-06-18 13:09:20,388 INFO [train.py:996] (1/4) Epoch 2, batch 5550, loss[loss=0.3174, simple_loss=0.3916, pruned_loss=0.1216, over 19917.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3706, pruned_loss=0.1211, over 4283652.67 frames. ], batch size: 702, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:09:49,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=216330.0, ans=0.125 2023-06-18 13:10:34,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-18 13:10:37,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=216450.0, ans=0.125 2023-06-18 13:10:40,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=216510.0, ans=0.125 2023-06-18 13:10:45,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=216510.0, ans=10.0 2023-06-18 13:10:52,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=216510.0, ans=0.125 2023-06-18 13:11:01,291 INFO [train.py:996] (1/4) Epoch 2, batch 5600, loss[loss=0.2827, simple_loss=0.3613, pruned_loss=0.1021, over 21733.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3699, pruned_loss=0.1182, over 4281760.49 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:11:10,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-18 13:11:11,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=216570.0, ans=0.0 2023-06-18 13:11:42,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=216690.0, ans=0.0 2023-06-18 13:11:57,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=216750.0, ans=0.0 2023-06-18 13:11:57,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=216750.0, ans=0.04949747468305833 2023-06-18 13:12:08,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216750.0, ans=0.1 2023-06-18 13:12:12,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216750.0, ans=0.1 2023-06-18 13:12:25,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.142e+02 3.780e+02 5.340e+02 1.337e+03, threshold=7.560e+02, percent-clipped=11.0 2023-06-18 13:12:27,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=216810.0, ans=0.2 2023-06-18 13:12:36,214 INFO [train.py:996] (1/4) Epoch 2, batch 5650, loss[loss=0.3953, simple_loss=0.4232, pruned_loss=0.1837, over 21698.00 frames. ], tot_loss[loss=0.3095, simple_loss=0.3752, pruned_loss=0.1219, over 4277375.17 frames. 
], batch size: 507, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:13:09,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=216930.0, ans=0.1 2023-06-18 13:13:12,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=216990.0, ans=0.0 2023-06-18 13:13:22,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=216990.0, ans=0.125 2023-06-18 13:14:02,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=217110.0, ans=0.025 2023-06-18 13:14:09,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=217110.0, ans=0.125 2023-06-18 13:14:11,968 INFO [train.py:996] (1/4) Epoch 2, batch 5700, loss[loss=0.3463, simple_loss=0.4333, pruned_loss=0.1296, over 20748.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3754, pruned_loss=0.1246, over 4276242.64 frames. ], batch size: 608, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:14:46,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-18 13:15:22,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-18 13:15:41,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-18 13:15:43,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.42 vs. limit=15.0 2023-06-18 13:15:43,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 3.148e+02 3.823e+02 4.974e+02 1.006e+03, threshold=7.646e+02, percent-clipped=5.0 2023-06-18 13:15:49,886 INFO [train.py:996] (1/4) Epoch 2, batch 5750, loss[loss=0.2442, simple_loss=0.3328, pruned_loss=0.07779, over 21725.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3724, pruned_loss=0.1212, over 4268107.92 frames. ], batch size: 351, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:15:51,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=217470.0, ans=0.2 2023-06-18 13:16:08,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217470.0, ans=0.125 2023-06-18 13:16:19,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=217530.0, ans=0.0 2023-06-18 13:16:20,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=217530.0, ans=0.125 2023-06-18 13:16:44,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=217590.0, ans=0.0 2023-06-18 13:17:01,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.15 vs. 
limit=22.5 2023-06-18 13:17:06,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=217650.0, ans=0.025 2023-06-18 13:17:41,292 INFO [train.py:996] (1/4) Epoch 2, batch 5800, loss[loss=0.2904, simple_loss=0.3354, pruned_loss=0.1227, over 21239.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3712, pruned_loss=0.1201, over 4270329.57 frames. ], batch size: 608, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:18:05,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-18 13:18:11,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217830.0, ans=0.1 2023-06-18 13:19:13,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 3.013e+02 4.071e+02 4.851e+02 8.760e+02, threshold=8.142e+02, percent-clipped=2.0 2023-06-18 13:19:19,673 INFO [train.py:996] (1/4) Epoch 2, batch 5850, loss[loss=0.314, simple_loss=0.3851, pruned_loss=0.1214, over 21457.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3635, pruned_loss=0.1129, over 4269306.12 frames. ], batch size: 507, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:19:30,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=218070.0, ans=0.125 2023-06-18 13:19:33,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-18 13:19:34,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=218070.0, ans=0.125 2023-06-18 13:19:41,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:19:41,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:20:50,781 INFO [train.py:996] (1/4) Epoch 2, batch 5900, loss[loss=0.2545, simple_loss=0.3187, pruned_loss=0.09513, over 21809.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3518, pruned_loss=0.1033, over 4275400.16 frames. ], batch size: 282, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:20:51,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=218370.0, ans=0.2 2023-06-18 13:21:06,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=218370.0, ans=0.125 2023-06-18 13:21:18,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=12.0 2023-06-18 13:21:30,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=218490.0, ans=0.0 2023-06-18 13:22:19,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 3.163e+02 4.084e+02 5.462e+02 1.507e+03, threshold=8.168e+02, percent-clipped=5.0 2023-06-18 13:22:25,383 INFO [train.py:996] (1/4) Epoch 2, batch 5950, loss[loss=0.3686, simple_loss=0.3941, pruned_loss=0.1715, over 21659.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3524, pruned_loss=0.1081, over 4275226.42 frames. 
], batch size: 441, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:23:02,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218730.0, ans=0.1 2023-06-18 13:23:14,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=218790.0, ans=0.125 2023-06-18 13:23:46,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=218910.0, ans=0.125 2023-06-18 13:24:04,242 INFO [train.py:996] (1/4) Epoch 2, batch 6000, loss[loss=0.2775, simple_loss=0.3138, pruned_loss=0.1206, over 21338.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3503, pruned_loss=0.114, over 4283244.33 frames. ], batch size: 177, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:24:04,243 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 13:24:20,114 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2916, simple_loss=0.3878, pruned_loss=0.09771, over 1796401.00 frames. 2023-06-18 13:24:20,114 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 13:24:22,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=218970.0, ans=0.0 2023-06-18 13:24:28,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=218970.0, ans=0.125 2023-06-18 13:24:42,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=219030.0, ans=0.0 2023-06-18 13:25:00,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=219090.0, ans=0.02 2023-06-18 13:25:35,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=219150.0, ans=0.05 2023-06-18 13:25:45,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=219210.0, ans=0.125 2023-06-18 13:25:51,503 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 3.962e+02 4.700e+02 6.169e+02 1.115e+03, threshold=9.400e+02, percent-clipped=12.0 2023-06-18 13:25:57,774 INFO [train.py:996] (1/4) Epoch 2, batch 6050, loss[loss=0.2261, simple_loss=0.289, pruned_loss=0.08164, over 21639.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3475, pruned_loss=0.117, over 4278148.05 frames. ], batch size: 247, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:26:45,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=219390.0, ans=0.125 2023-06-18 13:27:32,994 INFO [train.py:996] (1/4) Epoch 2, batch 6100, loss[loss=0.3262, simple_loss=0.3494, pruned_loss=0.1515, over 21399.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3446, pruned_loss=0.1148, over 4277912.51 frames. ], batch size: 476, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:27:47,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=219570.0, ans=0.04949747468305833 2023-06-18 13:27:51,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. 
limit=8.0 2023-06-18 13:28:06,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=219690.0, ans=0.05 2023-06-18 13:28:48,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=219750.0, ans=0.125 2023-06-18 13:28:58,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-18 13:29:03,478 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.959e+02 3.646e+02 4.733e+02 1.048e+03, threshold=7.291e+02, percent-clipped=1.0 2023-06-18 13:29:09,586 INFO [train.py:996] (1/4) Epoch 2, batch 6150, loss[loss=0.3404, simple_loss=0.3757, pruned_loss=0.1526, over 21635.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3495, pruned_loss=0.1195, over 4289652.66 frames. ], batch size: 471, lr: 1.84e-02, grad_scale: 64.0 2023-06-18 13:29:22,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=16.39 vs. limit=22.5 2023-06-18 13:29:37,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=219930.0, ans=0.125 2023-06-18 13:30:07,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=219990.0, ans=0.0 2023-06-18 13:30:08,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=220050.0, ans=0.125 2023-06-18 13:30:36,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=220110.0, ans=0.125 2023-06-18 13:30:48,833 INFO [train.py:996] (1/4) Epoch 2, batch 6200, loss[loss=0.3566, simple_loss=0.3932, pruned_loss=0.16, over 21734.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3526, pruned_loss=0.12, over 4289412.92 frames. ], batch size: 112, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:31:20,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220230.0, ans=0.1 2023-06-18 13:31:36,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=220290.0, ans=0.125 2023-06-18 13:32:01,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=220350.0, ans=0.95 2023-06-18 13:32:07,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=220410.0, ans=0.1 2023-06-18 13:32:20,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.171e+02 4.018e+02 5.849e+02 1.001e+03, threshold=8.035e+02, percent-clipped=11.0 2023-06-18 13:32:25,577 INFO [train.py:996] (1/4) Epoch 2, batch 6250, loss[loss=0.2653, simple_loss=0.3606, pruned_loss=0.08494, over 21706.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.357, pruned_loss=0.1193, over 4286058.79 frames. 
], batch size: 298, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:32:49,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=220530.0, ans=0.07 2023-06-18 13:33:00,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=220530.0, ans=0.0 2023-06-18 13:33:11,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=220590.0, ans=0.125 2023-06-18 13:33:14,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=220590.0, ans=0.125 2023-06-18 13:33:25,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=220650.0, ans=0.1 2023-06-18 13:33:59,269 INFO [train.py:996] (1/4) Epoch 2, batch 6300, loss[loss=0.3414, simple_loss=0.3807, pruned_loss=0.151, over 21874.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3622, pruned_loss=0.1188, over 4280144.40 frames. ], batch size: 107, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:33:59,637 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:34:02,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=220770.0, ans=0.0 2023-06-18 13:34:06,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=220770.0, ans=0.125 2023-06-18 13:34:52,627 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.262e-03 2023-06-18 13:35:30,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.155e+02 3.817e+02 5.474e+02 1.365e+03, threshold=7.634e+02, percent-clipped=9.0 2023-06-18 13:35:32,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=221010.0, ans=0.0 2023-06-18 13:35:35,447 INFO [train.py:996] (1/4) Epoch 2, batch 6350, loss[loss=0.3408, simple_loss=0.3858, pruned_loss=0.1479, over 21813.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3684, pruned_loss=0.1244, over 4279015.50 frames. ], batch size: 441, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:35:50,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=221070.0, ans=0.5 2023-06-18 13:36:17,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=221130.0, ans=0.125 2023-06-18 13:36:23,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=221190.0, ans=0.1 2023-06-18 13:36:36,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=221250.0, ans=0.125 2023-06-18 13:37:17,729 INFO [train.py:996] (1/4) Epoch 2, batch 6400, loss[loss=0.3207, simple_loss=0.3702, pruned_loss=0.1356, over 21596.00 frames. ], tot_loss[loss=0.3182, simple_loss=0.3755, pruned_loss=0.1304, over 4284063.54 frames. 
], batch size: 263, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:37:57,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=221490.0, ans=0.125 2023-06-18 13:38:01,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-18 13:38:12,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-18 13:38:12,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=221490.0, ans=0.125 2023-06-18 13:38:20,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=221550.0, ans=0.125 2023-06-18 13:38:49,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=221610.0, ans=0.125 2023-06-18 13:38:53,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.333e+02 3.952e+02 5.090e+02 9.873e+02, threshold=7.903e+02, percent-clipped=3.0 2023-06-18 13:38:57,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=221670.0, ans=0.2 2023-06-18 13:38:58,444 INFO [train.py:996] (1/4) Epoch 2, batch 6450, loss[loss=0.2465, simple_loss=0.3398, pruned_loss=0.07659, over 21698.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3763, pruned_loss=0.13, over 4284395.03 frames. ], batch size: 298, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:39:03,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=221670.0, ans=0.1 2023-06-18 13:40:35,149 INFO [train.py:996] (1/4) Epoch 2, batch 6500, loss[loss=0.2942, simple_loss=0.3365, pruned_loss=0.1259, over 21979.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3667, pruned_loss=0.1278, over 4279207.03 frames. ], batch size: 103, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:41:04,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222030.0, ans=0.125 2023-06-18 13:41:11,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. 
limit=6.0 2023-06-18 13:41:16,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222090.0, ans=0.1 2023-06-18 13:41:39,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222150.0, ans=0.1 2023-06-18 13:41:58,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=222210.0, ans=0.125 2023-06-18 13:42:05,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=222210.0, ans=0.125 2023-06-18 13:42:06,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.085e+02 3.485e+02 4.361e+02 6.672e+02, threshold=6.971e+02, percent-clipped=0.0 2023-06-18 13:42:10,622 INFO [train.py:996] (1/4) Epoch 2, batch 6550, loss[loss=0.2918, simple_loss=0.3727, pruned_loss=0.1054, over 21670.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3665, pruned_loss=0.1271, over 4275359.90 frames. ], batch size: 298, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:43:39,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=222510.0, ans=0.125 2023-06-18 13:43:46,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-18 13:43:48,121 INFO [train.py:996] (1/4) Epoch 2, batch 6600, loss[loss=0.3156, simple_loss=0.3479, pruned_loss=0.1416, over 21753.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3623, pruned_loss=0.1274, over 4265792.51 frames. ], batch size: 317, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:44:02,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-18 13:44:55,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-18 13:45:18,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222810.0, ans=0.1 2023-06-18 13:45:19,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.099e+02 3.990e+02 5.465e+02 1.147e+03, threshold=7.980e+02, percent-clipped=13.0 2023-06-18 13:45:28,359 INFO [train.py:996] (1/4) Epoch 2, batch 6650, loss[loss=0.2397, simple_loss=0.2794, pruned_loss=0.09997, over 15241.00 frames. ], tot_loss[loss=0.3005, simple_loss=0.3545, pruned_loss=0.1232, over 4262320.39 frames. ], batch size: 61, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:45:41,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=222870.0, ans=0.125 2023-06-18 13:45:41,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. 
limit=15.0 2023-06-18 13:45:50,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222930.0, ans=0.1 2023-06-18 13:46:04,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=222990.0, ans=0.1 2023-06-18 13:46:46,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=223110.0, ans=0.125 2023-06-18 13:47:06,249 INFO [train.py:996] (1/4) Epoch 2, batch 6700, loss[loss=0.2601, simple_loss=0.3157, pruned_loss=0.1022, over 21121.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3494, pruned_loss=0.1223, over 4262473.34 frames. ], batch size: 143, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:47:36,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=223230.0, ans=0.1 2023-06-18 13:47:50,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=12.0 2023-06-18 13:48:23,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=223410.0, ans=0.125 2023-06-18 13:48:25,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=223410.0, ans=0.125 2023-06-18 13:48:31,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=223410.0, ans=0.0 2023-06-18 13:48:32,193 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.773e+02 4.498e+02 5.331e+02 9.291e+02, threshold=8.996e+02, percent-clipped=2.0 2023-06-18 13:48:41,553 INFO [train.py:996] (1/4) Epoch 2, batch 6750, loss[loss=0.2891, simple_loss=0.3344, pruned_loss=0.1218, over 21813.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3471, pruned_loss=0.1227, over 4263775.18 frames. ], batch size: 102, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:48:48,166 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:48:57,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=223530.0, ans=0.0 2023-06-18 13:49:06,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=223530.0, ans=0.125 2023-06-18 13:49:45,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=223650.0, ans=0.125 2023-06-18 13:49:45,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-18 13:50:17,681 INFO [train.py:996] (1/4) Epoch 2, batch 6800, loss[loss=0.3128, simple_loss=0.3493, pruned_loss=0.1382, over 21807.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3496, pruned_loss=0.1259, over 4267954.16 frames. 
], batch size: 124, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:50:41,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=223830.0, ans=0.0 2023-06-18 13:50:51,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=223890.0, ans=0.125 2023-06-18 13:51:33,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=224010.0, ans=0.0 2023-06-18 13:51:40,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=224010.0, ans=0.0 2023-06-18 13:51:43,139 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 3.005e+02 3.766e+02 4.478e+02 7.220e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-18 13:51:52,442 INFO [train.py:996] (1/4) Epoch 2, batch 6850, loss[loss=0.2612, simple_loss=0.3054, pruned_loss=0.1085, over 16105.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3462, pruned_loss=0.1268, over 4255084.15 frames. ], batch size: 66, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:51:52,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=224070.0, ans=0.0 2023-06-18 13:51:59,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=224070.0, ans=0.0 2023-06-18 13:52:06,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-18 13:52:12,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=224130.0, ans=0.125 2023-06-18 13:52:53,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=224250.0, ans=0.125 2023-06-18 13:53:28,605 INFO [train.py:996] (1/4) Epoch 2, batch 6900, loss[loss=0.2594, simple_loss=0.2911, pruned_loss=0.1138, over 20279.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3456, pruned_loss=0.127, over 4260911.89 frames. ], batch size: 703, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:55:00,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.118e+02 3.769e+02 5.122e+02 8.656e+02, threshold=7.539e+02, percent-clipped=2.0 2023-06-18 13:55:05,537 INFO [train.py:996] (1/4) Epoch 2, batch 6950, loss[loss=0.3133, simple_loss=0.4034, pruned_loss=0.1116, over 20742.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.349, pruned_loss=0.1226, over 4266182.13 frames. ], batch size: 607, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:55:15,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=224670.0, ans=0.0 2023-06-18 13:55:19,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=224730.0, ans=0.2 2023-06-18 13:56:23,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=224910.0, ans=0.125 2023-06-18 13:56:39,914 INFO [train.py:996] (1/4) Epoch 2, batch 7000, loss[loss=0.3232, simple_loss=0.3603, pruned_loss=0.143, over 21832.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3535, pruned_loss=0.1266, over 4271277.20 frames. 
], batch size: 372, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:57:11,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=225030.0, ans=0.125 2023-06-18 13:58:04,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=225210.0, ans=0.125 2023-06-18 13:58:12,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.463e+02 4.256e+02 5.508e+02 8.252e+02, threshold=8.512e+02, percent-clipped=6.0 2023-06-18 13:58:16,987 INFO [train.py:996] (1/4) Epoch 2, batch 7050, loss[loss=0.2596, simple_loss=0.3192, pruned_loss=0.09995, over 21197.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3514, pruned_loss=0.1243, over 4265097.94 frames. ], batch size: 159, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:58:59,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=225390.0, ans=0.1 2023-06-18 13:59:10,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.66 vs. limit=15.0 2023-06-18 13:59:44,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=225510.0, ans=0.125 2023-06-18 13:59:53,608 INFO [train.py:996] (1/4) Epoch 2, batch 7100, loss[loss=0.2417, simple_loss=0.3073, pruned_loss=0.088, over 21331.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3574, pruned_loss=0.1266, over 4257212.33 frames. ], batch size: 194, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:59:54,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-18 14:01:12,809 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:01:19,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=225810.0, ans=0.125 2023-06-18 14:01:28,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.245e+02 4.248e+02 6.112e+02 1.073e+03, threshold=8.497e+02, percent-clipped=3.0 2023-06-18 14:01:30,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-18 14:01:31,271 INFO [train.py:996] (1/4) Epoch 2, batch 7150, loss[loss=0.2289, simple_loss=0.2924, pruned_loss=0.08273, over 21394.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.354, pruned_loss=0.1222, over 4256442.54 frames. ], batch size: 211, lr: 1.81e-02, grad_scale: 16.0 2023-06-18 14:01:41,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-18 14:02:28,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=225990.0, ans=0.035 2023-06-18 14:02:28,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=225990.0, ans=0.0 2023-06-18 14:02:33,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. 
limit=15.0 2023-06-18 14:02:51,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=226110.0, ans=0.125 2023-06-18 14:02:54,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=226110.0, ans=0.125 2023-06-18 14:03:08,077 INFO [train.py:996] (1/4) Epoch 2, batch 7200, loss[loss=0.2518, simple_loss=0.3027, pruned_loss=0.1004, over 21396.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3579, pruned_loss=0.1268, over 4257733.97 frames. ], batch size: 160, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:03:40,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=226230.0, ans=0.125 2023-06-18 14:03:42,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=226230.0, ans=0.125 2023-06-18 14:04:40,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 3.455e+02 4.315e+02 5.205e+02 7.912e+02, threshold=8.629e+02, percent-clipped=0.0 2023-06-18 14:04:47,725 INFO [train.py:996] (1/4) Epoch 2, batch 7250, loss[loss=0.2934, simple_loss=0.3304, pruned_loss=0.1281, over 21624.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3529, pruned_loss=0.1268, over 4259838.69 frames. ], batch size: 247, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:04:48,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=226470.0, ans=0.09899494936611666 2023-06-18 14:05:54,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=226650.0, ans=0.125 2023-06-18 14:06:12,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=8.0 2023-06-18 14:06:28,493 INFO [train.py:996] (1/4) Epoch 2, batch 7300, loss[loss=0.2933, simple_loss=0.3321, pruned_loss=0.1272, over 21522.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3464, pruned_loss=0.1258, over 4263394.83 frames. ], batch size: 391, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:07:07,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=226890.0, ans=0.125 2023-06-18 14:07:11,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=226890.0, ans=0.0 2023-06-18 14:07:29,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=226950.0, ans=0.0 2023-06-18 14:08:03,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.035e+02 3.518e+02 4.361e+02 7.798e+02, threshold=7.035e+02, percent-clipped=0.0 2023-06-18 14:08:07,006 INFO [train.py:996] (1/4) Epoch 2, batch 7350, loss[loss=0.3079, simple_loss=0.4062, pruned_loss=0.1048, over 19682.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3441, pruned_loss=0.125, over 4261720.80 frames. 
], batch size: 703, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:08:39,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=227130.0, ans=0.125 2023-06-18 14:08:40,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=227130.0, ans=0.025 2023-06-18 14:09:08,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=227250.0, ans=0.0 2023-06-18 14:09:42,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.14 vs. limit=15.0 2023-06-18 14:09:46,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=227310.0, ans=0.125 2023-06-18 14:09:50,803 INFO [train.py:996] (1/4) Epoch 2, batch 7400, loss[loss=0.3978, simple_loss=0.4308, pruned_loss=0.1825, over 21743.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3541, pruned_loss=0.1295, over 4264211.39 frames. ], batch size: 441, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:09:54,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=227370.0, ans=0.125 2023-06-18 14:10:16,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-18 14:10:41,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=227550.0, ans=0.025 2023-06-18 14:10:46,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=227550.0, ans=0.5 2023-06-18 14:10:59,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-18 14:11:22,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=227610.0, ans=0.125 2023-06-18 14:11:25,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.593e+02 4.529e+02 5.644e+02 1.003e+03, threshold=9.058e+02, percent-clipped=10.0 2023-06-18 14:11:29,111 INFO [train.py:996] (1/4) Epoch 2, batch 7450, loss[loss=0.2809, simple_loss=0.3213, pruned_loss=0.1203, over 21523.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3533, pruned_loss=0.1275, over 4263908.77 frames. ], batch size: 195, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:12:13,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=227790.0, ans=0.04949747468305833 2023-06-18 14:12:20,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. 
limit=15.0 2023-06-18 14:12:39,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227850.0, ans=0.125 2023-06-18 14:12:44,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=227850.0, ans=0.125 2023-06-18 14:13:07,152 INFO [train.py:996] (1/4) Epoch 2, batch 7500, loss[loss=0.4083, simple_loss=0.4529, pruned_loss=0.1819, over 21668.00 frames. ], tot_loss[loss=0.3098, simple_loss=0.3603, pruned_loss=0.1296, over 4266090.86 frames. ], batch size: 441, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:14:10,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=228150.0, ans=0.0 2023-06-18 14:14:15,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=228150.0, ans=0.125 2023-06-18 14:14:20,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=228150.0, ans=0.125 2023-06-18 14:14:28,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=228150.0, ans=0.125 2023-06-18 14:14:34,077 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:14:42,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.252e+02 3.894e+02 4.815e+02 8.018e+02, threshold=7.787e+02, percent-clipped=0.0 2023-06-18 14:14:45,739 INFO [train.py:996] (1/4) Epoch 2, batch 7550, loss[loss=0.266, simple_loss=0.3023, pruned_loss=0.1148, over 20148.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3662, pruned_loss=0.1265, over 4260855.10 frames. ], batch size: 703, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:14:58,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228270.0, ans=0.1 2023-06-18 14:15:32,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.24 vs. limit=15.0 2023-06-18 14:16:20,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=228510.0, ans=0.0 2023-06-18 14:16:22,898 INFO [train.py:996] (1/4) Epoch 2, batch 7600, loss[loss=0.2635, simple_loss=0.3329, pruned_loss=0.09707, over 21446.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3668, pruned_loss=0.1271, over 4272122.81 frames. ], batch size: 194, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:16:31,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=228570.0, ans=0.0 2023-06-18 14:16:55,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=12.0 2023-06-18 14:17:36,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228750.0, ans=0.1 2023-06-18 14:17:39,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=228810.0, ans=0.125 2023-06-18 14:17:51,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.725e+02 4.604e+02 5.632e+02 9.928e+02, threshold=9.208e+02, percent-clipped=8.0 2023-06-18 14:17:54,578 INFO [train.py:996] (1/4) Epoch 2, batch 7650, loss[loss=0.3276, simple_loss=0.363, pruned_loss=0.1461, over 21946.00 frames. ], tot_loss[loss=0.3138, simple_loss=0.3675, pruned_loss=0.13, over 4272902.76 frames. ], batch size: 316, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:17:59,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=228870.0, ans=0.0 2023-06-18 14:18:12,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=228930.0, ans=0.2 2023-06-18 14:19:05,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=229050.0, ans=0.125 2023-06-18 14:19:27,521 INFO [train.py:996] (1/4) Epoch 2, batch 7700, loss[loss=0.3203, simple_loss=0.3639, pruned_loss=0.1384, over 19959.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3676, pruned_loss=0.1319, over 4271851.57 frames. ], batch size: 702, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:19:34,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=229170.0, ans=0.125 2023-06-18 14:20:39,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=229350.0, ans=0.0 2023-06-18 14:20:47,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=229410.0, ans=0.125 2023-06-18 14:20:59,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.706e+02 4.565e+02 6.512e+02 1.080e+03, threshold=9.129e+02, percent-clipped=5.0 2023-06-18 14:21:02,953 INFO [train.py:996] (1/4) Epoch 2, batch 7750, loss[loss=0.3486, simple_loss=0.4185, pruned_loss=0.1393, over 21725.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.372, pruned_loss=0.132, over 4271051.41 frames. ], batch size: 247, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:21:09,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=229470.0, ans=0.125 2023-06-18 14:21:44,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=229530.0, ans=0.125 2023-06-18 14:22:05,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0 2023-06-18 14:22:06,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=229590.0, ans=0.0 2023-06-18 14:22:15,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=229650.0, ans=0.125 2023-06-18 14:22:40,244 INFO [train.py:996] (1/4) Epoch 2, batch 7800, loss[loss=0.2584, simple_loss=0.2589, pruned_loss=0.1289, over 17056.00 frames. 
], tot_loss[loss=0.317, simple_loss=0.372, pruned_loss=0.131, over 4268033.69 frames. ], batch size: 60, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:22:48,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=229770.0, ans=0.0 2023-06-18 14:23:28,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=229890.0, ans=0.125 2023-06-18 14:23:41,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=229890.0, ans=0.0 2023-06-18 14:23:58,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=229950.0, ans=0.125 2023-06-18 14:24:13,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.566e+02 4.138e+02 5.286e+02 1.209e+03, threshold=8.275e+02, percent-clipped=5.0 2023-06-18 14:24:15,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=230070.0, ans=0.04949747468305833 2023-06-18 14:24:16,390 INFO [train.py:996] (1/4) Epoch 2, batch 7850, loss[loss=0.2747, simple_loss=0.3215, pruned_loss=0.114, over 21658.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3658, pruned_loss=0.1289, over 4266802.71 frames. ], batch size: 333, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:24:17,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.76 vs. limit=15.0 2023-06-18 14:24:19,747 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:25:23,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=230250.0, ans=0.1 2023-06-18 14:25:50,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230310.0, ans=0.125 2023-06-18 14:25:55,463 INFO [train.py:996] (1/4) Epoch 2, batch 7900, loss[loss=0.2957, simple_loss=0.3777, pruned_loss=0.1068, over 21739.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3615, pruned_loss=0.1282, over 4264740.57 frames. ], batch size: 332, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:26:29,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=230430.0, ans=0.125 2023-06-18 14:26:52,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=230490.0, ans=0.125 2023-06-18 14:27:04,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-18 14:27:16,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-18 14:27:29,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.545e+02 4.486e+02 5.981e+02 1.155e+03, threshold=8.972e+02, percent-clipped=9.0 2023-06-18 14:27:32,831 INFO [train.py:996] (1/4) Epoch 2, batch 7950, loss[loss=0.3363, simple_loss=0.3683, pruned_loss=0.1522, over 20121.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3646, pruned_loss=0.1279, over 4258036.84 frames. 
], batch size: 703, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:28:03,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=230670.0, ans=0.04949747468305833 2023-06-18 14:28:29,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=230790.0, ans=0.1 2023-06-18 14:29:07,188 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 14:29:26,741 INFO [train.py:996] (1/4) Epoch 2, batch 8000, loss[loss=0.3088, simple_loss=0.4182, pruned_loss=0.09976, over 19928.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3714, pruned_loss=0.1318, over 4257235.89 frames. ], batch size: 702, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:29:47,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-18 14:30:48,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=231210.0, ans=0.125 2023-06-18 14:30:58,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.385e+02 3.226e+02 3.981e+02 5.095e+02 8.184e+02, threshold=7.963e+02, percent-clipped=0.0 2023-06-18 14:31:00,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231270.0, ans=0.1 2023-06-18 14:31:02,181 INFO [train.py:996] (1/4) Epoch 2, batch 8050, loss[loss=0.3637, simple_loss=0.4068, pruned_loss=0.1603, over 20022.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.377, pruned_loss=0.132, over 4258355.28 frames. ], batch size: 702, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:31:06,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.10 vs. limit=15.0 2023-06-18 14:31:20,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.76 vs. limit=15.0 2023-06-18 14:32:26,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-18 14:32:26,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=15.0 2023-06-18 14:32:32,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=231510.0, ans=0.125 2023-06-18 14:32:43,059 INFO [train.py:996] (1/4) Epoch 2, batch 8100, loss[loss=0.2996, simple_loss=0.3589, pruned_loss=0.1201, over 21767.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3747, pruned_loss=0.1326, over 4266436.79 frames. ], batch size: 112, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:33:07,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=231630.0, ans=0.0 2023-06-18 14:33:18,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.89 vs. 
limit=22.5 2023-06-18 14:33:30,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=231690.0, ans=0.125 2023-06-18 14:34:21,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.933e+02 5.153e+02 6.580e+02 1.761e+03, threshold=1.031e+03, percent-clipped=12.0 2023-06-18 14:34:24,582 INFO [train.py:996] (1/4) Epoch 2, batch 8150, loss[loss=0.3367, simple_loss=0.3548, pruned_loss=0.1593, over 20249.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3794, pruned_loss=0.1345, over 4268717.25 frames. ], batch size: 703, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:34:28,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=231870.0, ans=0.2 2023-06-18 14:35:06,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=231930.0, ans=0.125 2023-06-18 14:35:12,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=231990.0, ans=0.0 2023-06-18 14:35:21,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-18 14:35:33,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=232050.0, ans=0.2 2023-06-18 14:35:42,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=232110.0, ans=0.0 2023-06-18 14:35:44,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=232110.0, ans=0.125 2023-06-18 14:35:56,576 INFO [train.py:996] (1/4) Epoch 2, batch 8200, loss[loss=0.2743, simple_loss=0.3243, pruned_loss=0.1121, over 21614.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3709, pruned_loss=0.1309, over 4268511.79 frames. ], batch size: 298, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:36:16,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=232230.0, ans=0.125 2023-06-18 14:36:56,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-18 14:37:08,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=232350.0, ans=0.125 2023-06-18 14:37:26,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.508e+02 4.438e+02 6.300e+02 1.246e+03, threshold=8.875e+02, percent-clipped=2.0 2023-06-18 14:37:29,919 INFO [train.py:996] (1/4) Epoch 2, batch 8250, loss[loss=0.3305, simple_loss=0.4109, pruned_loss=0.1251, over 21770.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3703, pruned_loss=0.1307, over 4259963.20 frames. ], batch size: 351, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:38:21,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=232590.0, ans=0.0 2023-06-18 14:38:23,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. 
limit=22.5 2023-06-18 14:38:24,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=232590.0, ans=0.2 2023-06-18 14:38:38,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=232650.0, ans=0.0 2023-06-18 14:39:08,468 INFO [train.py:996] (1/4) Epoch 2, batch 8300, loss[loss=0.2914, simple_loss=0.3628, pruned_loss=0.11, over 21731.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3698, pruned_loss=0.1288, over 4256548.78 frames. ], batch size: 332, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:39:24,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232770.0, ans=0.1 2023-06-18 14:39:27,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=232770.0, ans=0.07 2023-06-18 14:39:29,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=232770.0, ans=0.125 2023-06-18 14:39:41,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=232830.0, ans=0.125 2023-06-18 14:40:30,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=233010.0, ans=0.125 2023-06-18 14:40:38,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 3.024e+02 4.156e+02 5.477e+02 9.498e+02, threshold=8.312e+02, percent-clipped=2.0 2023-06-18 14:40:46,284 INFO [train.py:996] (1/4) Epoch 2, batch 8350, loss[loss=0.3291, simple_loss=0.3762, pruned_loss=0.141, over 20048.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.3666, pruned_loss=0.1249, over 4258908.32 frames. ], batch size: 703, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:41:23,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=233130.0, ans=0.0 2023-06-18 14:41:48,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-18 14:41:53,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=233250.0, ans=0.0 2023-06-18 14:41:56,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=233310.0, ans=0.0 2023-06-18 14:42:04,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233310.0, ans=0.125 2023-06-18 14:42:05,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-18 14:42:18,638 INFO [train.py:996] (1/4) Epoch 2, batch 8400, loss[loss=0.1903, simple_loss=0.2434, pruned_loss=0.06859, over 16296.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3628, pruned_loss=0.1208, over 4257470.60 frames. 
], batch size: 61, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:43:01,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=233490.0, ans=0.0 2023-06-18 14:43:47,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 3.200e+02 3.844e+02 5.205e+02 8.692e+02, threshold=7.689e+02, percent-clipped=1.0 2023-06-18 14:43:55,595 INFO [train.py:996] (1/4) Epoch 2, batch 8450, loss[loss=0.2897, simple_loss=0.3423, pruned_loss=0.1186, over 21582.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3605, pruned_loss=0.1208, over 4256788.39 frames. ], batch size: 548, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:43:58,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=233670.0, ans=0.2 2023-06-18 14:44:57,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=233910.0, ans=0.015 2023-06-18 14:45:01,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=233910.0, ans=0.125 2023-06-18 14:45:22,562 INFO [train.py:996] (1/4) Epoch 2, batch 8500, loss[loss=0.2725, simple_loss=0.3214, pruned_loss=0.1118, over 21576.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3564, pruned_loss=0.1222, over 4265119.59 frames. ], batch size: 263, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:45:22,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=233970.0, ans=0.125 2023-06-18 14:45:35,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=233970.0, ans=0.125 2023-06-18 14:45:36,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233970.0, ans=0.1 2023-06-18 14:45:51,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-18 14:45:59,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=234090.0, ans=0.95 2023-06-18 14:45:59,124 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.651e-02 2023-06-18 14:46:56,926 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.316e+02 3.972e+02 4.532e+02 9.950e+02, threshold=7.945e+02, percent-clipped=2.0 2023-06-18 14:47:05,331 INFO [train.py:996] (1/4) Epoch 2, batch 8550, loss[loss=0.2989, simple_loss=0.3547, pruned_loss=0.1216, over 21625.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3612, pruned_loss=0.1252, over 4273802.62 frames. ], batch size: 230, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:47:11,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=234270.0, ans=0.2 2023-06-18 14:47:35,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=234390.0, ans=0.2 2023-06-18 14:47:50,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. 
limit=15.0 2023-06-18 14:48:06,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-18 14:48:41,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-18 14:48:43,745 INFO [train.py:996] (1/4) Epoch 2, batch 8600, loss[loss=0.3741, simple_loss=0.4184, pruned_loss=0.1649, over 21359.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3704, pruned_loss=0.1305, over 4270468.04 frames. ], batch size: 548, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:49:04,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=234630.0, ans=0.125 2023-06-18 14:49:17,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=234690.0, ans=0.125 2023-06-18 14:49:58,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2023-06-18 14:50:17,540 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.454e+02 4.150e+02 5.051e+02 9.343e+02, threshold=8.300e+02, percent-clipped=1.0 2023-06-18 14:50:18,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=234810.0, ans=0.125 2023-06-18 14:50:20,564 INFO [train.py:996] (1/4) Epoch 2, batch 8650, loss[loss=0.3071, simple_loss=0.3718, pruned_loss=0.1212, over 21626.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3788, pruned_loss=0.1321, over 4273971.87 frames. ], batch size: 263, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:50:37,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234930.0, ans=0.1 2023-06-18 14:50:49,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-18 14:50:57,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=234990.0, ans=0.0 2023-06-18 14:51:55,564 INFO [train.py:996] (1/4) Epoch 2, batch 8700, loss[loss=0.2643, simple_loss=0.3141, pruned_loss=0.1073, over 21747.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3689, pruned_loss=0.1262, over 4274820.39 frames. ], batch size: 124, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:52:05,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-18 14:52:45,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235350.0, ans=0.0 2023-06-18 14:53:29,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.324e+02 3.894e+02 5.284e+02 1.235e+03, threshold=7.788e+02, percent-clipped=5.0 2023-06-18 14:53:32,120 INFO [train.py:996] (1/4) Epoch 2, batch 8750, loss[loss=0.3309, simple_loss=0.3662, pruned_loss=0.1478, over 21453.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3661, pruned_loss=0.1273, over 4278437.63 frames. 
], batch size: 177, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:55:09,964 INFO [train.py:996] (1/4) Epoch 2, batch 8800, loss[loss=0.3282, simple_loss=0.4178, pruned_loss=0.1193, over 21769.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3746, pruned_loss=0.1308, over 4278263.29 frames. ], batch size: 332, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:55:32,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235830.0, ans=0.1 2023-06-18 14:56:36,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=236010.0, ans=0.125 2023-06-18 14:56:45,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.886e+02 4.865e+02 6.860e+02 1.473e+03, threshold=9.729e+02, percent-clipped=14.0 2023-06-18 14:56:48,264 INFO [train.py:996] (1/4) Epoch 2, batch 8850, loss[loss=0.2954, simple_loss=0.3809, pruned_loss=0.1049, over 21416.00 frames. ], tot_loss[loss=0.3242, simple_loss=0.3812, pruned_loss=0.1336, over 4281526.27 frames. ], batch size: 194, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:56:48,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=236070.0, ans=0.125 2023-06-18 14:57:04,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-18 14:57:11,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=236130.0, ans=0.125 2023-06-18 14:57:23,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=236190.0, ans=0.125 2023-06-18 14:58:12,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=236310.0, ans=0.0 2023-06-18 14:58:26,530 INFO [train.py:996] (1/4) Epoch 2, batch 8900, loss[loss=0.29, simple_loss=0.3398, pruned_loss=0.1201, over 21772.00 frames. ], tot_loss[loss=0.3202, simple_loss=0.3755, pruned_loss=0.1324, over 4270220.04 frames. ], batch size: 371, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:59:42,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-18 14:59:44,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236550.0, ans=0.1 2023-06-18 14:59:55,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=236610.0, ans=0.2 2023-06-18 15:00:03,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.292e+02 4.166e+02 5.426e+02 1.146e+03, threshold=8.333e+02, percent-clipped=5.0 2023-06-18 15:00:05,988 INFO [train.py:996] (1/4) Epoch 2, batch 8950, loss[loss=0.2827, simple_loss=0.3311, pruned_loss=0.1171, over 21842.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3731, pruned_loss=0.1297, over 4268808.19 frames. 
], batch size: 98, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:00:06,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=236670.0, ans=0.2 2023-06-18 15:00:25,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=236730.0, ans=0.1 2023-06-18 15:00:45,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=236730.0, ans=0.0 2023-06-18 15:00:49,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=236790.0, ans=0.0 2023-06-18 15:00:51,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=236790.0, ans=0.125 2023-06-18 15:01:16,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=236850.0, ans=0.2 2023-06-18 15:01:37,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=236910.0, ans=0.2 2023-06-18 15:01:42,178 INFO [train.py:996] (1/4) Epoch 2, batch 9000, loss[loss=0.2745, simple_loss=0.3199, pruned_loss=0.1146, over 21746.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3656, pruned_loss=0.128, over 4272126.32 frames. ], batch size: 112, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:01:42,179 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 15:02:02,152 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2979, simple_loss=0.3967, pruned_loss=0.09958, over 1796401.00 frames. 2023-06-18 15:02:02,153 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 15:02:05,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=236970.0, ans=0.0 2023-06-18 15:02:49,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.87 vs. limit=22.5 2023-06-18 15:02:57,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=237090.0, ans=6.0 2023-06-18 15:03:36,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 3.310e+02 4.099e+02 5.036e+02 9.465e+02, threshold=8.198e+02, percent-clipped=3.0 2023-06-18 15:03:39,407 INFO [train.py:996] (1/4) Epoch 2, batch 9050, loss[loss=0.3268, simple_loss=0.3819, pruned_loss=0.1358, over 21310.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3603, pruned_loss=0.1237, over 4277856.90 frames. ], batch size: 159, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:04:00,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=237270.0, ans=0.125 2023-06-18 15:04:35,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=237390.0, ans=0.2 2023-06-18 15:05:23,356 INFO [train.py:996] (1/4) Epoch 2, batch 9100, loss[loss=0.3096, simple_loss=0.3888, pruned_loss=0.1152, over 21795.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3672, pruned_loss=0.1278, over 4273308.07 frames. 
], batch size: 371, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:05:42,793 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:05:58,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=237690.0, ans=0.125 2023-06-18 15:06:12,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=237690.0, ans=0.0 2023-06-18 15:06:20,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=237750.0, ans=0.125 2023-06-18 15:06:58,936 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 3.002e+02 3.899e+02 5.912e+02 1.285e+03, threshold=7.799e+02, percent-clipped=7.0 2023-06-18 15:07:05,465 INFO [train.py:996] (1/4) Epoch 2, batch 9150, loss[loss=0.2662, simple_loss=0.3398, pruned_loss=0.09626, over 21330.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3679, pruned_loss=0.1228, over 4269624.02 frames. ], batch size: 159, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:07:23,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=237930.0, ans=0.2 2023-06-18 15:07:40,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=237990.0, ans=0.125 2023-06-18 15:07:46,910 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:08:43,031 INFO [train.py:996] (1/4) Epoch 2, batch 9200, loss[loss=0.3308, simple_loss=0.3827, pruned_loss=0.1395, over 21305.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3705, pruned_loss=0.122, over 4266825.34 frames. ], batch size: 159, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:10:17,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.265e+02 3.893e+02 4.706e+02 1.094e+03, threshold=7.786e+02, percent-clipped=2.0 2023-06-18 15:10:18,916 INFO [train.py:996] (1/4) Epoch 2, batch 9250, loss[loss=0.3361, simple_loss=0.3809, pruned_loss=0.1457, over 21138.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3734, pruned_loss=0.1268, over 4270880.42 frames. ], batch size: 143, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:10:31,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=238470.0, ans=0.2 2023-06-18 15:11:28,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=238650.0, ans=0.0 2023-06-18 15:11:59,716 INFO [train.py:996] (1/4) Epoch 2, batch 9300, loss[loss=0.3224, simple_loss=0.3551, pruned_loss=0.1449, over 21182.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.368, pruned_loss=0.1277, over 4265430.10 frames. 
], batch size: 608, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:12:03,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=238770.0, ans=0.025 2023-06-18 15:12:14,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=238830.0, ans=0.1 2023-06-18 15:13:03,428 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:13:03,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. limit=6.0 2023-06-18 15:13:12,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=238950.0, ans=0.125 2023-06-18 15:13:18,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=238950.0, ans=0.125 2023-06-18 15:13:27,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=239010.0, ans=0.125 2023-06-18 15:13:37,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.726e+02 4.567e+02 5.377e+02 1.117e+03, threshold=9.135e+02, percent-clipped=5.0 2023-06-18 15:13:38,728 INFO [train.py:996] (1/4) Epoch 2, batch 9350, loss[loss=0.3495, simple_loss=0.419, pruned_loss=0.14, over 21246.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3782, pruned_loss=0.1299, over 4264796.94 frames. ], batch size: 548, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:14:07,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=239130.0, ans=0.125 2023-06-18 15:14:19,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239190.0, ans=0.1 2023-06-18 15:14:44,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=239250.0, ans=0.0 2023-06-18 15:14:59,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=8.0 2023-06-18 15:15:17,460 INFO [train.py:996] (1/4) Epoch 2, batch 9400, loss[loss=0.2995, simple_loss=0.3424, pruned_loss=0.1283, over 21700.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3795, pruned_loss=0.1312, over 4263351.52 frames. ], batch size: 333, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:15:50,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=239430.0, ans=0.0 2023-06-18 15:16:36,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=239550.0, ans=0.125 2023-06-18 15:16:47,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=239610.0, ans=0.125 2023-06-18 15:16:53,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.296e+02 4.208e+02 5.207e+02 1.060e+03, threshold=8.416e+02, percent-clipped=2.0 2023-06-18 15:16:54,627 INFO [train.py:996] (1/4) Epoch 2, batch 9450, loss[loss=0.2873, simple_loss=0.3268, pruned_loss=0.1239, over 21430.00 frames. 
], tot_loss[loss=0.315, simple_loss=0.3696, pruned_loss=0.1302, over 4272180.68 frames. ], batch size: 389, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:17:12,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=239670.0, ans=0.025 2023-06-18 15:17:12,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239670.0, ans=0.1 2023-06-18 15:17:25,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=239730.0, ans=0.0 2023-06-18 15:18:14,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=239910.0, ans=0.125 2023-06-18 15:18:19,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-18 15:18:31,545 INFO [train.py:996] (1/4) Epoch 2, batch 9500, loss[loss=0.2327, simple_loss=0.3029, pruned_loss=0.08121, over 21490.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3625, pruned_loss=0.1273, over 4267195.15 frames. ], batch size: 212, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:18:39,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-18 15:18:46,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=239970.0, ans=0.0 2023-06-18 15:18:48,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-18 15:18:52,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=240030.0, ans=0.2 2023-06-18 15:18:52,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=240030.0, ans=0.1 2023-06-18 15:19:01,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-18 15:19:22,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=240090.0, ans=0.125 2023-06-18 15:19:28,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=240090.0, ans=0.125 2023-06-18 15:19:28,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=240090.0, ans=0.125 2023-06-18 15:19:37,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=240150.0, ans=0.125 2023-06-18 15:20:02,352 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.477e+02 4.438e+02 5.411e+02 9.373e+02, threshold=8.876e+02, percent-clipped=3.0 2023-06-18 15:20:04,079 INFO [train.py:996] (1/4) Epoch 2, batch 9550, loss[loss=0.3399, simple_loss=0.4058, pruned_loss=0.137, over 21812.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3681, pruned_loss=0.1295, over 4261402.73 frames. 
], batch size: 247, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:20:26,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=240270.0, ans=0.125 2023-06-18 15:20:26,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=240270.0, ans=0.0 2023-06-18 15:20:49,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=240330.0, ans=0.125 2023-06-18 15:21:06,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=240390.0, ans=0.0 2023-06-18 15:21:22,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=240450.0, ans=0.0 2023-06-18 15:21:22,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=240450.0, ans=0.125 2023-06-18 15:21:40,117 INFO [train.py:996] (1/4) Epoch 2, batch 9600, loss[loss=0.4073, simple_loss=0.4189, pruned_loss=0.1979, over 21783.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3708, pruned_loss=0.1325, over 4265508.02 frames. ], batch size: 508, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:21:51,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=240570.0, ans=0.125 2023-06-18 15:22:20,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=240630.0, ans=0.2 2023-06-18 15:22:26,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=240690.0, ans=0.0 2023-06-18 15:22:26,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2023-06-18 15:22:35,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=240690.0, ans=0.0 2023-06-18 15:23:16,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.171e+02 3.689e+02 4.506e+02 8.293e+02, threshold=7.377e+02, percent-clipped=0.0 2023-06-18 15:23:18,115 INFO [train.py:996] (1/4) Epoch 2, batch 9650, loss[loss=0.3415, simple_loss=0.3844, pruned_loss=0.1493, over 21823.00 frames. ], tot_loss[loss=0.3156, simple_loss=0.3689, pruned_loss=0.1311, over 4268807.61 frames. ], batch size: 247, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:24:20,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=241050.0, ans=0.125 2023-06-18 15:25:00,594 INFO [train.py:996] (1/4) Epoch 2, batch 9700, loss[loss=0.2841, simple_loss=0.3556, pruned_loss=0.1063, over 21475.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3725, pruned_loss=0.1308, over 4273497.49 frames. ], batch size: 548, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:25:23,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=241170.0, ans=0.125 2023-06-18 15:25:30,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. 
limit=15.0 2023-06-18 15:25:43,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=241290.0, ans=0.125 2023-06-18 15:26:06,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=241350.0, ans=0.2 2023-06-18 15:26:35,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.207e+02 3.701e+02 4.556e+02 8.027e+02, threshold=7.401e+02, percent-clipped=3.0 2023-06-18 15:26:36,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=241470.0, ans=0.1 2023-06-18 15:26:37,172 INFO [train.py:996] (1/4) Epoch 2, batch 9750, loss[loss=0.261, simple_loss=0.3011, pruned_loss=0.1104, over 21216.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3668, pruned_loss=0.1291, over 4274468.50 frames. ], batch size: 548, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:26:47,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=241470.0, ans=0.0 2023-06-18 15:27:49,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-06-18 15:28:08,542 INFO [train.py:996] (1/4) Epoch 2, batch 9800, loss[loss=0.3115, simple_loss=0.3618, pruned_loss=0.1306, over 21958.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3649, pruned_loss=0.1287, over 4258303.55 frames. ], batch size: 351, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:28:27,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=241830.0, ans=0.125 2023-06-18 15:28:44,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=241830.0, ans=0.125 2023-06-18 15:29:12,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=241950.0, ans=0.125 2023-06-18 15:29:37,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=242010.0, ans=0.0 2023-06-18 15:29:43,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.313e+02 4.009e+02 5.228e+02 9.511e+02, threshold=8.018e+02, percent-clipped=4.0 2023-06-18 15:29:45,205 INFO [train.py:996] (1/4) Epoch 2, batch 9850, loss[loss=0.2679, simple_loss=0.3131, pruned_loss=0.1114, over 21798.00 frames. ], tot_loss[loss=0.309, simple_loss=0.3604, pruned_loss=0.1288, over 4265363.13 frames. ], batch size: 351, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:29:56,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=242070.0, ans=0.2 2023-06-18 15:30:30,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=242190.0, ans=0.0 2023-06-18 15:30:49,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=242250.0, ans=0.125 2023-06-18 15:31:22,215 INFO [train.py:996] (1/4) Epoch 2, batch 9900, loss[loss=0.3357, simple_loss=0.3802, pruned_loss=0.1456, over 21320.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3586, pruned_loss=0.1281, over 4249105.42 frames. 
], batch size: 471, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:31:31,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-18 15:31:38,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=242370.0, ans=0.2 2023-06-18 15:31:38,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0 2023-06-18 15:31:55,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=242430.0, ans=0.125 2023-06-18 15:33:02,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.487e+02 4.462e+02 5.702e+02 1.060e+03, threshold=8.923e+02, percent-clipped=2.0 2023-06-18 15:33:03,926 INFO [train.py:996] (1/4) Epoch 2, batch 9950, loss[loss=0.3407, simple_loss=0.3976, pruned_loss=0.1419, over 21786.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.3627, pruned_loss=0.1318, over 4260738.03 frames. ], batch size: 124, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:33:47,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-18 15:33:50,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=242790.0, ans=0.125 2023-06-18 15:34:04,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242850.0, ans=0.1 2023-06-18 15:34:14,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=22.5 2023-06-18 15:34:36,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-18 15:34:41,499 INFO [train.py:996] (1/4) Epoch 2, batch 10000, loss[loss=0.3005, simple_loss=0.3444, pruned_loss=0.1283, over 21626.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3564, pruned_loss=0.1283, over 4269918.29 frames. ], batch size: 298, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:35:13,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243030.0, ans=0.1 2023-06-18 15:36:04,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-18 15:36:14,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.362e+02 4.103e+02 5.165e+02 9.257e+02, threshold=8.205e+02, percent-clipped=2.0 2023-06-18 15:36:16,250 INFO [train.py:996] (1/4) Epoch 2, batch 10050, loss[loss=0.2971, simple_loss=0.3452, pruned_loss=0.1245, over 21654.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.356, pruned_loss=0.1271, over 4278774.39 frames. 
], batch size: 298, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:36:33,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=243270.0, ans=0.95 2023-06-18 15:36:45,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-18 15:36:48,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=243330.0, ans=0.125 2023-06-18 15:37:02,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-18 15:37:20,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-18 15:37:21,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=243450.0, ans=0.125 2023-06-18 15:37:37,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=243510.0, ans=0.125 2023-06-18 15:38:03,626 INFO [train.py:996] (1/4) Epoch 2, batch 10100, loss[loss=0.3381, simple_loss=0.3996, pruned_loss=0.1383, over 21639.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3524, pruned_loss=0.1245, over 4279502.46 frames. ], batch size: 414, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:38:16,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243570.0, ans=0.1 2023-06-18 15:38:28,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-06-18 15:39:39,567 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.265e+02 3.952e+02 5.116e+02 8.346e+02, threshold=7.904e+02, percent-clipped=1.0 2023-06-18 15:39:41,246 INFO [train.py:996] (1/4) Epoch 2, batch 10150, loss[loss=0.2757, simple_loss=0.3345, pruned_loss=0.1084, over 21568.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3607, pruned_loss=0.1286, over 4270803.38 frames. ], batch size: 230, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:39:43,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-18 15:39:58,994 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:40:27,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=243990.0, ans=0.125 2023-06-18 15:40:28,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.62 vs. 
limit=22.5 2023-06-18 15:40:30,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243990.0, ans=0.1 2023-06-18 15:40:44,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=244050.0, ans=0.125 2023-06-18 15:41:08,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-18 15:41:19,207 INFO [train.py:996] (1/4) Epoch 2, batch 10200, loss[loss=0.2378, simple_loss=0.3156, pruned_loss=0.07993, over 21686.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.358, pruned_loss=0.1246, over 4260595.50 frames. ], batch size: 247, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:41:21,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-18 15:42:17,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=244350.0, ans=0.0 2023-06-18 15:42:37,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=244350.0, ans=0.5 2023-06-18 15:42:54,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.941e+02 3.489e+02 4.418e+02 6.706e+02, threshold=6.977e+02, percent-clipped=0.0 2023-06-18 15:42:56,138 INFO [train.py:996] (1/4) Epoch 2, batch 10250, loss[loss=0.2145, simple_loss=0.3011, pruned_loss=0.06392, over 21449.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.353, pruned_loss=0.118, over 4260076.52 frames. ], batch size: 212, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:43:20,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244530.0, ans=0.1 2023-06-18 15:43:51,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=244650.0, ans=0.125 2023-06-18 15:43:53,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=244650.0, ans=0.0 2023-06-18 15:44:16,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-18 15:44:22,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-18 15:44:23,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=244710.0, ans=0.2 2023-06-18 15:44:28,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=244710.0, ans=0.0 2023-06-18 15:44:30,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=244710.0, ans=0.07 2023-06-18 15:44:34,302 INFO [train.py:996] (1/4) Epoch 2, batch 10300, loss[loss=0.2952, simple_loss=0.3652, pruned_loss=0.1126, over 21833.00 frames. ], tot_loss[loss=0.2968, simple_loss=0.3561, pruned_loss=0.1187, over 4264688.45 frames. 
], batch size: 282, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:44:39,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=244770.0, ans=0.025 2023-06-18 15:44:47,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=244770.0, ans=0.125 2023-06-18 15:44:57,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244830.0, ans=0.1 2023-06-18 15:44:58,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.94 vs. limit=22.5 2023-06-18 15:45:09,488 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:45:52,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=244950.0, ans=0.2 2023-06-18 15:46:17,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.346e+02 4.331e+02 5.577e+02 1.197e+03, threshold=8.662e+02, percent-clipped=10.0 2023-06-18 15:46:18,809 INFO [train.py:996] (1/4) Epoch 2, batch 10350, loss[loss=0.3223, simple_loss=0.3558, pruned_loss=0.1444, over 21113.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3583, pruned_loss=0.1197, over 4266112.79 frames. ], batch size: 608, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:46:52,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=245130.0, ans=0.125 2023-06-18 15:47:07,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245190.0, ans=0.125 2023-06-18 15:47:11,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=245190.0, ans=0.0 2023-06-18 15:47:45,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=245310.0, ans=0.0 2023-06-18 15:48:00,714 INFO [train.py:996] (1/4) Epoch 2, batch 10400, loss[loss=0.2298, simple_loss=0.2753, pruned_loss=0.09212, over 21176.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3497, pruned_loss=0.1162, over 4259724.26 frames. 
], batch size: 176, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:48:04,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=245370.0, ans=0.125 2023-06-18 15:48:10,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=245370.0, ans=0.2 2023-06-18 15:48:24,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=245430.0, ans=0.0 2023-06-18 15:49:04,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=245550.0, ans=0.0 2023-06-18 15:49:34,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=245610.0, ans=0.0 2023-06-18 15:49:40,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.447e+02 4.106e+02 4.896e+02 8.870e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-18 15:49:42,065 INFO [train.py:996] (1/4) Epoch 2, batch 10450, loss[loss=0.3036, simple_loss=0.3545, pruned_loss=0.1264, over 21127.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3567, pruned_loss=0.1221, over 4268303.11 frames. ], batch size: 143, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:49:46,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=245670.0, ans=0.125 2023-06-18 15:50:09,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=245730.0, ans=0.125 2023-06-18 15:51:19,826 INFO [train.py:996] (1/4) Epoch 2, batch 10500, loss[loss=0.2569, simple_loss=0.3006, pruned_loss=0.1066, over 21617.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3552, pruned_loss=0.1207, over 4266067.35 frames. ], batch size: 247, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:51:46,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-18 15:51:55,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246030.0, ans=0.1 2023-06-18 15:52:03,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-18 15:52:14,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=246090.0, ans=0.125 2023-06-18 15:52:40,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=246210.0, ans=0.1 2023-06-18 15:52:54,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.201e+02 3.705e+02 4.440e+02 6.098e+02, threshold=7.409e+02, percent-clipped=0.0 2023-06-18 15:52:55,941 INFO [train.py:996] (1/4) Epoch 2, batch 10550, loss[loss=0.2835, simple_loss=0.3336, pruned_loss=0.1168, over 22015.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3498, pruned_loss=0.1208, over 4273496.25 frames. ], batch size: 103, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:53:29,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-18 15:53:33,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=246330.0, ans=0.0 2023-06-18 15:53:35,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=246330.0, ans=0.0 2023-06-18 15:54:06,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=246450.0, ans=0.0 2023-06-18 15:54:06,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=246450.0, ans=0.0 2023-06-18 15:54:24,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=246510.0, ans=0.125 2023-06-18 15:54:32,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=246570.0, ans=22.5 2023-06-18 15:54:33,560 INFO [train.py:996] (1/4) Epoch 2, batch 10600, loss[loss=0.2199, simple_loss=0.2788, pruned_loss=0.08046, over 21806.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3447, pruned_loss=0.1178, over 4256802.59 frames. ], batch size: 118, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:54:35,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=246570.0, ans=0.0 2023-06-18 15:55:35,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=246690.0, ans=0.125 2023-06-18 15:55:37,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=246690.0, ans=0.0 2023-06-18 15:55:37,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-18 15:55:39,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=15.0 2023-06-18 15:55:47,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=246750.0, ans=0.0 2023-06-18 15:55:50,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=246750.0, ans=0.0 2023-06-18 15:56:22,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 3.234e+02 3.580e+02 4.539e+02 8.323e+02, threshold=7.159e+02, percent-clipped=4.0 2023-06-18 15:56:23,893 INFO [train.py:996] (1/4) Epoch 2, batch 10650, loss[loss=0.2195, simple_loss=0.2916, pruned_loss=0.07373, over 21669.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3465, pruned_loss=0.1162, over 4250269.76 frames. ], batch size: 247, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:56:40,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-18 15:56:43,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-18 15:58:01,627 INFO [train.py:996] (1/4) Epoch 2, batch 10700, loss[loss=0.2824, simple_loss=0.3443, pruned_loss=0.1103, over 21980.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.346, pruned_loss=0.1154, over 4255441.98 frames. 
], batch size: 317, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:58:15,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=247170.0, ans=0.0 2023-06-18 15:58:49,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247290.0, ans=0.1 2023-06-18 15:59:43,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.410e+02 4.130e+02 4.973e+02 8.640e+02, threshold=8.260e+02, percent-clipped=4.0 2023-06-18 15:59:44,842 INFO [train.py:996] (1/4) Epoch 2, batch 10750, loss[loss=0.3098, simple_loss=0.3923, pruned_loss=0.1136, over 21794.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3589, pruned_loss=0.1234, over 4261396.00 frames. ], batch size: 351, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:00:06,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=247530.0, ans=0.125 2023-06-18 16:00:30,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=247590.0, ans=0.5 2023-06-18 16:00:40,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-18 16:01:30,540 INFO [train.py:996] (1/4) Epoch 2, batch 10800, loss[loss=0.2981, simple_loss=0.3813, pruned_loss=0.1075, over 20706.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3666, pruned_loss=0.1259, over 4260032.93 frames. ], batch size: 607, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:02:00,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=247890.0, ans=0.035 2023-06-18 16:02:24,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=247890.0, ans=0.0 2023-06-18 16:02:33,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-18 16:02:40,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=247950.0, ans=0.2 2023-06-18 16:02:44,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=247950.0, ans=0.125 2023-06-18 16:02:48,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=248010.0, ans=0.0 2023-06-18 16:03:08,329 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.160e+02 3.815e+02 4.913e+02 8.496e+02, threshold=7.629e+02, percent-clipped=1.0 2023-06-18 16:03:08,349 INFO [train.py:996] (1/4) Epoch 2, batch 10850, loss[loss=0.2854, simple_loss=0.327, pruned_loss=0.1219, over 21838.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3678, pruned_loss=0.127, over 4262934.82 frames. ], batch size: 98, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:03:13,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=248070.0, ans=0.125 2023-06-18 16:03:29,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.95 vs. 
limit=22.5 2023-06-18 16:04:05,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=248250.0, ans=0.125 2023-06-18 16:04:40,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=248310.0, ans=0.125 2023-06-18 16:04:46,955 INFO [train.py:996] (1/4) Epoch 2, batch 10900, loss[loss=0.2606, simple_loss=0.3288, pruned_loss=0.09622, over 21235.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3598, pruned_loss=0.1243, over 4268045.08 frames. ], batch size: 159, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:04:53,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=248370.0, ans=0.0 2023-06-18 16:05:10,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=248430.0, ans=0.125 2023-06-18 16:05:46,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=248550.0, ans=0.0 2023-06-18 16:05:54,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.78 vs. limit=22.5 2023-06-18 16:06:16,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.12 vs. limit=6.0 2023-06-18 16:06:18,625 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.990e+02 3.670e+02 4.688e+02 1.000e+03, threshold=7.341e+02, percent-clipped=2.0 2023-06-18 16:06:18,645 INFO [train.py:996] (1/4) Epoch 2, batch 10950, loss[loss=0.2667, simple_loss=0.3225, pruned_loss=0.1055, over 21814.00 frames. ], tot_loss[loss=0.2982, simple_loss=0.3533, pruned_loss=0.1215, over 4259900.15 frames. ], batch size: 352, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:06:31,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=248670.0, ans=0.125 2023-06-18 16:06:39,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=248730.0, ans=0.1 2023-06-18 16:06:45,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-18 16:07:27,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-18 16:07:37,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=248850.0, ans=0.125 2023-06-18 16:07:55,482 INFO [train.py:996] (1/4) Epoch 2, batch 11000, loss[loss=0.2679, simple_loss=0.3243, pruned_loss=0.1058, over 21299.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3522, pruned_loss=0.122, over 4258592.45 frames. 
], batch size: 159, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:08:41,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=249090.0, ans=10.0 2023-06-18 16:09:00,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=249150.0, ans=0.125 2023-06-18 16:09:30,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=249210.0, ans=0.125 2023-06-18 16:09:32,784 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.446e+02 4.232e+02 5.447e+02 9.802e+02, threshold=8.463e+02, percent-clipped=9.0 2023-06-18 16:09:32,805 INFO [train.py:996] (1/4) Epoch 2, batch 11050, loss[loss=0.2755, simple_loss=0.3228, pruned_loss=0.1141, over 21807.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3517, pruned_loss=0.1243, over 4260640.20 frames. ], batch size: 112, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:09:53,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=249330.0, ans=0.035 2023-06-18 16:10:41,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-18 16:10:49,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-06-18 16:11:01,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=249510.0, ans=0.07 2023-06-18 16:11:10,287 INFO [train.py:996] (1/4) Epoch 2, batch 11100, loss[loss=0.3321, simple_loss=0.3681, pruned_loss=0.148, over 21970.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3506, pruned_loss=0.1251, over 4262450.16 frames. ], batch size: 103, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:11:21,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=249570.0, ans=0.0 2023-06-18 16:11:51,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249690.0, ans=0.125 2023-06-18 16:12:15,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-18 16:12:26,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=249750.0, ans=0.125 2023-06-18 16:12:41,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=249810.0, ans=0.0 2023-06-18 16:12:47,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.019e+02 3.669e+02 4.475e+02 9.197e+02, threshold=7.338e+02, percent-clipped=1.0 2023-06-18 16:12:47,671 INFO [train.py:996] (1/4) Epoch 2, batch 11150, loss[loss=0.2782, simple_loss=0.3427, pruned_loss=0.1068, over 21355.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3485, pruned_loss=0.1241, over 4249529.99 frames. 
], batch size: 131, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:12:58,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=249870.0, ans=0.1 2023-06-18 16:13:06,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=249930.0, ans=0.125 2023-06-18 16:13:07,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=249930.0, ans=0.125 2023-06-18 16:13:37,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=249990.0, ans=0.0 2023-06-18 16:14:02,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=250050.0, ans=0.125 2023-06-18 16:14:23,561 INFO [train.py:996] (1/4) Epoch 2, batch 11200, loss[loss=0.3256, simple_loss=0.4297, pruned_loss=0.1107, over 20812.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3483, pruned_loss=0.1235, over 4240636.69 frames. ], batch size: 608, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:14:30,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=12.0 2023-06-18 16:14:31,798 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.532e-01 2023-06-18 16:14:34,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=250170.0, ans=0.125 2023-06-18 16:14:52,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=250290.0, ans=0.0 2023-06-18 16:15:08,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=250290.0, ans=0.125 2023-06-18 16:15:09,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=250290.0, ans=0.0 2023-06-18 16:15:44,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=250410.0, ans=0.95 2023-06-18 16:15:46,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=250410.0, ans=6.0 2023-06-18 16:15:59,447 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.566e+02 4.194e+02 5.517e+02 1.156e+03, threshold=8.389e+02, percent-clipped=11.0 2023-06-18 16:15:59,467 INFO [train.py:996] (1/4) Epoch 2, batch 11250, loss[loss=0.2978, simple_loss=0.3685, pruned_loss=0.1135, over 21427.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3471, pruned_loss=0.1227, over 4243874.44 frames. 
], batch size: 131, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:16:06,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250470.0, ans=0.1 2023-06-18 16:16:10,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=250470.0, ans=0.1 2023-06-18 16:16:16,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=250530.0, ans=0.125 2023-06-18 16:16:57,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-18 16:17:03,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=250650.0, ans=0.125 2023-06-18 16:17:30,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=250710.0, ans=0.125 2023-06-18 16:17:36,313 INFO [train.py:996] (1/4) Epoch 2, batch 11300, loss[loss=0.2325, simple_loss=0.2979, pruned_loss=0.0835, over 21256.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3471, pruned_loss=0.1224, over 4248635.56 frames. ], batch size: 159, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:17:47,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=250770.0, ans=0.125 2023-06-18 16:17:56,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=250830.0, ans=0.125 2023-06-18 16:19:12,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=251070.0, ans=0.2 2023-06-18 16:19:13,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.183e+02 3.739e+02 4.623e+02 9.049e+02, threshold=7.478e+02, percent-clipped=1.0 2023-06-18 16:19:13,173 INFO [train.py:996] (1/4) Epoch 2, batch 11350, loss[loss=0.2733, simple_loss=0.3391, pruned_loss=0.1038, over 21202.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3466, pruned_loss=0.1202, over 4261496.85 frames. ], batch size: 176, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:19:14,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-18 16:20:26,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=251250.0, ans=0.125 2023-06-18 16:20:53,100 INFO [train.py:996] (1/4) Epoch 2, batch 11400, loss[loss=0.2733, simple_loss=0.3352, pruned_loss=0.1057, over 21277.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3527, pruned_loss=0.1226, over 4265100.30 frames. 
], batch size: 176, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:21:01,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=251370.0, ans=0.125 2023-06-18 16:21:03,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=251370.0, ans=0.125 2023-06-18 16:22:23,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=251610.0, ans=0.125 2023-06-18 16:22:28,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-18 16:22:34,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.522e+02 4.244e+02 5.675e+02 1.170e+03, threshold=8.488e+02, percent-clipped=5.0 2023-06-18 16:22:34,155 INFO [train.py:996] (1/4) Epoch 2, batch 11450, loss[loss=0.2848, simple_loss=0.3545, pruned_loss=0.1076, over 21639.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3543, pruned_loss=0.1216, over 4266551.26 frames. ], batch size: 263, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:22:55,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=251730.0, ans=0.125 2023-06-18 16:22:55,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=251730.0, ans=0.0 2023-06-18 16:23:14,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=251730.0, ans=0.5 2023-06-18 16:23:38,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=251790.0, ans=0.125 2023-06-18 16:24:13,562 INFO [train.py:996] (1/4) Epoch 2, batch 11500, loss[loss=0.2471, simple_loss=0.3203, pruned_loss=0.08692, over 21209.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3582, pruned_loss=0.1233, over 4276024.74 frames. ], batch size: 159, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:25:11,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252090.0, ans=0.1 2023-06-18 16:25:21,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252150.0, ans=0.1 2023-06-18 16:25:21,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-18 16:25:36,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.63 vs. limit=22.5 2023-06-18 16:25:40,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=252210.0, ans=0.125 2023-06-18 16:25:52,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.283e+02 4.041e+02 4.776e+02 1.091e+03, threshold=8.082e+02, percent-clipped=3.0 2023-06-18 16:25:52,862 INFO [train.py:996] (1/4) Epoch 2, batch 11550, loss[loss=0.2852, simple_loss=0.3617, pruned_loss=0.1044, over 21726.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3644, pruned_loss=0.1233, over 4279244.43 frames. 
], batch size: 247, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:25:53,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=15.0 2023-06-18 16:26:44,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=252390.0, ans=0.125 2023-06-18 16:27:08,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=252450.0, ans=0.2 2023-06-18 16:27:48,276 INFO [train.py:996] (1/4) Epoch 2, batch 11600, loss[loss=0.2816, simple_loss=0.3612, pruned_loss=0.1011, over 21826.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3778, pruned_loss=0.1246, over 4264521.40 frames. ], batch size: 118, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:28:07,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-18 16:28:35,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=252750.0, ans=0.125 2023-06-18 16:28:36,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=252750.0, ans=0.125 2023-06-18 16:28:53,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=252810.0, ans=0.125 2023-06-18 16:29:15,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=252810.0, ans=0.1 2023-06-18 16:29:17,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=252810.0, ans=0.125 2023-06-18 16:29:25,170 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.666e+02 4.957e+02 6.331e+02 1.126e+03, threshold=9.914e+02, percent-clipped=8.0 2023-06-18 16:29:25,190 INFO [train.py:996] (1/4) Epoch 2, batch 11650, loss[loss=0.3466, simple_loss=0.4051, pruned_loss=0.144, over 21786.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3841, pruned_loss=0.1248, over 4273158.75 frames. ], batch size: 317, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:29:34,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=252870.0, ans=0.125 2023-06-18 16:29:45,272 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:31:02,378 INFO [train.py:996] (1/4) Epoch 2, batch 11700, loss[loss=0.293, simple_loss=0.3294, pruned_loss=0.1282, over 21288.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3757, pruned_loss=0.125, over 4272147.44 frames. ], batch size: 144, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:31:06,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=253170.0, ans=0.125 2023-06-18 16:31:14,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.44 vs. 
limit=12.0 2023-06-18 16:31:33,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=253290.0, ans=0.0 2023-06-18 16:31:50,236 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:31:51,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=253350.0, ans=0.0 2023-06-18 16:32:05,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=253350.0, ans=0.125 2023-06-18 16:32:25,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=253410.0, ans=0.2 2023-06-18 16:32:29,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-18 16:32:38,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.237e+02 3.960e+02 5.340e+02 1.578e+03, threshold=7.920e+02, percent-clipped=3.0 2023-06-18 16:32:38,138 INFO [train.py:996] (1/4) Epoch 2, batch 11750, loss[loss=0.3314, simple_loss=0.3544, pruned_loss=0.1542, over 21442.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3662, pruned_loss=0.1252, over 4262854.44 frames. ], batch size: 476, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:32:56,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-18 16:32:58,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=253530.0, ans=0.0 2023-06-18 16:33:18,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=253590.0, ans=0.0 2023-06-18 16:33:18,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-18 16:34:13,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=253710.0, ans=10.0 2023-06-18 16:34:17,513 INFO [train.py:996] (1/4) Epoch 2, batch 11800, loss[loss=0.3242, simple_loss=0.3755, pruned_loss=0.1365, over 21530.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3694, pruned_loss=0.1294, over 4263232.62 frames. ], batch size: 389, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:34:22,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=253770.0, ans=0.2 2023-06-18 16:34:24,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=253770.0, ans=0.0 2023-06-18 16:34:27,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=253770.0, ans=0.125 2023-06-18 16:34:31,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253830.0, ans=0.1 2023-06-18 16:34:40,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-06-18 16:35:27,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=253950.0, ans=0.0 2023-06-18 16:35:28,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-18 16:35:57,245 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.227e+02 4.040e+02 5.078e+02 7.033e+02, threshold=8.080e+02, percent-clipped=0.0 2023-06-18 16:35:57,274 INFO [train.py:996] (1/4) Epoch 2, batch 11850, loss[loss=0.363, simple_loss=0.4273, pruned_loss=0.1494, over 21503.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3705, pruned_loss=0.1283, over 4265400.49 frames. ], batch size: 507, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:36:19,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=254130.0, ans=0.125 2023-06-18 16:36:25,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=254130.0, ans=0.2 2023-06-18 16:36:35,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=28.10 vs. limit=15.0 2023-06-18 16:37:35,363 INFO [train.py:996] (1/4) Epoch 2, batch 11900, loss[loss=0.2644, simple_loss=0.3363, pruned_loss=0.09624, over 21694.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3705, pruned_loss=0.1256, over 4270510.19 frames. ], batch size: 298, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:38:01,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=254430.0, ans=0.125 2023-06-18 16:39:01,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=254610.0, ans=0.125 2023-06-18 16:39:03,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=254610.0, ans=12.0 2023-06-18 16:39:13,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.213e+02 3.818e+02 4.903e+02 8.116e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 16:39:13,298 INFO [train.py:996] (1/4) Epoch 2, batch 11950, loss[loss=0.1884, simple_loss=0.2337, pruned_loss=0.07155, over 15951.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3699, pruned_loss=0.1207, over 4262968.26 frames. ], batch size: 60, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:39:17,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-18 16:39:33,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=254730.0, ans=0.0 2023-06-18 16:39:56,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=254790.0, ans=15.0 2023-06-18 16:40:32,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=254910.0, ans=0.125 2023-06-18 16:40:49,306 INFO [train.py:996] (1/4) Epoch 2, batch 12000, loss[loss=0.2677, simple_loss=0.3134, pruned_loss=0.1111, over 21488.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3632, pruned_loss=0.1187, over 4262239.59 frames. 
], batch size: 212, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:40:49,306 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 16:41:05,149 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2926, simple_loss=0.3848, pruned_loss=0.1002, over 1796401.00 frames. 2023-06-18 16:41:05,150 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 16:42:15,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-18 16:42:33,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=255210.0, ans=0.0 2023-06-18 16:42:42,375 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.604e+02 5.059e+02 6.079e+02 1.381e+03, threshold=1.012e+03, percent-clipped=10.0 2023-06-18 16:42:42,405 INFO [train.py:996] (1/4) Epoch 2, batch 12050, loss[loss=0.301, simple_loss=0.3358, pruned_loss=0.133, over 21681.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3599, pruned_loss=0.1208, over 4257354.72 frames. ], batch size: 247, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:42:50,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=255270.0, ans=0.07 2023-06-18 16:42:52,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-18 16:43:03,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=255330.0, ans=0.125 2023-06-18 16:43:32,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=255390.0, ans=0.0 2023-06-18 16:43:57,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=255450.0, ans=0.0 2023-06-18 16:44:07,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5 2023-06-18 16:44:15,706 INFO [train.py:996] (1/4) Epoch 2, batch 12100, loss[loss=0.3304, simple_loss=0.3899, pruned_loss=0.1355, over 21188.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3691, pruned_loss=0.1275, over 4266399.50 frames. 
], batch size: 176, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:44:16,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255570.0, ans=0.1 2023-06-18 16:44:17,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=255570.0, ans=0.2 2023-06-18 16:44:48,553 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:45:05,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=255690.0, ans=0.125 2023-06-18 16:45:21,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=255750.0, ans=0.0 2023-06-18 16:45:51,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255810.0, ans=0.1 2023-06-18 16:46:01,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.581e+02 3.947e+02 4.820e+02 5.748e+02 9.180e+02, threshold=9.640e+02, percent-clipped=0.0 2023-06-18 16:46:01,055 INFO [train.py:996] (1/4) Epoch 2, batch 12150, loss[loss=0.4267, simple_loss=0.4808, pruned_loss=0.1863, over 21486.00 frames. ], tot_loss[loss=0.3166, simple_loss=0.3738, pruned_loss=0.1297, over 4267495.89 frames. ], batch size: 507, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:46:07,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=255870.0, ans=0.1 2023-06-18 16:46:16,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. limit=5.0 2023-06-18 16:46:37,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=255930.0, ans=0.0 2023-06-18 16:46:41,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=255990.0, ans=0.2 2023-06-18 16:46:59,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=256050.0, ans=0.2 2023-06-18 16:47:02,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=256050.0, ans=0.0 2023-06-18 16:47:44,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256110.0, ans=0.1 2023-06-18 16:47:47,399 INFO [train.py:996] (1/4) Epoch 2, batch 12200, loss[loss=0.2614, simple_loss=0.3112, pruned_loss=0.1058, over 21636.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3707, pruned_loss=0.1282, over 4270979.07 frames. ], batch size: 298, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:48:13,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=256230.0, ans=0.125 2023-06-18 16:48:38,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=256350.0, ans=0.2 2023-06-18 16:49:24,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.962e+02 3.747e+02 4.961e+02 1.098e+03, threshold=7.494e+02, percent-clipped=1.0 2023-06-18 16:49:24,910 INFO [train.py:996] (1/4) Epoch 2, batch 12250, loss[loss=0.2247, simple_loss=0.2904, pruned_loss=0.07944, over 21219.00 frames. 
], tot_loss[loss=0.3033, simple_loss=0.3608, pruned_loss=0.1229, over 4271127.81 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:49:40,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=256530.0, ans=0.125 2023-06-18 16:49:53,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=256530.0, ans=0.125 2023-06-18 16:49:54,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=256590.0, ans=0.95 2023-06-18 16:49:56,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256590.0, ans=0.1 2023-06-18 16:50:13,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=256650.0, ans=0.125 2023-06-18 16:51:02,000 INFO [train.py:996] (1/4) Epoch 2, batch 12300, loss[loss=0.1881, simple_loss=0.2617, pruned_loss=0.05729, over 21672.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3486, pruned_loss=0.1134, over 4274728.89 frames. ], batch size: 230, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:51:24,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=256830.0, ans=0.0 2023-06-18 16:51:39,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=256890.0, ans=0.0 2023-06-18 16:51:57,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=256950.0, ans=0.0 2023-06-18 16:52:02,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-18 16:52:37,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.926e+02 3.592e+02 4.510e+02 1.066e+03, threshold=7.183e+02, percent-clipped=4.0 2023-06-18 16:52:38,011 INFO [train.py:996] (1/4) Epoch 2, batch 12350, loss[loss=0.294, simple_loss=0.3515, pruned_loss=0.1183, over 21800.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3479, pruned_loss=0.1109, over 4269066.45 frames. ], batch size: 298, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:52:47,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257070.0, ans=0.1 2023-06-18 16:53:01,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=257130.0, ans=0.125 2023-06-18 16:53:08,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.00 vs. 
limit=22.5 2023-06-18 16:53:23,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=257250.0, ans=0.125 2023-06-18 16:53:34,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=257250.0, ans=0.125 2023-06-18 16:53:49,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=257310.0, ans=0.0 2023-06-18 16:54:00,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=257310.0, ans=0.2 2023-06-18 16:54:09,386 INFO [train.py:996] (1/4) Epoch 2, batch 12400, loss[loss=0.3788, simple_loss=0.4047, pruned_loss=0.1764, over 21886.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3545, pruned_loss=0.1177, over 4278862.11 frames. ], batch size: 414, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:54:43,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=257490.0, ans=0.0 2023-06-18 16:54:57,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=257550.0, ans=0.125 2023-06-18 16:55:43,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.570e+02 4.078e+02 4.958e+02 7.763e+02, threshold=8.156e+02, percent-clipped=1.0 2023-06-18 16:55:43,230 INFO [train.py:996] (1/4) Epoch 2, batch 12450, loss[loss=0.3569, simple_loss=0.394, pruned_loss=0.1599, over 20818.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3593, pruned_loss=0.1229, over 4280489.10 frames. ], batch size: 608, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:56:20,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=257790.0, ans=0.125 2023-06-18 16:56:30,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=257790.0, ans=0.0 2023-06-18 16:56:56,030 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:57:20,742 INFO [train.py:996] (1/4) Epoch 2, batch 12500, loss[loss=0.382, simple_loss=0.4292, pruned_loss=0.1674, over 19896.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3705, pruned_loss=0.1271, over 4277969.85 frames. ], batch size: 703, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:57:21,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=257970.0, ans=0.0 2023-06-18 16:57:27,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=257970.0, ans=0.125 2023-06-18 16:57:28,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. 
limit=10.0 2023-06-18 16:57:37,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=258030.0, ans=0.125 2023-06-18 16:57:41,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=258030.0, ans=0.025 2023-06-18 16:58:26,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=258150.0, ans=0.125 2023-06-18 16:58:31,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=258150.0, ans=0.125 2023-06-18 16:58:48,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.41 vs. limit=6.0 2023-06-18 16:58:58,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.350e+02 3.939e+02 4.917e+02 9.519e+02, threshold=7.878e+02, percent-clipped=2.0 2023-06-18 16:58:58,500 INFO [train.py:996] (1/4) Epoch 2, batch 12550, loss[loss=0.3134, simple_loss=0.3726, pruned_loss=0.1271, over 21536.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3776, pruned_loss=0.1318, over 4277181.84 frames. ], batch size: 131, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:59:03,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=258270.0, ans=0.125 2023-06-18 16:59:32,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=258330.0, ans=0.1 2023-06-18 16:59:58,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=258390.0, ans=22.5 2023-06-18 17:00:11,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=258450.0, ans=0.2 2023-06-18 17:00:12,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=258450.0, ans=0.125 2023-06-18 17:00:13,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=258450.0, ans=0.125 2023-06-18 17:00:15,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=258450.0, ans=0.04949747468305833 2023-06-18 17:00:24,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=258510.0, ans=0.0 2023-06-18 17:00:36,874 INFO [train.py:996] (1/4) Epoch 2, batch 12600, loss[loss=0.2826, simple_loss=0.3705, pruned_loss=0.09731, over 21218.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3761, pruned_loss=0.1273, over 4268919.77 frames. ], batch size: 549, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:00:47,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258570.0, ans=0.1 2023-06-18 17:01:38,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.29 vs. 
limit=22.5 2023-06-18 17:01:50,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=258750.0, ans=0.125 2023-06-18 17:02:12,915 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.876e+02 3.494e+02 4.502e+02 7.452e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-18 17:02:12,935 INFO [train.py:996] (1/4) Epoch 2, batch 12650, loss[loss=0.2518, simple_loss=0.3317, pruned_loss=0.08597, over 20847.00 frames. ], tot_loss[loss=0.3035, simple_loss=0.3655, pruned_loss=0.1207, over 4266260.39 frames. ], batch size: 609, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:02:38,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=258930.0, ans=0.125 2023-06-18 17:02:55,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=258930.0, ans=0.02 2023-06-18 17:02:57,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=258990.0, ans=0.125 2023-06-18 17:03:00,194 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:03:14,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=259050.0, ans=0.0 2023-06-18 17:03:17,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259050.0, ans=0.1 2023-06-18 17:03:20,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=259050.0, ans=0.125 2023-06-18 17:03:39,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259110.0, ans=0.1 2023-06-18 17:03:49,444 INFO [train.py:996] (1/4) Epoch 2, batch 12700, loss[loss=0.3405, simple_loss=0.3854, pruned_loss=0.1478, over 21323.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3652, pruned_loss=0.1234, over 4272256.21 frames. ], batch size: 159, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:03:51,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-18 17:03:59,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=259170.0, ans=0.0 2023-06-18 17:05:25,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.415e+02 4.071e+02 4.986e+02 8.988e+02, threshold=8.142e+02, percent-clipped=6.0 2023-06-18 17:05:25,078 INFO [train.py:996] (1/4) Epoch 2, batch 12750, loss[loss=0.2728, simple_loss=0.3503, pruned_loss=0.09768, over 21674.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3692, pruned_loss=0.1259, over 4272266.60 frames. 
], batch size: 247, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:05:41,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=259470.0, ans=0.2 2023-06-18 17:06:09,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=259530.0, ans=0.2 2023-06-18 17:06:09,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=259530.0, ans=0.125 2023-06-18 17:06:31,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=259650.0, ans=0.125 2023-06-18 17:06:57,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.21 vs. limit=10.0 2023-06-18 17:07:08,566 INFO [train.py:996] (1/4) Epoch 2, batch 12800, loss[loss=0.3673, simple_loss=0.4023, pruned_loss=0.1662, over 21482.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3701, pruned_loss=0.128, over 4269195.52 frames. ], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:07:11,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-18 17:07:15,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=259770.0, ans=0.1 2023-06-18 17:07:28,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-18 17:07:35,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-18 17:07:41,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.98 vs. limit=10.0 2023-06-18 17:08:01,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=259890.0, ans=0.1 2023-06-18 17:08:17,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259950.0, ans=0.125 2023-06-18 17:08:18,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.73 vs. limit=22.5 2023-06-18 17:08:46,727 INFO [train.py:996] (1/4) Epoch 2, batch 12850, loss[loss=0.2549, simple_loss=0.3288, pruned_loss=0.09053, over 21288.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3724, pruned_loss=0.1296, over 4275669.66 frames. ], batch size: 176, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:08:48,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 3.179e+02 3.774e+02 4.668e+02 7.829e+02, threshold=7.547e+02, percent-clipped=0.0 2023-06-18 17:08:59,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=260070.0, ans=0.125 2023-06-18 17:10:35,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. 
limit=15.0 2023-06-18 17:10:37,510 INFO [train.py:996] (1/4) Epoch 2, batch 12900, loss[loss=0.2938, simple_loss=0.3731, pruned_loss=0.1073, over 21686.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3723, pruned_loss=0.1256, over 4272167.76 frames. ], batch size: 414, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:10:58,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=260430.0, ans=0.125 2023-06-18 17:11:21,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=260490.0, ans=0.2 2023-06-18 17:11:22,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-18 17:11:23,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=260490.0, ans=0.2 2023-06-18 17:12:15,682 INFO [train.py:996] (1/4) Epoch 2, batch 12950, loss[loss=0.2469, simple_loss=0.3194, pruned_loss=0.0872, over 21724.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3705, pruned_loss=0.1238, over 4275470.45 frames. ], batch size: 332, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:12:17,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 3.080e+02 3.556e+02 4.378e+02 7.837e+02, threshold=7.111e+02, percent-clipped=1.0 2023-06-18 17:13:13,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=260850.0, ans=0.0 2023-06-18 17:13:14,348 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=12.0 2023-06-18 17:13:15,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=260850.0, ans=0.125 2023-06-18 17:13:51,817 INFO [train.py:996] (1/4) Epoch 2, batch 13000, loss[loss=0.2197, simple_loss=0.2942, pruned_loss=0.07264, over 21567.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3688, pruned_loss=0.1231, over 4276932.22 frames. ], batch size: 230, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:14:44,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=261090.0, ans=0.0 2023-06-18 17:14:44,805 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:15:24,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=261210.0, ans=0.0 2023-06-18 17:15:27,567 INFO [train.py:996] (1/4) Epoch 2, batch 13050, loss[loss=0.29, simple_loss=0.3479, pruned_loss=0.116, over 21906.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3624, pruned_loss=0.1203, over 4274578.70 frames. 
], batch size: 124, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:15:29,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.088e+02 4.287e+02 5.215e+02 1.044e+03, threshold=8.575e+02, percent-clipped=6.0 2023-06-18 17:15:34,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=261270.0, ans=0.2 2023-06-18 17:15:41,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=261330.0, ans=0.2 2023-06-18 17:15:51,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.67 vs. limit=22.5 2023-06-18 17:16:35,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-18 17:17:04,248 INFO [train.py:996] (1/4) Epoch 2, batch 13100, loss[loss=0.2973, simple_loss=0.3399, pruned_loss=0.1273, over 19988.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3616, pruned_loss=0.1196, over 4280573.23 frames. ], batch size: 702, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:17:50,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=261690.0, ans=0.0 2023-06-18 17:18:11,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=261750.0, ans=0.2 2023-06-18 17:18:17,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=261750.0, ans=0.125 2023-06-18 17:18:19,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=261750.0, ans=0.125 2023-06-18 17:18:36,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=261810.0, ans=0.2 2023-06-18 17:18:44,165 INFO [train.py:996] (1/4) Epoch 2, batch 13150, loss[loss=0.3418, simple_loss=0.3883, pruned_loss=0.1477, over 21394.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.365, pruned_loss=0.1243, over 4283325.05 frames. ], batch size: 471, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:18:45,941 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.617e+02 4.529e+02 5.724e+02 9.376e+02, threshold=9.058e+02, percent-clipped=0.0 2023-06-18 17:18:50,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-18 17:19:02,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=261870.0, ans=0.125 2023-06-18 17:19:51,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-18 17:19:58,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=262050.0, ans=0.125 2023-06-18 17:20:10,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.48 vs. limit=22.5 2023-06-18 17:20:18,144 INFO [train.py:996] (1/4) Epoch 2, batch 13200, loss[loss=0.3597, simple_loss=0.4034, pruned_loss=0.158, over 21559.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3645, pruned_loss=0.124, over 4280636.85 frames. 
], batch size: 414, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:20:41,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-18 17:20:48,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-18 17:20:51,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262230.0, ans=0.1 2023-06-18 17:21:21,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=262290.0, ans=10.0 2023-06-18 17:21:59,582 INFO [train.py:996] (1/4) Epoch 2, batch 13250, loss[loss=0.2829, simple_loss=0.3597, pruned_loss=0.1031, over 21605.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3644, pruned_loss=0.1251, over 4281715.36 frames. ], batch size: 263, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:22:06,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.187e+02 3.809e+02 4.597e+02 7.682e+02, threshold=7.618e+02, percent-clipped=1.0 2023-06-18 17:22:28,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=262530.0, ans=0.125 2023-06-18 17:22:37,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-18 17:22:45,255 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:23:04,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=262650.0, ans=0.1 2023-06-18 17:23:14,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-18 17:23:21,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-18 17:23:22,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=262710.0, ans=15.0 2023-06-18 17:23:43,643 INFO [train.py:996] (1/4) Epoch 2, batch 13300, loss[loss=0.2773, simple_loss=0.351, pruned_loss=0.1018, over 21958.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.366, pruned_loss=0.1234, over 4278727.09 frames. ], batch size: 317, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:24:33,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262890.0, ans=0.1 2023-06-18 17:25:10,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=263010.0, ans=0.04949747468305833 2023-06-18 17:25:15,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-18 17:25:25,052 INFO [train.py:996] (1/4) Epoch 2, batch 13350, loss[loss=0.3345, simple_loss=0.4203, pruned_loss=0.1243, over 20804.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3709, pruned_loss=0.1271, over 4278533.62 frames. 
], batch size: 607, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:25:26,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.186e+02 3.887e+02 4.954e+02 1.112e+03, threshold=7.774e+02, percent-clipped=6.0 2023-06-18 17:25:58,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=263130.0, ans=0.125 2023-06-18 17:26:03,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=263190.0, ans=0.0 2023-06-18 17:26:25,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=263250.0, ans=0.125 2023-06-18 17:26:34,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-18 17:27:08,164 INFO [train.py:996] (1/4) Epoch 2, batch 13400, loss[loss=0.2974, simple_loss=0.3535, pruned_loss=0.1206, over 21806.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3741, pruned_loss=0.1305, over 4282042.52 frames. ], batch size: 332, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:27:30,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=263430.0, ans=0.0 2023-06-18 17:27:35,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263430.0, ans=0.1 2023-06-18 17:27:50,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=263490.0, ans=0.2 2023-06-18 17:28:15,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=263550.0, ans=0.1 2023-06-18 17:28:42,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=263610.0, ans=0.125 2023-06-18 17:28:45,366 INFO [train.py:996] (1/4) Epoch 2, batch 13450, loss[loss=0.3227, simple_loss=0.3718, pruned_loss=0.1368, over 21810.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3782, pruned_loss=0.1342, over 4277369.10 frames. ], batch size: 371, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:28:46,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.497e+02 3.954e+02 4.827e+02 1.042e+03, threshold=7.908e+02, percent-clipped=7.0 2023-06-18 17:28:50,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=263670.0, ans=0.04949747468305833 2023-06-18 17:29:20,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=263790.0, ans=0.125 2023-06-18 17:29:21,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=263790.0, ans=0.0 2023-06-18 17:29:23,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=263790.0, ans=0.125 2023-06-18 17:29:29,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=263790.0, ans=0.125 2023-06-18 17:29:31,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.40 vs. 
limit=22.5 2023-06-18 17:29:40,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=263850.0, ans=0.95 2023-06-18 17:29:51,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=263850.0, ans=0.125 2023-06-18 17:30:09,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=263910.0, ans=0.0 2023-06-18 17:30:15,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=263910.0, ans=0.1 2023-06-18 17:30:23,554 INFO [train.py:996] (1/4) Epoch 2, batch 13500, loss[loss=0.2023, simple_loss=0.2467, pruned_loss=0.07895, over 16473.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3644, pruned_loss=0.1281, over 4273062.85 frames. ], batch size: 60, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:30:41,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=264030.0, ans=0.125 2023-06-18 17:31:07,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=264090.0, ans=0.04949747468305833 2023-06-18 17:31:09,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=264090.0, ans=0.0 2023-06-18 17:31:41,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.74 vs. limit=10.0 2023-06-18 17:31:48,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=264210.0, ans=0.125 2023-06-18 17:32:00,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-18 17:32:03,898 INFO [train.py:996] (1/4) Epoch 2, batch 13550, loss[loss=0.2867, simple_loss=0.3426, pruned_loss=0.1154, over 21874.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3699, pruned_loss=0.128, over 4278590.91 frames. ], batch size: 107, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:32:05,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.329e+02 4.198e+02 5.480e+02 1.124e+03, threshold=8.396e+02, percent-clipped=8.0 2023-06-18 17:32:07,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=264270.0, ans=0.0 2023-06-18 17:32:11,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=264270.0, ans=0.125 2023-06-18 17:32:15,117 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:33:42,107 INFO [train.py:996] (1/4) Epoch 2, batch 13600, loss[loss=0.3168, simple_loss=0.3786, pruned_loss=0.1275, over 21848.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3707, pruned_loss=0.1285, over 4281733.92 frames. 
], batch size: 351, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:34:29,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=264690.0, ans=0.125 2023-06-18 17:34:35,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=264690.0, ans=0.125 2023-06-18 17:35:19,302 INFO [train.py:996] (1/4) Epoch 2, batch 13650, loss[loss=0.3158, simple_loss=0.37, pruned_loss=0.1308, over 21495.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3641, pruned_loss=0.1236, over 4280691.51 frames. ], batch size: 508, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:35:20,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.001e+02 3.620e+02 4.450e+02 8.511e+02, threshold=7.240e+02, percent-clipped=1.0 2023-06-18 17:35:46,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=264930.0, ans=0.0 2023-06-18 17:36:05,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-18 17:36:16,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=264990.0, ans=0.125 2023-06-18 17:36:17,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=264990.0, ans=0.1 2023-06-18 17:36:32,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265050.0, ans=0.1 2023-06-18 17:36:59,385 INFO [train.py:996] (1/4) Epoch 2, batch 13700, loss[loss=0.2714, simple_loss=0.3366, pruned_loss=0.103, over 21739.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3575, pruned_loss=0.1232, over 4270644.00 frames. ], batch size: 298, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:37:22,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=265230.0, ans=0.0 2023-06-18 17:37:29,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=265230.0, ans=0.09899494936611666 2023-06-18 17:37:40,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=265290.0, ans=0.125 2023-06-18 17:37:49,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=265290.0, ans=0.04949747468305833 2023-06-18 17:38:13,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=265350.0, ans=0.125 2023-06-18 17:38:23,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=265410.0, ans=0.125 2023-06-18 17:38:28,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=265410.0, ans=0.125 2023-06-18 17:38:39,023 INFO [train.py:996] (1/4) Epoch 2, batch 13750, loss[loss=0.271, simple_loss=0.3339, pruned_loss=0.104, over 21778.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3539, pruned_loss=0.1219, over 4268101.06 frames. 
], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:38:45,113 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.651e+02 4.578e+02 5.768e+02 1.165e+03, threshold=9.156e+02, percent-clipped=11.0 2023-06-18 17:39:18,878 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:40:15,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=265710.0, ans=0.05 2023-06-18 17:40:23,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=265710.0, ans=0.0 2023-06-18 17:40:32,081 INFO [train.py:996] (1/4) Epoch 2, batch 13800, loss[loss=0.364, simple_loss=0.4447, pruned_loss=0.1417, over 21737.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3629, pruned_loss=0.1221, over 4267669.16 frames. ], batch size: 351, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:40:47,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=265770.0, ans=0.2 2023-06-18 17:41:04,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265830.0, ans=0.1 2023-06-18 17:41:15,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=265890.0, ans=0.1 2023-06-18 17:41:18,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=265890.0, ans=0.125 2023-06-18 17:41:24,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265890.0, ans=0.1 2023-06-18 17:41:26,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=265890.0, ans=0.2 2023-06-18 17:41:35,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=265950.0, ans=0.125 2023-06-18 17:41:48,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=266010.0, ans=0.125 2023-06-18 17:42:15,730 INFO [train.py:996] (1/4) Epoch 2, batch 13850, loss[loss=0.3372, simple_loss=0.3876, pruned_loss=0.1434, over 21826.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.371, pruned_loss=0.1247, over 4272719.99 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:42:17,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.095e+02 3.814e+02 4.955e+02 1.017e+03, threshold=7.628e+02, percent-clipped=1.0 2023-06-18 17:42:22,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266070.0, ans=0.1 2023-06-18 17:42:55,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=266190.0, ans=0.125 2023-06-18 17:43:48,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266310.0, ans=0.1 2023-06-18 17:43:49,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. 
limit=15.0 2023-06-18 17:43:50,429 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:43:50,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-18 17:43:52,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-18 17:43:53,100 INFO [train.py:996] (1/4) Epoch 2, batch 13900, loss[loss=0.3223, simple_loss=0.3857, pruned_loss=0.1294, over 16965.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3763, pruned_loss=0.1294, over 4271428.46 frames. ], batch size: 60, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:44:00,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=266370.0, ans=0.125 2023-06-18 17:44:22,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266430.0, ans=0.1 2023-06-18 17:45:23,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-18 17:45:35,911 INFO [train.py:996] (1/4) Epoch 2, batch 13950, loss[loss=0.3409, simple_loss=0.3906, pruned_loss=0.1456, over 21834.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.3789, pruned_loss=0.133, over 4283491.05 frames. ], batch size: 371, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:45:37,801 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.843e+02 4.662e+02 6.006e+02 1.294e+03, threshold=9.323e+02, percent-clipped=7.0 2023-06-18 17:45:49,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=266670.0, ans=0.125 2023-06-18 17:45:50,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266730.0, ans=0.125 2023-06-18 17:46:04,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=266730.0, ans=0.125 2023-06-18 17:46:06,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=266790.0, ans=0.125 2023-06-18 17:46:10,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-18 17:47:14,066 INFO [train.py:996] (1/4) Epoch 2, batch 14000, loss[loss=0.2656, simple_loss=0.3502, pruned_loss=0.0905, over 21825.00 frames. ], tot_loss[loss=0.3164, simple_loss=0.3736, pruned_loss=0.1296, over 4288999.84 frames. ], batch size: 316, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:47:19,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. 
limit=6.0 2023-06-18 17:47:25,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266970.0, ans=0.1 2023-06-18 17:48:01,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=267090.0, ans=0.125 2023-06-18 17:48:14,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=267150.0, ans=0.04949747468305833 2023-06-18 17:48:48,745 INFO [train.py:996] (1/4) Epoch 2, batch 14050, loss[loss=0.2837, simple_loss=0.3339, pruned_loss=0.1168, over 21880.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3645, pruned_loss=0.1231, over 4290532.89 frames. ], batch size: 107, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:48:50,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.887e+02 3.656e+02 4.389e+02 9.715e+02, threshold=7.312e+02, percent-clipped=1.0 2023-06-18 17:48:59,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267270.0, ans=0.1 2023-06-18 17:50:03,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=267510.0, ans=0.125 2023-06-18 17:50:23,576 INFO [train.py:996] (1/4) Epoch 2, batch 14100, loss[loss=0.372, simple_loss=0.4006, pruned_loss=0.1718, over 21633.00 frames. ], tot_loss[loss=0.3021, simple_loss=0.3576, pruned_loss=0.1233, over 4283253.48 frames. ], batch size: 441, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:50:28,939 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-18 17:50:32,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-18 17:50:44,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267630.0, ans=0.1 2023-06-18 17:51:18,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=267750.0, ans=0.95 2023-06-18 17:51:52,568 INFO [train.py:996] (1/4) Epoch 2, batch 14150, loss[loss=0.2595, simple_loss=0.3367, pruned_loss=0.09117, over 21411.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3594, pruned_loss=0.1233, over 4270629.23 frames. ], batch size: 211, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:51:59,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.602e+02 4.448e+02 5.500e+02 9.616e+02, threshold=8.896e+02, percent-clipped=7.0 2023-06-18 17:52:00,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=267870.0, ans=10.0 2023-06-18 17:52:08,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=267870.0, ans=0.1 2023-06-18 17:52:21,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=267930.0, ans=0.2 2023-06-18 17:52:45,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. 
limit=22.5 2023-06-18 17:53:19,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 17:53:21,316 INFO [train.py:996] (1/4) Epoch 2, batch 14200, loss[loss=0.2791, simple_loss=0.3339, pruned_loss=0.1121, over 21775.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3567, pruned_loss=0.1203, over 4276399.73 frames. ], batch size: 102, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:53:22,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-18 17:53:43,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=268170.0, ans=0.125 2023-06-18 17:53:43,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=268170.0, ans=0.0 2023-06-18 17:53:50,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=268230.0, ans=0.2 2023-06-18 17:54:54,774 INFO [train.py:996] (1/4) Epoch 2, batch 14250, loss[loss=0.2475, simple_loss=0.3181, pruned_loss=0.08845, over 21633.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3525, pruned_loss=0.1203, over 4272442.16 frames. ], batch size: 391, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:54:56,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 3.115e+02 4.292e+02 5.783e+02 1.043e+03, threshold=8.584e+02, percent-clipped=1.0 2023-06-18 17:55:16,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=268470.0, ans=0.0 2023-06-18 17:55:19,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=268530.0, ans=0.125 2023-06-18 17:55:45,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=268590.0, ans=0.2 2023-06-18 17:55:53,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=22.5 2023-06-18 17:56:11,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=268710.0, ans=0.125 2023-06-18 17:56:11,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-18 17:56:33,158 INFO [train.py:996] (1/4) Epoch 2, batch 14300, loss[loss=0.2348, simple_loss=0.3082, pruned_loss=0.08072, over 18303.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3568, pruned_loss=0.1195, over 4232673.89 frames. ], batch size: 71, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:56:52,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=268770.0, ans=0.0 2023-06-18 17:57:18,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=268890.0, ans=0.2 2023-06-18 17:57:50,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. 
limit=6.0 2023-06-18 17:58:01,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269010.0, ans=0.1 2023-06-18 17:58:09,475 INFO [train.py:996] (1/4) Epoch 2, batch 14350, loss[loss=0.3003, simple_loss=0.3597, pruned_loss=0.1205, over 21839.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3624, pruned_loss=0.1199, over 4224048.20 frames. ], batch size: 332, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:58:11,149 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.215e+02 4.287e+02 5.391e+02 1.265e+03, threshold=8.575e+02, percent-clipped=5.0 2023-06-18 17:58:41,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=269130.0, ans=22.5 2023-06-18 17:58:49,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=269190.0, ans=0.125 2023-06-18 17:59:50,460 INFO [train.py:996] (1/4) Epoch 2, batch 14400, loss[loss=0.316, simple_loss=0.3532, pruned_loss=0.1394, over 21815.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3617, pruned_loss=0.122, over 4227201.98 frames. ], batch size: 118, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:00:25,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-18 18:00:40,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269550.0, ans=0.1 2023-06-18 18:01:24,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=269670.0, ans=0.0 2023-06-18 18:01:25,574 INFO [train.py:996] (1/4) Epoch 2, batch 14450, loss[loss=0.3085, simple_loss=0.3546, pruned_loss=0.1313, over 21877.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3555, pruned_loss=0.1223, over 4241746.06 frames. ], batch size: 107, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:01:26,957 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 3.300e+02 3.933e+02 4.836e+02 8.413e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-18 18:01:45,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=269730.0, ans=0.0 2023-06-18 18:01:55,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=269730.0, ans=0.125 2023-06-18 18:02:17,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=269850.0, ans=0.1 2023-06-18 18:02:18,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=269850.0, ans=10.0 2023-06-18 18:02:25,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=269850.0, ans=0.2 2023-06-18 18:02:36,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269910.0, ans=0.1 2023-06-18 18:03:01,241 INFO [train.py:996] (1/4) Epoch 2, batch 14500, loss[loss=0.33, simple_loss=0.3642, pruned_loss=0.1479, over 21434.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3513, pruned_loss=0.1215, over 4257448.23 frames. 
], batch size: 508, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:04:43,607 INFO [train.py:996] (1/4) Epoch 2, batch 14550, loss[loss=0.3354, simple_loss=0.3913, pruned_loss=0.1398, over 21580.00 frames. ], tot_loss[loss=0.3029, simple_loss=0.3573, pruned_loss=0.1242, over 4267126.53 frames. ], batch size: 414, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:04:45,441 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.923e+02 3.262e+02 3.772e+02 5.838e+02, threshold=6.523e+02, percent-clipped=0.0 2023-06-18 18:05:37,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=270450.0, ans=0.125 2023-06-18 18:06:01,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-18 18:06:09,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=270510.0, ans=0.125 2023-06-18 18:06:12,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=270510.0, ans=0.125 2023-06-18 18:06:17,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.13 vs. limit=15.0 2023-06-18 18:06:21,021 INFO [train.py:996] (1/4) Epoch 2, batch 14600, loss[loss=0.3225, simple_loss=0.3803, pruned_loss=0.1323, over 21468.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3657, pruned_loss=0.129, over 4273929.65 frames. ], batch size: 131, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:07:50,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=270810.0, ans=0.1 2023-06-18 18:07:56,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-18 18:07:58,639 INFO [train.py:996] (1/4) Epoch 2, batch 14650, loss[loss=0.2802, simple_loss=0.3421, pruned_loss=0.1091, over 21746.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3642, pruned_loss=0.1255, over 4252783.73 frames. ], batch size: 112, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:08:00,045 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.284e+02 3.921e+02 4.845e+02 9.187e+02, threshold=7.842e+02, percent-clipped=12.0 2023-06-18 18:08:09,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=270870.0, ans=0.1 2023-06-18 18:08:11,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.27 vs. 
limit=15.0 2023-06-18 18:08:12,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=270930.0, ans=0.2 2023-06-18 18:08:26,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=270930.0, ans=0.0 2023-06-18 18:09:05,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=271050.0, ans=0.125 2023-06-18 18:09:15,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=271050.0, ans=0.2 2023-06-18 18:09:34,781 INFO [train.py:996] (1/4) Epoch 2, batch 14700, loss[loss=0.3356, simple_loss=0.4092, pruned_loss=0.131, over 21716.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3574, pruned_loss=0.1182, over 4257789.72 frames. ], batch size: 441, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:09:44,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=271170.0, ans=0.125 2023-06-18 18:09:45,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=271170.0, ans=0.0 2023-06-18 18:09:48,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=271230.0, ans=0.125 2023-06-18 18:10:01,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=271230.0, ans=0.125 2023-06-18 18:10:17,202 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:11:03,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-18 18:11:13,564 INFO [train.py:996] (1/4) Epoch 2, batch 14750, loss[loss=0.3509, simple_loss=0.4076, pruned_loss=0.1471, over 21454.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3667, pruned_loss=0.1225, over 4263193.75 frames. ], batch size: 131, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:11:15,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.023e+02 4.317e+02 6.398e+02 9.994e+02, threshold=8.633e+02, percent-clipped=9.0 2023-06-18 18:11:41,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=271530.0, ans=0.125 2023-06-18 18:11:58,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=271590.0, ans=0.05 2023-06-18 18:12:43,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271710.0, ans=0.1 2023-06-18 18:12:51,293 INFO [train.py:996] (1/4) Epoch 2, batch 14800, loss[loss=0.3399, simple_loss=0.3756, pruned_loss=0.1521, over 21379.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3774, pruned_loss=0.1301, over 4268793.89 frames. 
], batch size: 507, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:12:59,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=271770.0, ans=0.125 2023-06-18 18:13:25,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.47 vs. limit=15.0 2023-06-18 18:13:34,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=271830.0, ans=0.125 2023-06-18 18:14:08,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-18 18:14:22,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=272010.0, ans=0.2 2023-06-18 18:14:31,692 INFO [train.py:996] (1/4) Epoch 2, batch 14850, loss[loss=0.2485, simple_loss=0.2981, pruned_loss=0.09939, over 21232.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3681, pruned_loss=0.1284, over 4271153.68 frames. ], batch size: 176, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:14:33,115 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.362e+02 3.948e+02 5.141e+02 8.278e+02, threshold=7.896e+02, percent-clipped=0.0 2023-06-18 18:14:47,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272070.0, ans=0.1 2023-06-18 18:15:35,119 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:15:44,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=272250.0, ans=0.025 2023-06-18 18:15:47,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=272250.0, ans=0.1 2023-06-18 18:16:10,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=272310.0, ans=0.0 2023-06-18 18:16:11,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=272310.0, ans=0.125 2023-06-18 18:16:13,870 INFO [train.py:996] (1/4) Epoch 2, batch 14900, loss[loss=0.3292, simple_loss=0.377, pruned_loss=0.1407, over 21819.00 frames. ], tot_loss[loss=0.3184, simple_loss=0.3731, pruned_loss=0.1318, over 4270059.99 frames. 
], batch size: 247, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:16:24,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=272370.0, ans=0.125 2023-06-18 18:16:26,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=272370.0, ans=0.0 2023-06-18 18:16:40,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=272370.0, ans=0.125 2023-06-18 18:16:42,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=272370.0, ans=0.0 2023-06-18 18:17:04,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=272490.0, ans=0.025 2023-06-18 18:17:41,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=272610.0, ans=0.125 2023-06-18 18:18:07,143 INFO [train.py:996] (1/4) Epoch 2, batch 14950, loss[loss=0.3161, simple_loss=0.3806, pruned_loss=0.1258, over 20981.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.3745, pruned_loss=0.1319, over 4264758.68 frames. ], batch size: 608, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:18:08,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.099e+02 3.898e+02 5.275e+02 1.469e+03, threshold=7.796e+02, percent-clipped=9.0 2023-06-18 18:18:18,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=272670.0, ans=0.04949747468305833 2023-06-18 18:18:43,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=272790.0, ans=0.0 2023-06-18 18:18:59,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-18 18:19:12,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=272910.0, ans=0.035 2023-06-18 18:19:12,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=272910.0, ans=0.0 2023-06-18 18:19:13,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-18 18:19:33,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=272910.0, ans=0.125 2023-06-18 18:19:44,484 INFO [train.py:996] (1/4) Epoch 2, batch 15000, loss[loss=0.3437, simple_loss=0.3868, pruned_loss=0.1503, over 21787.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.3765, pruned_loss=0.1332, over 4271860.77 frames. ], batch size: 332, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:19:44,485 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 18:19:59,942 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2784, simple_loss=0.3732, pruned_loss=0.09186, over 1796401.00 frames. 
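Two relationships among the logged quantities can be read directly off the entries above. First, every logged loss field equals 0.5 * simple_loss + pruned_loss (e.g. 0.5 * 0.3868 + 0.1503 = 0.3437 for Epoch 2, batch 15000), i.e. a weighted sum of the two transducer loss terms. Second, in the optim.py lines the reported threshold equals Clipping_scale times the logged median grad-norm (e.g. 2.0 * 3.898e+02 = 7.796e+02). The Python below is only an illustrative sketch of those two arithmetic relations; the function names and signatures are invented for illustration, they are not code from train.py or optim.py, and the reading of percent-clipped is an assumption.

from typing import List
import statistics

def combined_batch_loss(simple_loss: float, pruned_loss: float,
                        simple_loss_scale: float = 0.5) -> float:
    # Weighted sum of the two transducer loss terms; reproduces the
    # "loss=..." fields in these log lines.
    return simple_loss_scale * simple_loss + pruned_loss

def clipping_threshold(recent_grad_norms: List[float],
                       clipping_scale: float = 2.0) -> float:
    # Assumed reading of the optim.py lines: the reported threshold is
    # clipping_scale times the median of recently observed gradient norms.
    return clipping_scale * statistics.median(recent_grad_norms)

# Epoch 2, batch 15000 (values copied from the entry above):
assert abs(combined_batch_loss(0.3868, 0.1503) - 0.3437) < 1e-3

# The five logged quartiles 2.449e+02 ... 1.469e+03 have median 3.898e+02,
# so the threshold comes out as 2.0 * 389.8, matching threshold=7.796e+02.
print(clipping_threshold([244.9, 309.9, 389.8, 527.5, 1469.0]))  # ~779.6

The "tot_loss ... over N frames" fields behave like frame-weighted running averages of the same per-batch quantities, and percent-clipped presumably counts how often the threshold was exceeded over a recent window, but neither aggregation window can be recovered from the log alone.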
2023-06-18 18:19:59,942 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 18:20:06,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=272970.0, ans=0.025 2023-06-18 18:20:28,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-18 18:20:48,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=273090.0, ans=0.1 2023-06-18 18:21:20,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=273150.0, ans=0.125 2023-06-18 18:21:30,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-18 18:21:39,233 INFO [train.py:996] (1/4) Epoch 2, batch 15050, loss[loss=0.308, simple_loss=0.3727, pruned_loss=0.1217, over 21421.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3787, pruned_loss=0.1341, over 4268585.91 frames. ], batch size: 194, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:21:42,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 3.625e+02 4.289e+02 5.445e+02 1.034e+03, threshold=8.577e+02, percent-clipped=3.0 2023-06-18 18:21:43,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-18 18:21:56,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=273330.0, ans=0.125 2023-06-18 18:23:16,046 INFO [train.py:996] (1/4) Epoch 2, batch 15100, loss[loss=0.3431, simple_loss=0.3981, pruned_loss=0.144, over 21371.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.382, pruned_loss=0.1337, over 4272541.43 frames. ], batch size: 176, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:23:19,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-18 18:23:24,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=273570.0, ans=0.125 2023-06-18 18:23:37,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.85 vs. limit=22.5 2023-06-18 18:23:46,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=273630.0, ans=0.0 2023-06-18 18:23:48,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=273630.0, ans=0.0 2023-06-18 18:23:48,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=273630.0, ans=0.125 2023-06-18 18:23:49,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. 
limit=10.0 2023-06-18 18:24:29,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=273750.0, ans=0.125 2023-06-18 18:24:29,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=273750.0, ans=0.125 2023-06-18 18:24:31,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=273750.0, ans=0.125 2023-06-18 18:24:52,900 INFO [train.py:996] (1/4) Epoch 2, batch 15150, loss[loss=0.3381, simple_loss=0.3569, pruned_loss=0.1596, over 21420.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3783, pruned_loss=0.1343, over 4272110.73 frames. ], batch size: 475, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:24:56,352 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.235e+02 3.924e+02 4.419e+02 1.242e+03, threshold=7.848e+02, percent-clipped=3.0 2023-06-18 18:25:19,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=273930.0, ans=0.125 2023-06-18 18:26:29,405 INFO [train.py:996] (1/4) Epoch 2, batch 15200, loss[loss=0.2381, simple_loss=0.3002, pruned_loss=0.08794, over 21746.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3667, pruned_loss=0.1281, over 4274508.30 frames. ], batch size: 124, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:28:05,727 INFO [train.py:996] (1/4) Epoch 2, batch 15250, loss[loss=0.3151, simple_loss=0.357, pruned_loss=0.1366, over 21219.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3607, pruned_loss=0.1265, over 4273061.02 frames. ], batch size: 159, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:28:08,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.957e+02 3.425e+02 4.072e+02 6.895e+02, threshold=6.850e+02, percent-clipped=0.0 2023-06-18 18:29:43,139 INFO [train.py:996] (1/4) Epoch 2, batch 15300, loss[loss=0.3235, simple_loss=0.3782, pruned_loss=0.1344, over 21817.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.3664, pruned_loss=0.1315, over 4268295.24 frames. ], batch size: 282, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:30:08,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=274830.0, ans=0.0 2023-06-18 18:30:27,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=274830.0, ans=0.125 2023-06-18 18:30:38,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-18 18:30:48,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=274950.0, ans=0.0 2023-06-18 18:30:58,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=274950.0, ans=0.125 2023-06-18 18:30:58,223 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:30:59,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=274950.0, ans=0.125 2023-06-18 18:31:19,241 INFO [train.py:996] (1/4) Epoch 2, batch 15350, loss[loss=0.3606, simple_loss=0.402, pruned_loss=0.1596, over 21434.00 frames. 
], tot_loss[loss=0.3219, simple_loss=0.3741, pruned_loss=0.1349, over 4267616.88 frames. ], batch size: 211, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:31:21,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-18 18:31:22,365 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.404e+02 4.041e+02 5.108e+02 1.058e+03, threshold=8.082e+02, percent-clipped=7.0 2023-06-18 18:31:31,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-18 18:32:09,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=275190.0, ans=0.0 2023-06-18 18:32:38,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=275310.0, ans=0.125 2023-06-18 18:32:38,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=275310.0, ans=0.025 2023-06-18 18:32:49,574 INFO [train.py:996] (1/4) Epoch 2, batch 15400, loss[loss=0.2566, simple_loss=0.3273, pruned_loss=0.09294, over 21782.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3755, pruned_loss=0.1326, over 4259038.26 frames. ], batch size: 112, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:33:09,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=275430.0, ans=0.0 2023-06-18 18:33:46,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-18 18:33:49,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-18 18:34:01,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=275550.0, ans=0.125 2023-06-18 18:34:24,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275610.0, ans=0.1 2023-06-18 18:34:27,580 INFO [train.py:996] (1/4) Epoch 2, batch 15450, loss[loss=0.3264, simple_loss=0.3864, pruned_loss=0.1333, over 21875.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3726, pruned_loss=0.1312, over 4258770.68 frames. ], batch size: 351, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:34:30,679 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.287e+02 3.877e+02 4.804e+02 8.434e+02, threshold=7.754e+02, percent-clipped=1.0 2023-06-18 18:36:05,300 INFO [train.py:996] (1/4) Epoch 2, batch 15500, loss[loss=0.2802, simple_loss=0.3202, pruned_loss=0.1201, over 20191.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3732, pruned_loss=0.1299, over 4246893.75 frames. 
], batch size: 702, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:36:40,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276030.0, ans=0.125 2023-06-18 18:37:00,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=276090.0, ans=0.125 2023-06-18 18:37:20,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=276150.0, ans=0.125 2023-06-18 18:37:45,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=276210.0, ans=0.95 2023-06-18 18:37:47,891 INFO [train.py:996] (1/4) Epoch 2, batch 15550, loss[loss=0.2374, simple_loss=0.3157, pruned_loss=0.07955, over 21642.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3686, pruned_loss=0.1261, over 4250097.21 frames. ], batch size: 263, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:37:51,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.141e+02 3.857e+02 4.872e+02 7.208e+02, threshold=7.715e+02, percent-clipped=0.0 2023-06-18 18:38:08,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=276270.0, ans=0.125 2023-06-18 18:38:12,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=276330.0, ans=0.04949747468305833 2023-06-18 18:38:37,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=276390.0, ans=0.09899494936611666 2023-06-18 18:39:10,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=276510.0, ans=0.09899494936611666 2023-06-18 18:39:24,629 INFO [train.py:996] (1/4) Epoch 2, batch 15600, loss[loss=0.3057, simple_loss=0.3742, pruned_loss=0.1186, over 21641.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3608, pruned_loss=0.1235, over 4252239.49 frames. ], batch size: 263, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:39:41,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=276570.0, ans=10.0 2023-06-18 18:40:26,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=276750.0, ans=0.125 2023-06-18 18:41:13,300 INFO [train.py:996] (1/4) Epoch 2, batch 15650, loss[loss=0.2819, simple_loss=0.3305, pruned_loss=0.1167, over 21868.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3578, pruned_loss=0.1221, over 4247650.93 frames. 
], batch size: 373, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:41:16,411 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.146e+02 3.937e+02 5.420e+02 1.080e+03, threshold=7.874e+02, percent-clipped=10.0 2023-06-18 18:41:30,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=276870.0, ans=0.07 2023-06-18 18:41:44,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=276930.0, ans=0.125 2023-06-18 18:41:45,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=276930.0, ans=0.1 2023-06-18 18:42:06,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=277050.0, ans=0.5 2023-06-18 18:42:07,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=277050.0, ans=0.0 2023-06-18 18:42:49,643 INFO [train.py:996] (1/4) Epoch 2, batch 15700, loss[loss=0.3497, simple_loss=0.3856, pruned_loss=0.1569, over 21537.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3543, pruned_loss=0.122, over 4250916.93 frames. ], batch size: 441, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:42:53,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=277170.0, ans=0.2 2023-06-18 18:42:59,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=277170.0, ans=0.95 2023-06-18 18:43:00,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-18 18:43:06,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-18 18:43:44,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=277350.0, ans=0.0 2023-06-18 18:43:46,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=277350.0, ans=0.125 2023-06-18 18:43:53,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=277410.0, ans=0.125 2023-06-18 18:44:09,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-18 18:44:13,561 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:44:19,882 INFO [train.py:996] (1/4) Epoch 2, batch 15750, loss[loss=0.2676, simple_loss=0.3237, pruned_loss=0.1057, over 21406.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3488, pruned_loss=0.121, over 4262037.71 frames. 
], batch size: 194, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:44:28,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.151e+02 3.926e+02 5.162e+02 7.477e+02, threshold=7.853e+02, percent-clipped=0.0 2023-06-18 18:44:54,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=277530.0, ans=0.0 2023-06-18 18:46:01,542 INFO [train.py:996] (1/4) Epoch 2, batch 15800, loss[loss=0.3063, simple_loss=0.3566, pruned_loss=0.128, over 21597.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.345, pruned_loss=0.1209, over 4259070.38 frames. ], batch size: 415, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:46:11,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277770.0, ans=0.1 2023-06-18 18:46:46,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=277890.0, ans=0.0 2023-06-18 18:47:18,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=278010.0, ans=0.0 2023-06-18 18:47:34,329 INFO [train.py:996] (1/4) Epoch 2, batch 15850, loss[loss=0.26, simple_loss=0.3117, pruned_loss=0.1041, over 21718.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3486, pruned_loss=0.1234, over 4256214.08 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:47:37,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.232e+02 3.999e+02 5.094e+02 1.481e+03, threshold=7.998e+02, percent-clipped=6.0 2023-06-18 18:47:39,091 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:47:46,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=278070.0, ans=0.0 2023-06-18 18:47:52,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=278130.0, ans=0.125 2023-06-18 18:48:13,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-18 18:48:32,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-18 18:49:10,698 INFO [train.py:996] (1/4) Epoch 2, batch 15900, loss[loss=0.2733, simple_loss=0.3558, pruned_loss=0.09541, over 20793.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3489, pruned_loss=0.1231, over 4260662.01 frames. ], batch size: 608, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:49:16,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=278370.0, ans=0.0 2023-06-18 18:50:02,432 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:50:41,654 INFO [train.py:996] (1/4) Epoch 2, batch 15950, loss[loss=0.2789, simple_loss=0.3191, pruned_loss=0.1193, over 21365.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.349, pruned_loss=0.12, over 4266632.67 frames. 
], batch size: 144, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:50:49,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.268e+02 3.930e+02 5.229e+02 1.203e+03, threshold=7.860e+02, percent-clipped=8.0 2023-06-18 18:52:18,713 INFO [train.py:996] (1/4) Epoch 2, batch 16000, loss[loss=0.3061, simple_loss=0.3821, pruned_loss=0.1151, over 21713.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3512, pruned_loss=0.1177, over 4275523.52 frames. ], batch size: 414, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:53:16,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=279150.0, ans=0.125 2023-06-18 18:53:33,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=279210.0, ans=0.125 2023-06-18 18:53:33,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-18 18:53:49,673 INFO [train.py:996] (1/4) Epoch 2, batch 16050, loss[loss=0.2013, simple_loss=0.2892, pruned_loss=0.05676, over 21454.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3503, pruned_loss=0.1136, over 4283956.92 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:53:57,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 3.161e+02 3.861e+02 4.688e+02 7.896e+02, threshold=7.722e+02, percent-clipped=1.0 2023-06-18 18:54:09,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=279270.0, ans=0.125 2023-06-18 18:54:36,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=279390.0, ans=0.125 2023-06-18 18:54:47,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=279390.0, ans=0.0 2023-06-18 18:55:22,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279510.0, ans=0.1 2023-06-18 18:55:25,215 INFO [train.py:996] (1/4) Epoch 2, batch 16100, loss[loss=0.3034, simple_loss=0.3605, pruned_loss=0.1231, over 21520.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3551, pruned_loss=0.1169, over 4284668.45 frames. ], batch size: 131, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:55:26,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=279570.0, ans=15.0 2023-06-18 18:56:12,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=279690.0, ans=0.0 2023-06-18 18:56:52,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=279810.0, ans=0.025 2023-06-18 18:56:55,119 INFO [train.py:996] (1/4) Epoch 2, batch 16150, loss[loss=0.3766, simple_loss=0.4164, pruned_loss=0.1684, over 21855.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3561, pruned_loss=0.121, over 4283822.96 frames. 
], batch size: 414, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:57:02,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.360e+02 4.498e+02 6.275e+02 1.287e+03, threshold=8.996e+02, percent-clipped=10.0 2023-06-18 18:57:39,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-18 18:57:54,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=279990.0, ans=0.125 2023-06-18 18:58:00,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=280050.0, ans=0.09899494936611666 2023-06-18 18:58:16,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=280110.0, ans=0.2 2023-06-18 18:58:16,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=280110.0, ans=0.125 2023-06-18 18:58:21,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=280110.0, ans=0.125 2023-06-18 18:58:35,856 INFO [train.py:996] (1/4) Epoch 2, batch 16200, loss[loss=0.4033, simple_loss=0.4378, pruned_loss=0.1844, over 21800.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3596, pruned_loss=0.122, over 4290961.34 frames. ], batch size: 441, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 18:58:36,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280170.0, ans=0.1 2023-06-18 18:59:08,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=280230.0, ans=0.0 2023-06-18 18:59:22,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280290.0, ans=0.1 2023-06-18 19:00:11,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=280470.0, ans=0.125 2023-06-18 19:00:17,021 INFO [train.py:996] (1/4) Epoch 2, batch 16250, loss[loss=0.3172, simple_loss=0.3871, pruned_loss=0.1236, over 19963.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3616, pruned_loss=0.123, over 4285914.73 frames. ], batch size: 703, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:00:19,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.939e+02 3.512e+02 5.146e+02 1.306e+03, threshold=7.023e+02, percent-clipped=4.0 2023-06-18 19:00:42,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=280530.0, ans=0.035 2023-06-18 19:01:14,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=280650.0, ans=0.0 2023-06-18 19:01:48,540 INFO [train.py:996] (1/4) Epoch 2, batch 16300, loss[loss=0.2093, simple_loss=0.2874, pruned_loss=0.06566, over 21584.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3544, pruned_loss=0.1176, over 4279024.14 frames. 
], batch size: 230, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:01:52,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=280770.0, ans=0.125 2023-06-18 19:02:05,076 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.268e-01 2023-06-18 19:02:20,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=280830.0, ans=0.5 2023-06-18 19:02:22,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-18 19:02:44,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=280950.0, ans=0.0 2023-06-18 19:03:31,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-18 19:03:31,605 INFO [train.py:996] (1/4) Epoch 2, batch 16350, loss[loss=0.3728, simple_loss=0.4192, pruned_loss=0.1632, over 21278.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3565, pruned_loss=0.1195, over 4281044.59 frames. ], batch size: 143, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:03:34,721 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.295e+02 4.517e+02 5.318e+02 1.033e+03, threshold=9.034e+02, percent-clipped=9.0 2023-06-18 19:03:54,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=281130.0, ans=0.125 2023-06-18 19:04:20,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=281250.0, ans=0.2 2023-06-18 19:04:34,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-18 19:05:08,335 INFO [train.py:996] (1/4) Epoch 2, batch 16400, loss[loss=0.306, simple_loss=0.3512, pruned_loss=0.1304, over 21682.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3609, pruned_loss=0.1219, over 4281101.01 frames. ], batch size: 263, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:05:25,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=281430.0, ans=0.2 2023-06-18 19:06:16,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=281610.0, ans=0.125 2023-06-18 19:06:43,705 INFO [train.py:996] (1/4) Epoch 2, batch 16450, loss[loss=0.3187, simple_loss=0.3532, pruned_loss=0.1421, over 21346.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3603, pruned_loss=0.1235, over 4282407.60 frames. ], batch size: 608, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:06:46,790 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.250e+02 3.733e+02 4.517e+02 7.172e+02, threshold=7.466e+02, percent-clipped=0.0 2023-06-18 19:07:04,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=281730.0, ans=0.0 2023-06-18 19:07:12,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. 
limit=22.5 2023-06-18 19:07:31,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=281850.0, ans=0.0 2023-06-18 19:08:08,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.18 vs. limit=22.5 2023-06-18 19:08:18,608 INFO [train.py:996] (1/4) Epoch 2, batch 16500, loss[loss=0.2456, simple_loss=0.2988, pruned_loss=0.09622, over 21529.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3597, pruned_loss=0.1239, over 4288529.06 frames. ], batch size: 211, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:08:41,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=282030.0, ans=0.2 2023-06-18 19:08:49,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=282090.0, ans=0.2 2023-06-18 19:08:59,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-18 19:09:08,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-18 19:09:09,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=282150.0, ans=0.125 2023-06-18 19:09:11,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282150.0, ans=0.125 2023-06-18 19:09:57,179 INFO [train.py:996] (1/4) Epoch 2, batch 16550, loss[loss=0.2986, simple_loss=0.3789, pruned_loss=0.1091, over 21299.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3553, pruned_loss=0.1189, over 4274164.83 frames. ], batch size: 548, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:09:59,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-18 19:10:00,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.451e+02 4.279e+02 5.005e+02 9.425e+02, threshold=8.558e+02, percent-clipped=2.0 2023-06-18 19:10:03,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=282270.0, ans=0.0 2023-06-18 19:10:06,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.31 vs. limit=6.0 2023-06-18 19:10:08,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=282270.0, ans=0.125 2023-06-18 19:10:10,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-18 19:11:05,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. 
limit=10.0 2023-06-18 19:11:25,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=282510.0, ans=0.0 2023-06-18 19:11:30,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=282510.0, ans=0.0 2023-06-18 19:11:31,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=282510.0, ans=0.125 2023-06-18 19:11:32,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-18 19:11:36,186 INFO [train.py:996] (1/4) Epoch 2, batch 16600, loss[loss=0.3477, simple_loss=0.4286, pruned_loss=0.1334, over 21567.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3668, pruned_loss=0.1241, over 4271647.26 frames. ], batch size: 230, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:11:47,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=282570.0, ans=0.0 2023-06-18 19:13:15,406 INFO [train.py:996] (1/4) Epoch 2, batch 16650, loss[loss=0.3385, simple_loss=0.3937, pruned_loss=0.1416, over 21598.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3762, pruned_loss=0.128, over 4271160.49 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:13:17,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=282870.0, ans=0.125 2023-06-18 19:13:18,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.486e+02 3.999e+02 5.162e+02 7.260e+02, threshold=7.998e+02, percent-clipped=0.0 2023-06-18 19:13:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=282930.0, ans=0.0 2023-06-18 19:14:01,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=282930.0, ans=15.0 2023-06-18 19:14:16,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.90 vs. limit=10.0 2023-06-18 19:14:53,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.55 vs. limit=6.0 2023-06-18 19:14:55,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=283110.0, ans=0.125 2023-06-18 19:15:10,015 INFO [train.py:996] (1/4) Epoch 2, batch 16700, loss[loss=0.3024, simple_loss=0.3895, pruned_loss=0.1077, over 20784.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3743, pruned_loss=0.1265, over 4271798.44 frames. ], batch size: 608, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:15:20,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=283170.0, ans=0.125 2023-06-18 19:15:49,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2023-06-18 19:16:20,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=283350.0, ans=0.125 2023-06-18 19:16:47,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=283410.0, ans=0.125 2023-06-18 19:16:51,996 INFO [train.py:996] (1/4) Epoch 2, batch 16750, loss[loss=0.3307, simple_loss=0.4035, pruned_loss=0.1289, over 21300.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3776, pruned_loss=0.1292, over 4271605.07 frames. ], batch size: 549, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:16:55,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.344e+02 3.957e+02 4.930e+02 8.837e+02, threshold=7.914e+02, percent-clipped=3.0 2023-06-18 19:17:02,618 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:17:33,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283590.0, ans=0.1 2023-06-18 19:17:55,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=283650.0, ans=0.125 2023-06-18 19:17:58,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=283650.0, ans=0.2 2023-06-18 19:18:24,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=15.0 2023-06-18 19:18:30,454 INFO [train.py:996] (1/4) Epoch 2, batch 16800, loss[loss=0.2921, simple_loss=0.3359, pruned_loss=0.1241, over 21368.00 frames. ], tot_loss[loss=0.3206, simple_loss=0.3813, pruned_loss=0.1299, over 4268934.45 frames. ], batch size: 159, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:18:37,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-18 19:18:40,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=283770.0, ans=0.0 2023-06-18 19:18:48,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.06 vs. limit=10.0 2023-06-18 19:19:00,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=283830.0, ans=0.125 2023-06-18 19:19:21,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=283890.0, ans=0.1 2023-06-18 19:19:31,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=283950.0, ans=0.125 2023-06-18 19:19:33,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=22.5 2023-06-18 19:20:03,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=284010.0, ans=0.0 2023-06-18 19:20:06,221 INFO [train.py:996] (1/4) Epoch 2, batch 16850, loss[loss=0.2857, simple_loss=0.3359, pruned_loss=0.1178, over 21491.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3776, pruned_loss=0.1291, over 4276074.42 frames. 
], batch size: 177, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:20:09,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.797e+02 4.349e+02 5.361e+02 8.347e+02, threshold=8.698e+02, percent-clipped=3.0 2023-06-18 19:20:09,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=284070.0, ans=0.125 2023-06-18 19:20:17,289 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:21:32,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-18 19:21:37,620 INFO [train.py:996] (1/4) Epoch 2, batch 16900, loss[loss=0.2814, simple_loss=0.3364, pruned_loss=0.1132, over 21512.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3725, pruned_loss=0.1273, over 4269612.34 frames. ], batch size: 389, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:21:43,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-18 19:21:50,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=284370.0, ans=0.125 2023-06-18 19:22:37,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=284550.0, ans=0.04949747468305833 2023-06-18 19:23:13,326 INFO [train.py:996] (1/4) Epoch 2, batch 16950, loss[loss=0.2907, simple_loss=0.3349, pruned_loss=0.1232, over 21571.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3657, pruned_loss=0.1247, over 4273649.70 frames. ], batch size: 212, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:23:16,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.116e+02 4.142e+02 5.213e+02 8.477e+02, threshold=8.284e+02, percent-clipped=0.0 2023-06-18 19:23:38,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=284730.0, ans=0.0 2023-06-18 19:24:10,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-18 19:24:24,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=284850.0, ans=0.2 2023-06-18 19:24:44,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=284970.0, ans=0.125 2023-06-18 19:24:45,708 INFO [train.py:996] (1/4) Epoch 2, batch 17000, loss[loss=0.2947, simple_loss=0.3257, pruned_loss=0.1318, over 20325.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3631, pruned_loss=0.1261, over 4285301.81 frames. ], batch size: 703, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:25:06,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.20 vs. limit=6.0 2023-06-18 19:26:22,063 INFO [train.py:996] (1/4) Epoch 2, batch 17050, loss[loss=0.2941, simple_loss=0.3606, pruned_loss=0.1139, over 21465.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3706, pruned_loss=0.1288, over 4282602.28 frames. 
], batch size: 211, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:26:25,249 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.160e+02 3.853e+02 4.663e+02 1.166e+03, threshold=7.706e+02, percent-clipped=2.0 2023-06-18 19:26:27,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=285270.0, ans=0.125 2023-06-18 19:27:13,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=285390.0, ans=0.2 2023-06-18 19:27:25,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=285450.0, ans=0.0 2023-06-18 19:27:51,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=285510.0, ans=0.0 2023-06-18 19:27:56,595 INFO [train.py:996] (1/4) Epoch 2, batch 17100, loss[loss=0.3148, simple_loss=0.3536, pruned_loss=0.1381, over 21387.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3689, pruned_loss=0.13, over 4291057.73 frames. ], batch size: 176, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:28:01,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285570.0, ans=0.1 2023-06-18 19:28:47,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=285690.0, ans=0.125 2023-06-18 19:28:59,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=285750.0, ans=0.125 2023-06-18 19:28:59,098 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:29:12,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=12.0 2023-06-18 19:29:26,176 INFO [train.py:996] (1/4) Epoch 2, batch 17150, loss[loss=0.2835, simple_loss=0.3292, pruned_loss=0.1189, over 21563.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3635, pruned_loss=0.1285, over 4297851.09 frames. ], batch size: 548, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:29:29,307 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.459e+02 4.285e+02 5.134e+02 9.644e+02, threshold=8.570e+02, percent-clipped=5.0 2023-06-18 19:29:34,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=285870.0, ans=0.1 2023-06-18 19:29:37,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=285870.0, ans=0.125 2023-06-18 19:30:08,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=285930.0, ans=0.0 2023-06-18 19:30:48,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=12.0 2023-06-18 19:30:58,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=286110.0, ans=0.125 2023-06-18 19:31:02,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=286170.0, ans=0.1 2023-06-18 19:31:03,075 INFO [train.py:996] (1/4) Epoch 2, batch 17200, loss[loss=0.3365, simple_loss=0.3847, pruned_loss=0.1441, over 21757.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3638, pruned_loss=0.129, over 4300309.05 frames. ], batch size: 332, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:31:11,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=286170.0, ans=0.125 2023-06-18 19:31:52,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-18 19:32:11,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=286350.0, ans=0.2 2023-06-18 19:32:50,084 INFO [train.py:996] (1/4) Epoch 2, batch 17250, loss[loss=0.3493, simple_loss=0.4019, pruned_loss=0.1484, over 21661.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3695, pruned_loss=0.1321, over 4297403.38 frames. ], batch size: 263, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:32:53,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.381e+02 4.035e+02 5.054e+02 8.566e+02, threshold=8.070e+02, percent-clipped=0.0 2023-06-18 19:34:28,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=286710.0, ans=0.125 2023-06-18 19:34:33,913 INFO [train.py:996] (1/4) Epoch 2, batch 17300, loss[loss=0.3923, simple_loss=0.4279, pruned_loss=0.1784, over 21429.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3762, pruned_loss=0.1353, over 4292123.24 frames. ], batch size: 471, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:35:58,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=287010.0, ans=0.125 2023-06-18 19:36:18,561 INFO [train.py:996] (1/4) Epoch 2, batch 17350, loss[loss=0.3078, simple_loss=0.3685, pruned_loss=0.1236, over 21823.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3776, pruned_loss=0.1355, over 4287529.87 frames. ], batch size: 282, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:36:23,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.256e+02 3.949e+02 5.005e+02 1.104e+03, threshold=7.898e+02, percent-clipped=4.0 2023-06-18 19:37:16,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=287250.0, ans=0.0 2023-06-18 19:37:50,472 INFO [train.py:996] (1/4) Epoch 2, batch 17400, loss[loss=0.2581, simple_loss=0.321, pruned_loss=0.09756, over 21674.00 frames. ], tot_loss[loss=0.3173, simple_loss=0.3738, pruned_loss=0.1304, over 4292600.69 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:37:51,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-06-18 19:37:55,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=287370.0, ans=0.125 2023-06-18 19:37:56,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-18 19:38:08,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=287430.0, ans=0.04949747468305833 2023-06-18 19:38:33,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287490.0, ans=0.1 2023-06-18 19:39:10,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-18 19:39:13,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=287610.0, ans=0.125 2023-06-18 19:39:24,238 INFO [train.py:996] (1/4) Epoch 2, batch 17450, loss[loss=0.2387, simple_loss=0.3354, pruned_loss=0.07103, over 21607.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3683, pruned_loss=0.1261, over 4284168.02 frames. ], batch size: 389, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:39:27,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287670.0, ans=0.1 2023-06-18 19:39:28,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.947e+02 3.682e+02 4.725e+02 7.588e+02, threshold=7.364e+02, percent-clipped=0.0 2023-06-18 19:40:10,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=287790.0, ans=0.125 2023-06-18 19:40:11,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=287790.0, ans=0.125 2023-06-18 19:40:32,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=287850.0, ans=0.0 2023-06-18 19:40:54,705 INFO [train.py:996] (1/4) Epoch 2, batch 17500, loss[loss=0.294, simple_loss=0.3477, pruned_loss=0.1201, over 21866.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3654, pruned_loss=0.1232, over 4279502.30 frames. ], batch size: 414, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:41:16,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=288030.0, ans=0.125 2023-06-18 19:41:54,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=288150.0, ans=0.2 2023-06-18 19:41:57,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=288150.0, ans=0.125 2023-06-18 19:42:06,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288150.0, ans=0.1 2023-06-18 19:42:09,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=288210.0, ans=0.125 2023-06-18 19:42:25,477 INFO [train.py:996] (1/4) Epoch 2, batch 17550, loss[loss=0.2532, simple_loss=0.3324, pruned_loss=0.08694, over 21792.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3642, pruned_loss=0.1204, over 4282555.05 frames. 
], batch size: 112, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:42:28,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-18 19:42:29,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=288270.0, ans=0.0 2023-06-18 19:42:30,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 3.260e+02 4.297e+02 5.739e+02 1.320e+03, threshold=8.594e+02, percent-clipped=8.0 2023-06-18 19:42:33,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=288270.0, ans=0.125 2023-06-18 19:43:46,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=288510.0, ans=0.2 2023-06-18 19:44:01,360 INFO [train.py:996] (1/4) Epoch 2, batch 17600, loss[loss=0.303, simple_loss=0.3595, pruned_loss=0.1232, over 21965.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.364, pruned_loss=0.1196, over 4277259.27 frames. ], batch size: 317, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:44:19,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=288630.0, ans=0.125 2023-06-18 19:44:20,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=288630.0, ans=0.0 2023-06-18 19:45:28,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=288810.0, ans=0.0 2023-06-18 19:45:34,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=288810.0, ans=0.0 2023-06-18 19:45:39,884 INFO [train.py:996] (1/4) Epoch 2, batch 17650, loss[loss=0.2612, simple_loss=0.3408, pruned_loss=0.09084, over 21215.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3613, pruned_loss=0.1208, over 4261814.27 frames. ], batch size: 549, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:45:40,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-18 19:45:44,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.915e+02 4.145e+02 5.677e+02 8.803e+02, threshold=8.289e+02, percent-clipped=2.0 2023-06-18 19:46:22,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288930.0, ans=0.1 2023-06-18 19:46:28,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=288990.0, ans=0.125 2023-06-18 19:46:28,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288990.0, ans=0.1 2023-06-18 19:46:50,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=289050.0, ans=0.09899494936611666 2023-06-18 19:47:17,239 INFO [train.py:996] (1/4) Epoch 2, batch 17700, loss[loss=0.3037, simple_loss=0.3627, pruned_loss=0.1223, over 21392.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3553, pruned_loss=0.1169, over 4262449.84 frames. 
], batch size: 194, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:48:25,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=289350.0, ans=0.0 2023-06-18 19:48:55,989 INFO [train.py:996] (1/4) Epoch 2, batch 17750, loss[loss=0.3637, simple_loss=0.4191, pruned_loss=0.1542, over 21256.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3639, pruned_loss=0.121, over 4265727.85 frames. ], batch size: 143, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:49:10,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.081e+02 4.126e+02 5.116e+02 1.162e+03, threshold=8.253e+02, percent-clipped=4.0 2023-06-18 19:49:12,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-18 19:49:31,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289530.0, ans=0.1 2023-06-18 19:49:36,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=289530.0, ans=0.025 2023-06-18 19:49:44,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=289590.0, ans=0.5 2023-06-18 19:49:53,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.45 vs. limit=22.5 2023-06-18 19:50:04,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=289650.0, ans=0.125 2023-06-18 19:50:06,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=289650.0, ans=0.125 2023-06-18 19:50:12,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=289710.0, ans=0.0 2023-06-18 19:50:30,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-18 19:50:40,986 INFO [train.py:996] (1/4) Epoch 2, batch 17800, loss[loss=0.3093, simple_loss=0.3768, pruned_loss=0.1209, over 21875.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.364, pruned_loss=0.1205, over 4267660.74 frames. ], batch size: 372, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:50:50,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=289770.0, ans=0.0 2023-06-18 19:51:05,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=289830.0, ans=0.2 2023-06-18 19:51:33,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-18 19:51:41,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=289950.0, ans=0.125 2023-06-18 19:52:02,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=289950.0, ans=0.0 2023-06-18 19:52:30,844 INFO [train.py:996] (1/4) Epoch 2, batch 17850, loss[loss=0.3571, simple_loss=0.3808, pruned_loss=0.1667, over 20135.00 frames. 
], tot_loss[loss=0.3054, simple_loss=0.3661, pruned_loss=0.1224, over 4265360.37 frames. ], batch size: 703, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:52:31,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290070.0, ans=0.1 2023-06-18 19:52:35,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.113e+02 3.565e+02 4.214e+02 1.146e+03, threshold=7.130e+02, percent-clipped=1.0 2023-06-18 19:52:55,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=290130.0, ans=0.125 2023-06-18 19:53:19,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=290190.0, ans=0.2 2023-06-18 19:53:56,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290310.0, ans=0.1 2023-06-18 19:54:09,665 INFO [train.py:996] (1/4) Epoch 2, batch 17900, loss[loss=0.2798, simple_loss=0.3701, pruned_loss=0.09471, over 21757.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3717, pruned_loss=0.1255, over 4264330.76 frames. ], batch size: 332, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:54:17,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-18 19:55:19,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=290550.0, ans=0.125 2023-06-18 19:55:47,647 INFO [train.py:996] (1/4) Epoch 2, batch 17950, loss[loss=0.2587, simple_loss=0.3293, pruned_loss=0.09409, over 21432.00 frames. ], tot_loss[loss=0.307, simple_loss=0.372, pruned_loss=0.121, over 4262272.45 frames. ], batch size: 194, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:55:48,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=290670.0, ans=0.0 2023-06-18 19:55:52,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.329e+02 4.288e+02 5.696e+02 7.966e+02, threshold=8.576e+02, percent-clipped=5.0 2023-06-18 19:56:22,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290790.0, ans=0.125 2023-06-18 19:57:14,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=290910.0, ans=0.5 2023-06-18 19:57:16,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-18 19:57:18,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=290910.0, ans=0.1 2023-06-18 19:57:22,921 INFO [train.py:996] (1/4) Epoch 2, batch 18000, loss[loss=0.2749, simple_loss=0.3265, pruned_loss=0.1116, over 21804.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3644, pruned_loss=0.1201, over 4255032.28 frames. ], batch size: 98, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:57:22,922 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 19:57:38,931 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2951, simple_loss=0.3927, pruned_loss=0.09871, over 1796401.00 frames. 
2023-06-18 19:57:38,932 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 19:57:44,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=290970.0, ans=0.125 2023-06-18 19:57:51,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=290970.0, ans=0.0 2023-06-18 19:58:01,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-18 19:58:12,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=291030.0, ans=0.04949747468305833 2023-06-18 19:59:15,532 INFO [train.py:996] (1/4) Epoch 2, batch 18050, loss[loss=0.2748, simple_loss=0.3247, pruned_loss=0.1124, over 21874.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3595, pruned_loss=0.1192, over 4251821.96 frames. ], batch size: 113, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:59:21,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=291270.0, ans=0.125 2023-06-18 19:59:24,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.442e+02 4.328e+02 5.219e+02 8.565e+02, threshold=8.656e+02, percent-clipped=0.0 2023-06-18 19:59:46,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=291330.0, ans=0.125 2023-06-18 19:59:49,184 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:00:54,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=291510.0, ans=0.125 2023-06-18 20:00:57,636 INFO [train.py:996] (1/4) Epoch 2, batch 18100, loss[loss=0.3449, simple_loss=0.3879, pruned_loss=0.1509, over 21662.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3649, pruned_loss=0.1225, over 4246962.17 frames. ], batch size: 351, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:01:31,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-18 20:02:34,119 INFO [train.py:996] (1/4) Epoch 2, batch 18150, loss[loss=0.3055, simple_loss=0.3682, pruned_loss=0.1214, over 21738.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3662, pruned_loss=0.1223, over 4241314.35 frames. ], batch size: 282, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:02:38,922 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.253e+02 3.964e+02 4.730e+02 7.645e+02, threshold=7.929e+02, percent-clipped=0.0 2023-06-18 20:02:40,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=291870.0, ans=0.125 2023-06-18 20:03:03,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-18 20:03:06,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-18 20:04:09,554 INFO [train.py:996] (1/4) Epoch 2, batch 18200, loss[loss=0.2592, simple_loss=0.3138, pruned_loss=0.1023, over 21576.00 frames. 
], tot_loss[loss=0.2995, simple_loss=0.3576, pruned_loss=0.1207, over 4251106.71 frames. ], batch size: 247, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:04:18,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=292170.0, ans=0.2 2023-06-18 20:04:44,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292230.0, ans=0.1 2023-06-18 20:05:04,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2023-06-18 20:05:22,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-18 20:05:30,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=292410.0, ans=0.125 2023-06-18 20:05:39,481 INFO [train.py:996] (1/4) Epoch 2, batch 18250, loss[loss=0.2357, simple_loss=0.2942, pruned_loss=0.08859, over 21540.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3472, pruned_loss=0.1161, over 4259695.95 frames. ], batch size: 230, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:05:44,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.938e+02 3.797e+02 4.811e+02 7.621e+02, threshold=7.594e+02, percent-clipped=0.0 2023-06-18 20:06:04,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=292530.0, ans=0.2 2023-06-18 20:06:32,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=292590.0, ans=0.125 2023-06-18 20:06:36,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=292590.0, ans=0.0 2023-06-18 20:06:56,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=292650.0, ans=0.2 2023-06-18 20:07:10,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292710.0, ans=0.1 2023-06-18 20:07:15,030 INFO [train.py:996] (1/4) Epoch 2, batch 18300, loss[loss=0.3082, simple_loss=0.4072, pruned_loss=0.1046, over 21716.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3494, pruned_loss=0.1172, over 4265582.75 frames. ], batch size: 298, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:50,533 INFO [train.py:996] (1/4) Epoch 2, batch 18350, loss[loss=0.3083, simple_loss=0.3588, pruned_loss=0.129, over 21294.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3539, pruned_loss=0.1179, over 4258240.81 frames. 
], batch size: 176, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:51,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=293070.0, ans=0.125 2023-06-18 20:08:54,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=293070.0, ans=0.1 2023-06-18 20:08:55,231 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.344e+02 4.081e+02 5.632e+02 1.157e+03, threshold=8.162e+02, percent-clipped=13.0 2023-06-18 20:09:46,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=293190.0, ans=0.125 2023-06-18 20:10:27,220 INFO [train.py:996] (1/4) Epoch 2, batch 18400, loss[loss=0.2985, simple_loss=0.3754, pruned_loss=0.1108, over 21258.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3487, pruned_loss=0.1156, over 4256346.60 frames. ], batch size: 551, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:11:33,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=293550.0, ans=0.2 2023-06-18 20:11:45,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=293550.0, ans=0.2 2023-06-18 20:12:01,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=293610.0, ans=0.04949747468305833 2023-06-18 20:12:08,210 INFO [train.py:996] (1/4) Epoch 2, batch 18450, loss[loss=0.2273, simple_loss=0.2977, pruned_loss=0.0784, over 21642.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3472, pruned_loss=0.1117, over 4256525.80 frames. ], batch size: 263, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:12:12,705 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.811e+02 3.741e+02 4.962e+02 8.715e+02, threshold=7.483e+02, percent-clipped=1.0 2023-06-18 20:12:24,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=293730.0, ans=0.125 2023-06-18 20:13:45,330 INFO [train.py:996] (1/4) Epoch 2, batch 18500, loss[loss=0.2636, simple_loss=0.348, pruned_loss=0.08959, over 21633.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3433, pruned_loss=0.1103, over 4258606.83 frames. ], batch size: 414, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:14:10,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=294030.0, ans=0.0 2023-06-18 20:14:28,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=294090.0, ans=0.125 2023-06-18 20:15:17,125 INFO [train.py:996] (1/4) Epoch 2, batch 18550, loss[loss=0.2564, simple_loss=0.3211, pruned_loss=0.09588, over 21718.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3394, pruned_loss=0.1097, over 4260343.86 frames. 
], batch size: 316, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:15:26,535 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.027e+02 3.740e+02 5.224e+02 1.354e+03, threshold=7.479e+02, percent-clipped=4.0 2023-06-18 20:15:31,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=294270.0, ans=0.125 2023-06-18 20:16:20,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=294390.0, ans=0.2 2023-06-18 20:16:51,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=294510.0, ans=0.125 2023-06-18 20:16:58,788 INFO [train.py:996] (1/4) Epoch 2, batch 18600, loss[loss=0.2439, simple_loss=0.3132, pruned_loss=0.08731, over 21560.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3396, pruned_loss=0.1114, over 4265270.40 frames. ], batch size: 230, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:17:04,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-18 20:17:18,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=294630.0, ans=0.04949747468305833 2023-06-18 20:17:31,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.56 vs. limit=15.0 2023-06-18 20:17:32,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.06 vs. limit=22.5 2023-06-18 20:17:50,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294690.0, ans=0.1 2023-06-18 20:18:35,034 INFO [train.py:996] (1/4) Epoch 2, batch 18650, loss[loss=0.2656, simple_loss=0.317, pruned_loss=0.1071, over 21790.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.338, pruned_loss=0.1114, over 4266847.57 frames. ], batch size: 118, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:18:40,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.101e+02 3.636e+02 4.483e+02 8.727e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-18 20:18:43,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-18 20:18:45,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.28 vs. limit=15.0 2023-06-18 20:19:04,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=294930.0, ans=0.125 2023-06-18 20:19:49,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=295110.0, ans=0.0 2023-06-18 20:20:06,181 INFO [train.py:996] (1/4) Epoch 2, batch 18700, loss[loss=0.2913, simple_loss=0.3458, pruned_loss=0.1184, over 21880.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3358, pruned_loss=0.1134, over 4267532.55 frames. ], batch size: 107, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:04,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. 
limit=6.0 2023-06-18 20:21:06,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295290.0, ans=0.0 2023-06-18 20:21:40,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=295470.0, ans=0.1 2023-06-18 20:21:41,679 INFO [train.py:996] (1/4) Epoch 2, batch 18750, loss[loss=0.2648, simple_loss=0.3133, pruned_loss=0.1081, over 21506.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3405, pruned_loss=0.1177, over 4281070.05 frames. ], batch size: 212, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:51,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.96 vs. limit=10.0 2023-06-18 20:21:52,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.155e+02 3.822e+02 4.527e+02 9.030e+02, threshold=7.645e+02, percent-clipped=1.0 2023-06-18 20:22:48,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=295650.0, ans=0.125 2023-06-18 20:22:59,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.44 vs. limit=15.0 2023-06-18 20:23:17,269 INFO [train.py:996] (1/4) Epoch 2, batch 18800, loss[loss=0.2343, simple_loss=0.3199, pruned_loss=0.07438, over 21883.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3448, pruned_loss=0.1174, over 4273291.16 frames. ], batch size: 316, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:23:42,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=295830.0, ans=0.125 2023-06-18 20:23:45,704 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:24:24,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=295950.0, ans=0.125 2023-06-18 20:24:56,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=296070.0, ans=0.0 2023-06-18 20:24:57,824 INFO [train.py:996] (1/4) Epoch 2, batch 18850, loss[loss=0.262, simple_loss=0.3141, pruned_loss=0.1049, over 21633.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3396, pruned_loss=0.1117, over 4272699.55 frames. 
], batch size: 263, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:25:03,588 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.896e+02 3.612e+02 4.924e+02 1.009e+03, threshold=7.223e+02, percent-clipped=2.0 2023-06-18 20:26:04,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=296250.0, ans=0.125 2023-06-18 20:26:26,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=296310.0, ans=0.125 2023-06-18 20:26:26,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=296310.0, ans=0.125 2023-06-18 20:26:29,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=296370.0, ans=0.09899494936611666 2023-06-18 20:26:29,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-18 20:26:30,498 INFO [train.py:996] (1/4) Epoch 2, batch 18900, loss[loss=0.3273, simple_loss=0.353, pruned_loss=0.1508, over 21433.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3373, pruned_loss=0.1134, over 4252525.06 frames. ], batch size: 194, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:26:48,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=296430.0, ans=0.125 2023-06-18 20:27:23,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=296490.0, ans=0.125 2023-06-18 20:27:56,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=296610.0, ans=0.0 2023-06-18 20:28:07,307 INFO [train.py:996] (1/4) Epoch 2, batch 18950, loss[loss=0.3031, simple_loss=0.3883, pruned_loss=0.1089, over 21757.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3399, pruned_loss=0.1164, over 4254313.54 frames. ], batch size: 298, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:28:18,298 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.002e+02 3.745e+02 4.553e+02 7.623e+02, threshold=7.489e+02, percent-clipped=1.0 2023-06-18 20:29:08,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-18 20:29:12,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=296850.0, ans=0.0 2023-06-18 20:29:14,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=296850.0, ans=0.2 2023-06-18 20:29:37,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=296910.0, ans=0.0 2023-06-18 20:29:44,719 INFO [train.py:996] (1/4) Epoch 2, batch 19000, loss[loss=0.4097, simple_loss=0.4508, pruned_loss=0.1843, over 21590.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3496, pruned_loss=0.1179, over 4264875.09 frames. 
], batch size: 415, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:30:48,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=297090.0, ans=0.04949747468305833 2023-06-18 20:30:59,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.21 vs. limit=15.0 2023-06-18 20:31:05,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=297150.0, ans=0.035 2023-06-18 20:31:08,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=297210.0, ans=0.125 2023-06-18 20:31:28,459 INFO [train.py:996] (1/4) Epoch 2, batch 19050, loss[loss=0.2968, simple_loss=0.3516, pruned_loss=0.1211, over 21421.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3562, pruned_loss=0.123, over 4270817.33 frames. ], batch size: 131, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:31:34,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.128e+02 4.146e+02 5.526e+02 1.033e+03, threshold=8.291e+02, percent-clipped=8.0 2023-06-18 20:31:42,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=297330.0, ans=0.125 2023-06-18 20:32:08,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=297390.0, ans=0.125 2023-06-18 20:32:39,142 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:33:02,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297510.0, ans=0.1 2023-06-18 20:33:05,251 INFO [train.py:996] (1/4) Epoch 2, batch 19100, loss[loss=0.36, simple_loss=0.381, pruned_loss=0.1695, over 20153.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3549, pruned_loss=0.1242, over 4263635.52 frames. ], batch size: 707, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:33:12,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=297570.0, ans=0.125 2023-06-18 20:33:29,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=297630.0, ans=0.125 2023-06-18 20:34:06,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297750.0, ans=0.1 2023-06-18 20:34:09,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=297750.0, ans=0.0 2023-06-18 20:34:22,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=297810.0, ans=0.1 2023-06-18 20:34:34,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=15.0 2023-06-18 20:34:44,309 INFO [train.py:996] (1/4) Epoch 2, batch 19150, loss[loss=0.3742, simple_loss=0.456, pruned_loss=0.1463, over 20800.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.358, pruned_loss=0.1258, over 4267572.20 frames. 
], batch size: 607, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:34:44,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=297870.0, ans=0.0 2023-06-18 20:34:51,108 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.456e+02 4.497e+02 5.703e+02 1.044e+03, threshold=8.993e+02, percent-clipped=4.0 2023-06-18 20:35:10,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=297930.0, ans=0.95 2023-06-18 20:36:01,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=298050.0, ans=0.125 2023-06-18 20:36:09,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=298110.0, ans=0.07 2023-06-18 20:36:27,162 INFO [train.py:996] (1/4) Epoch 2, batch 19200, loss[loss=0.3052, simple_loss=0.3952, pruned_loss=0.1076, over 21652.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3677, pruned_loss=0.1255, over 4267739.64 frames. ], batch size: 263, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:36:30,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=298170.0, ans=0.0 2023-06-18 20:36:50,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 20:37:58,665 INFO [train.py:996] (1/4) Epoch 2, batch 19250, loss[loss=0.3128, simple_loss=0.3852, pruned_loss=0.1202, over 19917.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3646, pruned_loss=0.1178, over 4272370.91 frames. ], batch size: 702, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:38:06,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=298470.0, ans=0.0 2023-06-18 20:38:09,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.933e+02 3.534e+02 4.386e+02 8.060e+02, threshold=7.069e+02, percent-clipped=0.0 2023-06-18 20:38:33,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-18 20:38:34,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=298530.0, ans=0.125 2023-06-18 20:38:48,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=298590.0, ans=0.125 2023-06-18 20:39:34,547 INFO [train.py:996] (1/4) Epoch 2, batch 19300, loss[loss=0.2915, simple_loss=0.3571, pruned_loss=0.113, over 21069.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3617, pruned_loss=0.1182, over 4277582.03 frames. ], batch size: 608, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:39:34,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=298770.0, ans=0.125 2023-06-18 20:39:49,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=298770.0, ans=10.0 2023-06-18 20:40:26,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. 
limit=6.0 2023-06-18 20:40:27,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=298890.0, ans=0.125 2023-06-18 20:40:50,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=298950.0, ans=0.125 2023-06-18 20:41:17,371 INFO [train.py:996] (1/4) Epoch 2, batch 19350, loss[loss=0.2901, simple_loss=0.3571, pruned_loss=0.1116, over 21703.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3559, pruned_loss=0.1135, over 4276870.81 frames. ], batch size: 351, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:41:27,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=299070.0, ans=0.0 2023-06-18 20:41:28,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.011e+02 3.707e+02 4.151e+02 9.500e+02, threshold=7.414e+02, percent-clipped=4.0 2023-06-18 20:41:39,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5 2023-06-18 20:41:40,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=299130.0, ans=0.0 2023-06-18 20:41:41,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=299130.0, ans=0.2 2023-06-18 20:41:48,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=299130.0, ans=0.125 2023-06-18 20:42:02,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-18 20:42:46,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=299310.0, ans=0.125 2023-06-18 20:42:54,482 INFO [train.py:996] (1/4) Epoch 2, batch 19400, loss[loss=0.3334, simple_loss=0.3822, pruned_loss=0.1423, over 21812.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3546, pruned_loss=0.1131, over 4272223.16 frames. ], batch size: 391, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:43:22,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=299430.0, ans=0.1 2023-06-18 20:43:27,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-18 20:43:33,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=299490.0, ans=0.125 2023-06-18 20:43:42,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=299490.0, ans=0.05 2023-06-18 20:44:16,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=299610.0, ans=0.125 2023-06-18 20:44:25,378 INFO [train.py:996] (1/4) Epoch 2, batch 19450, loss[loss=0.2998, simple_loss=0.3382, pruned_loss=0.1307, over 21290.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3533, pruned_loss=0.117, over 4275345.77 frames. 
], batch size: 144, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:44:36,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.184e+02 3.692e+02 4.694e+02 9.525e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 20:44:50,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=299730.0, ans=0.05 2023-06-18 20:45:09,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=299790.0, ans=10.0 2023-06-18 20:45:12,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=299790.0, ans=0.125 2023-06-18 20:45:27,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=299850.0, ans=0.0 2023-06-18 20:45:35,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=299910.0, ans=0.95 2023-06-18 20:46:07,386 INFO [train.py:996] (1/4) Epoch 2, batch 19500, loss[loss=0.26, simple_loss=0.3246, pruned_loss=0.09776, over 21682.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3491, pruned_loss=0.1182, over 4273295.01 frames. ], batch size: 298, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:46:09,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=299970.0, ans=0.04949747468305833 2023-06-18 20:46:09,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=299970.0, ans=0.0 2023-06-18 20:46:25,326 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:46:50,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=300090.0, ans=0.125 2023-06-18 20:47:27,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=300210.0, ans=0.125 2023-06-18 20:47:45,721 INFO [train.py:996] (1/4) Epoch 2, batch 19550, loss[loss=0.2395, simple_loss=0.3073, pruned_loss=0.08589, over 21760.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3441, pruned_loss=0.1156, over 4273314.41 frames. ], batch size: 124, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:47:46,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=300270.0, ans=0.07 2023-06-18 20:47:51,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.032e+02 3.762e+02 4.813e+02 9.306e+02, threshold=7.523e+02, percent-clipped=3.0 2023-06-18 20:47:52,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=300270.0, ans=0.125 2023-06-18 20:49:00,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-18 20:49:17,046 INFO [train.py:996] (1/4) Epoch 2, batch 19600, loss[loss=0.3193, simple_loss=0.3724, pruned_loss=0.1331, over 21888.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3448, pruned_loss=0.1154, over 4280369.21 frames. 
], batch size: 124, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:49:25,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0 2023-06-18 20:50:33,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-18 20:50:47,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=22.5 2023-06-18 20:50:49,880 INFO [train.py:996] (1/4) Epoch 2, batch 19650, loss[loss=0.3005, simple_loss=0.3572, pruned_loss=0.1219, over 21674.00 frames. ], tot_loss[loss=0.298, simple_loss=0.3526, pruned_loss=0.1217, over 4288858.68 frames. ], batch size: 389, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:50:53,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=300870.0, ans=0.125 2023-06-18 20:50:55,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.349e+02 4.086e+02 5.431e+02 7.953e+02, threshold=8.171e+02, percent-clipped=1.0 2023-06-18 20:52:23,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=301170.0, ans=0.125 2023-06-18 20:52:24,573 INFO [train.py:996] (1/4) Epoch 2, batch 19700, loss[loss=0.247, simple_loss=0.3184, pruned_loss=0.08785, over 21428.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3558, pruned_loss=0.1216, over 4284364.98 frames. ], batch size: 211, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:52:30,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-18 20:53:06,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=301290.0, ans=0.0 2023-06-18 20:53:58,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=15.0 2023-06-18 20:54:03,048 INFO [train.py:996] (1/4) Epoch 2, batch 19750, loss[loss=0.2871, simple_loss=0.3401, pruned_loss=0.117, over 21874.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3662, pruned_loss=0.1235, over 4289435.82 frames. ], batch size: 107, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:54:09,458 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.167e+02 3.934e+02 5.557e+02 1.096e+03, threshold=7.868e+02, percent-clipped=5.0 2023-06-18 20:54:40,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-18 20:54:46,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=12.0 2023-06-18 20:55:40,454 INFO [train.py:996] (1/4) Epoch 2, batch 19800, loss[loss=0.2955, simple_loss=0.3421, pruned_loss=0.1245, over 21380.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3635, pruned_loss=0.1231, over 4288943.08 frames. 
], batch size: 131, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:55:42,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=301770.0, ans=0.125 2023-06-18 20:55:53,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=301770.0, ans=0.0 2023-06-18 20:56:02,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=301830.0, ans=0.125 2023-06-18 20:56:32,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=301890.0, ans=0.125 2023-06-18 20:56:32,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=301890.0, ans=0.125 2023-06-18 20:57:00,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.09 vs. limit=22.5 2023-06-18 20:57:22,412 INFO [train.py:996] (1/4) Epoch 2, batch 19850, loss[loss=0.2646, simple_loss=0.3367, pruned_loss=0.09628, over 21455.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3558, pruned_loss=0.1172, over 4283402.20 frames. ], batch size: 211, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:57:24,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=302070.0, ans=0.125 2023-06-18 20:57:28,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.803e+02 3.716e+02 4.795e+02 8.783e+02, threshold=7.432e+02, percent-clipped=5.0 2023-06-18 20:57:30,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=302070.0, ans=0.125 2023-06-18 20:58:43,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-18 20:58:50,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=302310.0, ans=0.0 2023-06-18 20:59:00,045 INFO [train.py:996] (1/4) Epoch 2, batch 19900, loss[loss=0.2423, simple_loss=0.3055, pruned_loss=0.08948, over 21673.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3551, pruned_loss=0.114, over 4284126.13 frames. ], batch size: 282, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 20:59:12,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=302370.0, ans=0.0 2023-06-18 20:59:47,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=302490.0, ans=0.1 2023-06-18 21:00:07,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.00 vs. limit=5.0 2023-06-18 21:00:11,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=302550.0, ans=0.0 2023-06-18 21:00:23,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=302610.0, ans=0.125 2023-06-18 21:00:35,670 INFO [train.py:996] (1/4) Epoch 2, batch 19950, loss[loss=0.2304, simple_loss=0.2828, pruned_loss=0.08898, over 21516.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3486, pruned_loss=0.1121, over 4280916.10 frames. 
], batch size: 195, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:00:46,804 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.768e+02 3.439e+02 5.262e+02 1.066e+03, threshold=6.877e+02, percent-clipped=5.0 2023-06-18 21:00:59,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=302730.0, ans=0.125 2023-06-18 21:01:15,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=302730.0, ans=0.0 2023-06-18 21:01:33,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=302790.0, ans=0.07 2023-06-18 21:01:55,204 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:02:16,318 INFO [train.py:996] (1/4) Epoch 2, batch 20000, loss[loss=0.3053, simple_loss=0.3605, pruned_loss=0.125, over 21867.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3494, pruned_loss=0.1134, over 4281541.26 frames. ], batch size: 332, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:20,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=303150.0, ans=0.0 2023-06-18 21:03:46,341 INFO [train.py:996] (1/4) Epoch 2, batch 20050, loss[loss=0.3269, simple_loss=0.3621, pruned_loss=0.1458, over 20043.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3536, pruned_loss=0.1181, over 4289521.13 frames. ], batch size: 702, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:56,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.190e+02 3.752e+02 4.909e+02 8.771e+02, threshold=7.503e+02, percent-clipped=6.0 2023-06-18 21:04:09,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303330.0, ans=0.1 2023-06-18 21:05:33,689 INFO [train.py:996] (1/4) Epoch 2, batch 20100, loss[loss=0.2854, simple_loss=0.3629, pruned_loss=0.104, over 21430.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3574, pruned_loss=0.1212, over 4291284.02 frames. ], batch size: 211, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:05:34,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=303570.0, ans=0.1 2023-06-18 21:05:53,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=303630.0, ans=0.0 2023-06-18 21:06:13,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=303690.0, ans=0.125 2023-06-18 21:06:35,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=303750.0, ans=0.125 2023-06-18 21:06:57,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-18 21:07:17,556 INFO [train.py:996] (1/4) Epoch 2, batch 20150, loss[loss=0.3407, simple_loss=0.3894, pruned_loss=0.146, over 20686.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3685, pruned_loss=0.1259, over 4290714.89 frames. 
], batch size: 607, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:07:24,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.342e+02 4.169e+02 5.156e+02 8.825e+02, threshold=8.338e+02, percent-clipped=3.0 2023-06-18 21:08:31,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=304050.0, ans=0.2 2023-06-18 21:08:58,612 INFO [train.py:996] (1/4) Epoch 2, batch 20200, loss[loss=0.3068, simple_loss=0.3934, pruned_loss=0.1101, over 21819.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3753, pruned_loss=0.1295, over 4280845.73 frames. ], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:09:41,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=304290.0, ans=0.0 2023-06-18 21:10:23,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=304410.0, ans=0.125 2023-06-18 21:10:35,637 INFO [train.py:996] (1/4) Epoch 2, batch 20250, loss[loss=0.3221, simple_loss=0.3668, pruned_loss=0.1387, over 21479.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3747, pruned_loss=0.1266, over 4280091.45 frames. ], batch size: 548, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:10:39,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=304470.0, ans=0.2 2023-06-18 21:10:40,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=304470.0, ans=0.125 2023-06-18 21:10:41,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-18 21:10:41,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 3.323e+02 4.182e+02 5.137e+02 1.003e+03, threshold=8.365e+02, percent-clipped=2.0 2023-06-18 21:10:47,052 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:11:35,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=304590.0, ans=0.125 2023-06-18 21:11:57,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=304710.0, ans=0.125 2023-06-18 21:12:09,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=304710.0, ans=0.125 2023-06-18 21:12:09,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=304710.0, ans=0.125 2023-06-18 21:12:13,276 INFO [train.py:996] (1/4) Epoch 2, batch 20300, loss[loss=0.2572, simple_loss=0.349, pruned_loss=0.08268, over 19916.00 frames. ], tot_loss[loss=0.3081, simple_loss=0.3706, pruned_loss=0.1228, over 4274089.39 frames. ], batch size: 703, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:12:27,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=304770.0, ans=0.2 2023-06-18 21:12:29,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. 
limit=15.0 2023-06-18 21:13:16,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=304950.0, ans=0.125 2023-06-18 21:13:48,915 INFO [train.py:996] (1/4) Epoch 2, batch 20350, loss[loss=0.314, simple_loss=0.3632, pruned_loss=0.1324, over 21895.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3696, pruned_loss=0.1231, over 4278203.38 frames. ], batch size: 316, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:13:50,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-18 21:13:55,470 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.188e+02 3.897e+02 4.973e+02 9.485e+02, threshold=7.794e+02, percent-clipped=2.0 2023-06-18 21:14:06,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=305070.0, ans=0.2 2023-06-18 21:14:17,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=305130.0, ans=0.125 2023-06-18 21:14:41,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-18 21:15:26,545 INFO [train.py:996] (1/4) Epoch 2, batch 20400, loss[loss=0.3224, simple_loss=0.3759, pruned_loss=0.1344, over 21456.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3743, pruned_loss=0.1278, over 4284179.19 frames. ], batch size: 211, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:15:46,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=305430.0, ans=0.125 2023-06-18 21:16:45,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=305610.0, ans=0.125 2023-06-18 21:17:02,142 INFO [train.py:996] (1/4) Epoch 2, batch 20450, loss[loss=0.316, simple_loss=0.3638, pruned_loss=0.1341, over 21858.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3752, pruned_loss=0.1311, over 4285398.58 frames. ], batch size: 107, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:17:07,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.641e+02 3.573e+02 4.608e+02 6.565e+02 1.538e+03, threshold=9.216e+02, percent-clipped=19.0 2023-06-18 21:17:18,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=305670.0, ans=0.125 2023-06-18 21:17:46,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-18 21:18:33,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-18 21:18:37,633 INFO [train.py:996] (1/4) Epoch 2, batch 20500, loss[loss=0.3537, simple_loss=0.4474, pruned_loss=0.13, over 19846.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3724, pruned_loss=0.1324, over 4281740.65 frames. 
], batch size: 702, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:18:51,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=305970.0, ans=0.0 2023-06-18 21:19:31,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-18 21:19:33,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306090.0, ans=0.1 2023-06-18 21:19:37,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=306150.0, ans=0.2 2023-06-18 21:19:38,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=306150.0, ans=0.2 2023-06-18 21:20:19,075 INFO [train.py:996] (1/4) Epoch 2, batch 20550, loss[loss=0.2541, simple_loss=0.3022, pruned_loss=0.103, over 21077.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3638, pruned_loss=0.1294, over 4279747.84 frames. ], batch size: 143, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:20:25,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.461e+02 4.145e+02 5.402e+02 8.194e+02, threshold=8.291e+02, percent-clipped=0.0 2023-06-18 21:20:30,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=306270.0, ans=0.2 2023-06-18 21:21:09,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=306390.0, ans=0.125 2023-06-18 21:21:37,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=306510.0, ans=0.0 2023-06-18 21:21:55,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=306570.0, ans=0.0 2023-06-18 21:21:56,395 INFO [train.py:996] (1/4) Epoch 2, batch 20600, loss[loss=0.3148, simple_loss=0.3789, pruned_loss=0.1254, over 21800.00 frames. ], tot_loss[loss=0.308, simple_loss=0.3645, pruned_loss=0.1258, over 4280834.22 frames. ], batch size: 332, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:22:03,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.17 vs. limit=15.0 2023-06-18 21:22:09,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=306570.0, ans=0.125 2023-06-18 21:23:08,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306750.0, ans=0.1 2023-06-18 21:23:15,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-18 21:23:30,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306810.0, ans=0.125 2023-06-18 21:23:31,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306870.0, ans=0.125 2023-06-18 21:23:32,690 INFO [train.py:996] (1/4) Epoch 2, batch 20650, loss[loss=0.2709, simple_loss=0.3151, pruned_loss=0.1134, over 21072.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3603, pruned_loss=0.1259, over 4274889.37 frames. 
], batch size: 608, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:23:38,854 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.072e+02 3.756e+02 5.105e+02 7.352e+02, threshold=7.512e+02, percent-clipped=0.0 2023-06-18 21:23:47,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306930.0, ans=0.125 2023-06-18 21:24:15,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-18 21:24:43,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307050.0, ans=0.1 2023-06-18 21:24:48,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.86 vs. limit=8.0 2023-06-18 21:24:54,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307110.0, ans=0.1 2023-06-18 21:25:12,200 INFO [train.py:996] (1/4) Epoch 2, batch 20700, loss[loss=0.3227, simple_loss=0.3602, pruned_loss=0.1427, over 20103.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3519, pruned_loss=0.1204, over 4261989.28 frames. ], batch size: 703, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:25:19,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=307170.0, ans=0.0 2023-06-18 21:25:39,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=307230.0, ans=0.125 2023-06-18 21:26:29,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=8.0 2023-06-18 21:26:49,437 INFO [train.py:996] (1/4) Epoch 2, batch 20750, loss[loss=0.3548, simple_loss=0.4365, pruned_loss=0.1366, over 21689.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3521, pruned_loss=0.1187, over 4244876.73 frames. ], batch size: 414, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:26:50,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=307470.0, ans=0.125 2023-06-18 21:26:57,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.977e+02 3.559e+02 4.590e+02 7.850e+02, threshold=7.118e+02, percent-clipped=2.0 2023-06-18 21:27:11,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=307530.0, ans=0.0 2023-06-18 21:27:23,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=307530.0, ans=0.125 2023-06-18 21:27:31,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=307530.0, ans=0.2 2023-06-18 21:27:42,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=307590.0, ans=0.125 2023-06-18 21:28:26,172 INFO [train.py:996] (1/4) Epoch 2, batch 20800, loss[loss=0.2868, simple_loss=0.3345, pruned_loss=0.1196, over 21454.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3549, pruned_loss=0.1196, over 4246209.06 frames. 
], batch size: 389, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:28:45,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=307830.0, ans=0.125 2023-06-18 21:29:47,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=308010.0, ans=0.0 2023-06-18 21:30:02,860 INFO [train.py:996] (1/4) Epoch 2, batch 20850, loss[loss=0.1907, simple_loss=0.262, pruned_loss=0.05971, over 21568.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3479, pruned_loss=0.1171, over 4247394.51 frames. ], batch size: 230, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:30:06,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=308070.0, ans=0.0 2023-06-18 21:30:06,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.40 vs. limit=12.0 2023-06-18 21:30:16,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.417e+02 4.186e+02 5.469e+02 9.109e+02, threshold=8.373e+02, percent-clipped=11.0 2023-06-18 21:30:38,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=308130.0, ans=0.0 2023-06-18 21:31:22,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=308310.0, ans=0.125 2023-06-18 21:31:37,918 INFO [train.py:996] (1/4) Epoch 2, batch 20900, loss[loss=0.2656, simple_loss=0.3374, pruned_loss=0.09691, over 21726.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3486, pruned_loss=0.1191, over 4262575.61 frames. ], batch size: 298, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:31:47,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=308370.0, ans=0.0 2023-06-18 21:31:55,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=308370.0, ans=0.0 2023-06-18 21:31:58,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=308430.0, ans=0.125 2023-06-18 21:32:14,436 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:32:55,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=308610.0, ans=0.125 2023-06-18 21:32:57,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-18 21:33:12,545 INFO [train.py:996] (1/4) Epoch 2, batch 20950, loss[loss=0.2442, simple_loss=0.3104, pruned_loss=0.08899, over 21701.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3408, pruned_loss=0.1132, over 4264997.38 frames. 
], batch size: 298, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:33:21,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.051e+02 3.719e+02 4.723e+02 9.435e+02, threshold=7.438e+02, percent-clipped=1.0 2023-06-18 21:34:00,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=308790.0, ans=0.2 2023-06-18 21:34:44,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-18 21:34:48,316 INFO [train.py:996] (1/4) Epoch 2, batch 21000, loss[loss=0.3319, simple_loss=0.3894, pruned_loss=0.1372, over 21729.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3401, pruned_loss=0.1139, over 4249106.40 frames. ], batch size: 389, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:34:48,316 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 21:35:04,490 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2933, simple_loss=0.3899, pruned_loss=0.09838, over 1796401.00 frames. 2023-06-18 21:35:04,491 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 21:35:28,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=309030.0, ans=0.1 2023-06-18 21:36:08,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=309150.0, ans=0.0 2023-06-18 21:36:36,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=309210.0, ans=0.0 2023-06-18 21:36:36,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=309210.0, ans=0.025 2023-06-18 21:36:40,612 INFO [train.py:996] (1/4) Epoch 2, batch 21050, loss[loss=0.2973, simple_loss=0.3441, pruned_loss=0.1252, over 21848.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3386, pruned_loss=0.1148, over 4250035.75 frames. ], batch size: 98, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:36:55,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.843e+02 3.389e+02 4.157e+02 8.301e+02, threshold=6.779e+02, percent-clipped=3.0 2023-06-18 21:36:55,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=309270.0, ans=0.05 2023-06-18 21:36:57,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=309270.0, ans=0.015 2023-06-18 21:37:09,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.49 vs. limit=10.0 2023-06-18 21:38:16,141 INFO [train.py:996] (1/4) Epoch 2, batch 21100, loss[loss=0.2369, simple_loss=0.2761, pruned_loss=0.09885, over 20790.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3363, pruned_loss=0.1145, over 4244566.58 frames. ], batch size: 608, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:39:01,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. 
limit=22.5 2023-06-18 21:39:05,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=309690.0, ans=0.125 2023-06-18 21:39:51,664 INFO [train.py:996] (1/4) Epoch 2, batch 21150, loss[loss=0.2636, simple_loss=0.304, pruned_loss=0.1116, over 21541.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3324, pruned_loss=0.1148, over 4243634.37 frames. ], batch size: 212, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:39:53,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=309870.0, ans=0.2 2023-06-18 21:39:55,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=309870.0, ans=0.125 2023-06-18 21:40:05,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.856e+02 3.188e+02 4.098e+02 8.101e+02, threshold=6.375e+02, percent-clipped=2.0 2023-06-18 21:40:49,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=309990.0, ans=0.125 2023-06-18 21:41:15,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=310110.0, ans=0.125 2023-06-18 21:41:21,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=310110.0, ans=0.2 2023-06-18 21:41:27,062 INFO [train.py:996] (1/4) Epoch 2, batch 21200, loss[loss=0.2647, simple_loss=0.3056, pruned_loss=0.1119, over 21184.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3279, pruned_loss=0.1139, over 4240287.05 frames. ], batch size: 159, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:42:07,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310290.0, ans=0.125 2023-06-18 21:42:08,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=310290.0, ans=0.125 2023-06-18 21:42:32,491 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:43:02,666 INFO [train.py:996] (1/4) Epoch 2, batch 21250, loss[loss=0.247, simple_loss=0.2949, pruned_loss=0.09953, over 21647.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3247, pruned_loss=0.1122, over 4242052.29 frames. ], batch size: 282, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:43:11,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 2.991e+02 3.536e+02 4.577e+02 9.525e+02, threshold=7.072e+02, percent-clipped=11.0 2023-06-18 21:43:31,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=310530.0, ans=0.0 2023-06-18 21:44:38,886 INFO [train.py:996] (1/4) Epoch 2, batch 21300, loss[loss=0.3569, simple_loss=0.3928, pruned_loss=0.1605, over 21796.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3346, pruned_loss=0.1164, over 4256255.64 frames. 
], batch size: 441, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:44:46,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=310770.0, ans=0.125 2023-06-18 21:45:16,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=310830.0, ans=0.0 2023-06-18 21:46:04,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=311010.0, ans=0.1 2023-06-18 21:46:16,017 INFO [train.py:996] (1/4) Epoch 2, batch 21350, loss[loss=0.242, simple_loss=0.3279, pruned_loss=0.07811, over 21769.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3414, pruned_loss=0.118, over 4266085.10 frames. ], batch size: 298, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:46:19,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=311070.0, ans=0.125 2023-06-18 21:46:30,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.675e+02 5.052e+02 5.900e+02 9.607e+02, threshold=1.010e+03, percent-clipped=12.0 2023-06-18 21:46:58,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-18 21:47:01,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=311190.0, ans=0.125 2023-06-18 21:47:02,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=311190.0, ans=0.125 2023-06-18 21:47:05,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=311190.0, ans=0.0 2023-06-18 21:47:30,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-18 21:47:47,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=311310.0, ans=0.125 2023-06-18 21:47:53,565 INFO [train.py:996] (1/4) Epoch 2, batch 21400, loss[loss=0.305, simple_loss=0.3598, pruned_loss=0.1251, over 21800.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3443, pruned_loss=0.117, over 4271814.91 frames. ], batch size: 247, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:48:14,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=311370.0, ans=0.125 2023-06-18 21:48:37,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=15.0 2023-06-18 21:49:02,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=311550.0, ans=0.125 2023-06-18 21:49:05,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=311550.0, ans=0.02 2023-06-18 21:49:05,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=311550.0, ans=0.0 2023-06-18 21:49:09,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=311550.0, ans=0.2 2023-06-18 21:49:15,327 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:49:34,472 INFO [train.py:996] (1/4) Epoch 2, batch 21450, loss[loss=0.3155, simple_loss=0.3639, pruned_loss=0.1335, over 21805.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3492, pruned_loss=0.1197, over 4276062.17 frames. ], batch size: 298, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:49:48,712 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.034e+02 3.817e+02 5.220e+02 1.129e+03, threshold=7.634e+02, percent-clipped=2.0 2023-06-18 21:50:37,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=311850.0, ans=0.2 2023-06-18 21:50:38,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=311850.0, ans=0.125 2023-06-18 21:51:15,121 INFO [train.py:996] (1/4) Epoch 2, batch 21500, loss[loss=0.2708, simple_loss=0.311, pruned_loss=0.1154, over 21461.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3481, pruned_loss=0.1211, over 4274889.94 frames. ], batch size: 195, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:51:41,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=312030.0, ans=0.0 2023-06-18 21:51:55,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=312090.0, ans=0.0 2023-06-18 21:52:30,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-18 21:52:43,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=312210.0, ans=0.2 2023-06-18 21:52:51,486 INFO [train.py:996] (1/4) Epoch 2, batch 21550, loss[loss=0.2578, simple_loss=0.3129, pruned_loss=0.1013, over 21728.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3394, pruned_loss=0.1171, over 4258193.59 frames. ], batch size: 351, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:52:51,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=312270.0, ans=0.125 2023-06-18 21:53:05,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.167e+02 4.080e+02 5.090e+02 8.174e+02, threshold=8.161e+02, percent-clipped=3.0 2023-06-18 21:53:43,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=312390.0, ans=0.0 2023-06-18 21:54:28,551 INFO [train.py:996] (1/4) Epoch 2, batch 21600, loss[loss=0.3366, simple_loss=0.3667, pruned_loss=0.1532, over 21222.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3349, pruned_loss=0.1145, over 4264674.22 frames. 
], batch size: 471, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:54:49,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=312630.0, ans=0.0 2023-06-18 21:55:17,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2023-06-18 21:56:04,245 INFO [train.py:996] (1/4) Epoch 2, batch 21650, loss[loss=0.2324, simple_loss=0.2865, pruned_loss=0.08913, over 16496.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3379, pruned_loss=0.1115, over 4266296.21 frames. ], batch size: 62, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:56:18,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.957e+02 3.372e+02 4.177e+02 8.367e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-18 21:56:31,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=312930.0, ans=0.05 2023-06-18 21:56:50,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=312990.0, ans=0.1 2023-06-18 21:56:58,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-18 21:57:25,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=313110.0, ans=0.125 2023-06-18 21:57:34,152 INFO [train.py:996] (1/4) Epoch 2, batch 21700, loss[loss=0.2734, simple_loss=0.3267, pruned_loss=0.11, over 21498.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3388, pruned_loss=0.1091, over 4273581.88 frames. ], batch size: 389, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:57:56,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=313170.0, ans=0.125 2023-06-18 21:57:56,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=313170.0, ans=0.125 2023-06-18 21:58:00,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=313230.0, ans=0.125 2023-06-18 21:58:05,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-18 21:58:06,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=313230.0, ans=0.0 2023-06-18 21:58:35,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=313350.0, ans=0.125 2023-06-18 21:58:37,110 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:58:41,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=313350.0, ans=0.95 2023-06-18 21:58:54,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=313410.0, ans=0.125 2023-06-18 21:59:09,010 INFO [train.py:996] (1/4) Epoch 2, batch 21750, loss[loss=0.2959, simple_loss=0.3259, pruned_loss=0.1329, over 21565.00 frames. 
], tot_loss[loss=0.2771, simple_loss=0.3345, pruned_loss=0.1098, over 4267768.51 frames. ], batch size: 195, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:59:28,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.954e+02 3.507e+02 4.548e+02 1.201e+03, threshold=7.014e+02, percent-clipped=5.0 2023-06-18 21:59:40,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=313530.0, ans=0.035 2023-06-18 22:00:36,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=313710.0, ans=0.2 2023-06-18 22:00:50,248 INFO [train.py:996] (1/4) Epoch 2, batch 21800, loss[loss=0.3109, simple_loss=0.3533, pruned_loss=0.1342, over 21305.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3343, pruned_loss=0.1121, over 4271844.38 frames. ], batch size: 160, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:00:53,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.84 vs. limit=15.0 2023-06-18 22:01:15,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=313830.0, ans=0.0 2023-06-18 22:01:51,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=313950.0, ans=0.0 2023-06-18 22:01:54,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=313950.0, ans=0.05 2023-06-18 22:02:13,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=314010.0, ans=0.1 2023-06-18 22:02:26,276 INFO [train.py:996] (1/4) Epoch 2, batch 21850, loss[loss=0.3232, simple_loss=0.3747, pruned_loss=0.1358, over 21483.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3387, pruned_loss=0.1121, over 4242237.15 frames. ], batch size: 548, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:02:28,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=314070.0, ans=0.0 2023-06-18 22:02:40,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.114e+02 3.894e+02 4.746e+02 8.265e+02, threshold=7.787e+02, percent-clipped=3.0 2023-06-18 22:02:57,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=314130.0, ans=0.125 2023-06-18 22:03:20,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=314250.0, ans=0.0 2023-06-18 22:04:06,606 INFO [train.py:996] (1/4) Epoch 2, batch 21900, loss[loss=0.3674, simple_loss=0.3714, pruned_loss=0.1818, over 21447.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3407, pruned_loss=0.1138, over 4247210.42 frames. ], batch size: 508, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:04:08,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. 
limit=15.0 2023-06-18 22:04:26,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=314430.0, ans=0.5 2023-06-18 22:04:32,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=314430.0, ans=0.0 2023-06-18 22:04:56,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.61 vs. limit=22.5 2023-06-18 22:04:58,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=314550.0, ans=0.2 2023-06-18 22:05:12,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=314550.0, ans=0.2 2023-06-18 22:05:14,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=314550.0, ans=0.2 2023-06-18 22:05:36,725 INFO [train.py:996] (1/4) Epoch 2, batch 21950, loss[loss=0.2188, simple_loss=0.3053, pruned_loss=0.06611, over 21197.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3351, pruned_loss=0.1126, over 4250742.00 frames. ], batch size: 548, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:05:50,667 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.916e+02 3.382e+02 4.376e+02 8.385e+02, threshold=6.764e+02, percent-clipped=2.0 2023-06-18 22:05:52,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=314670.0, ans=0.0 2023-06-18 22:05:59,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-18 22:06:32,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=314850.0, ans=0.125 2023-06-18 22:07:10,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=314910.0, ans=0.0 2023-06-18 22:07:14,890 INFO [train.py:996] (1/4) Epoch 2, batch 22000, loss[loss=0.2145, simple_loss=0.286, pruned_loss=0.07147, over 21714.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3278, pruned_loss=0.1079, over 4262769.53 frames. ], batch size: 282, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:07:35,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-18 22:08:07,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=315090.0, ans=0.125 2023-06-18 22:08:14,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.79 vs. limit=6.0 2023-06-18 22:08:16,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=315150.0, ans=0.0 2023-06-18 22:09:01,060 INFO [train.py:996] (1/4) Epoch 2, batch 22050, loss[loss=0.3505, simple_loss=0.4094, pruned_loss=0.1458, over 21651.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3329, pruned_loss=0.1101, over 4263495.09 frames. 
], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:09:03,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315270.0, ans=0.1 2023-06-18 22:09:08,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-18 22:09:10,803 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 3.309e+02 5.007e+02 6.581e+02 1.076e+03, threshold=1.001e+03, percent-clipped=24.0 2023-06-18 22:10:24,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=315510.0, ans=10.0 2023-06-18 22:10:39,896 INFO [train.py:996] (1/4) Epoch 2, batch 22100, loss[loss=0.353, simple_loss=0.4643, pruned_loss=0.1209, over 19866.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3472, pruned_loss=0.1172, over 4261220.84 frames. ], batch size: 702, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:10:44,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=315570.0, ans=0.125 2023-06-18 22:11:21,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=315690.0, ans=0.1 2023-06-18 22:11:47,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-18 22:11:48,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-18 22:11:53,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-18 22:12:11,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=315810.0, ans=0.125 2023-06-18 22:12:17,283 INFO [train.py:996] (1/4) Epoch 2, batch 22150, loss[loss=0.3285, simple_loss=0.3756, pruned_loss=0.1407, over 21774.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3508, pruned_loss=0.1201, over 4270265.05 frames. ], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:12:26,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.397e+02 4.124e+02 4.864e+02 1.101e+03, threshold=8.247e+02, percent-clipped=1.0 2023-06-18 22:12:34,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315870.0, ans=0.1 2023-06-18 22:12:54,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=315990.0, ans=0.0 2023-06-18 22:12:55,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=315990.0, ans=0.125 2023-06-18 22:13:10,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-18 22:13:52,410 INFO [train.py:996] (1/4) Epoch 2, batch 22200, loss[loss=0.2953, simple_loss=0.3641, pruned_loss=0.1132, over 21775.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3532, pruned_loss=0.1224, over 4270636.45 frames. 
], batch size: 298, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:14:01,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=316170.0, ans=0.0 2023-06-18 22:14:17,746 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:14:22,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=316230.0, ans=0.0 2023-06-18 22:14:27,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=316290.0, ans=0.125 2023-06-18 22:14:27,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.49 vs. limit=22.5 2023-06-18 22:15:17,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.47 vs. limit=10.0 2023-06-18 22:15:33,129 INFO [train.py:996] (1/4) Epoch 2, batch 22250, loss[loss=0.328, simple_loss=0.3854, pruned_loss=0.1352, over 21459.00 frames. ], tot_loss[loss=0.3074, simple_loss=0.3634, pruned_loss=0.1256, over 4277129.38 frames. ], batch size: 194, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:15:33,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=316470.0, ans=0.125 2023-06-18 22:15:36,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=316470.0, ans=0.0 2023-06-18 22:15:43,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.999e+02 3.846e+02 4.976e+02 1.173e+03, threshold=7.692e+02, percent-clipped=5.0 2023-06-18 22:15:50,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=12.0 2023-06-18 22:16:01,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=15.0 2023-06-18 22:16:22,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316650.0, ans=0.1 2023-06-18 22:16:34,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=316650.0, ans=0.0 2023-06-18 22:17:08,372 INFO [train.py:996] (1/4) Epoch 2, batch 22300, loss[loss=0.3455, simple_loss=0.3795, pruned_loss=0.1558, over 21895.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3663, pruned_loss=0.1286, over 4266595.24 frames. ], batch size: 351, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:17:38,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=316890.0, ans=0.2 2023-06-18 22:18:26,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=317010.0, ans=0.0 2023-06-18 22:18:42,788 INFO [train.py:996] (1/4) Epoch 2, batch 22350, loss[loss=0.2491, simple_loss=0.3028, pruned_loss=0.09769, over 21233.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3635, pruned_loss=0.1285, over 4280503.38 frames. 
], batch size: 608, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:18:53,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.080e+02 3.447e+02 4.334e+02 7.080e+02, threshold=6.895e+02, percent-clipped=0.0 2023-06-18 22:18:56,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=317070.0, ans=0.0 2023-06-18 22:20:04,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=317310.0, ans=0.0 2023-06-18 22:20:16,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=317310.0, ans=0.125 2023-06-18 22:20:19,399 INFO [train.py:996] (1/4) Epoch 2, batch 22400, loss[loss=0.2618, simple_loss=0.315, pruned_loss=0.1043, over 21695.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3602, pruned_loss=0.1245, over 4283935.21 frames. ], batch size: 112, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:20:19,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=317370.0, ans=0.125 2023-06-18 22:21:03,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=317490.0, ans=0.1 2023-06-18 22:21:33,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=317610.0, ans=0.125 2023-06-18 22:21:44,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-18 22:21:49,906 INFO [train.py:996] (1/4) Epoch 2, batch 22450, loss[loss=0.3326, simple_loss=0.3535, pruned_loss=0.1558, over 21302.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3549, pruned_loss=0.1239, over 4270883.55 frames. ], batch size: 473, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:21:57,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=317670.0, ans=0.0 2023-06-18 22:22:00,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.085e+02 3.597e+02 4.516e+02 1.181e+03, threshold=7.194e+02, percent-clipped=2.0 2023-06-18 22:22:27,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=317790.0, ans=0.025 2023-06-18 22:23:28,608 INFO [train.py:996] (1/4) Epoch 2, batch 22500, loss[loss=0.325, simple_loss=0.3815, pruned_loss=0.1342, over 21258.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3483, pruned_loss=0.1224, over 4275753.79 frames. ], batch size: 549, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:23:43,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=317970.0, ans=0.2 2023-06-18 22:24:11,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318090.0, ans=0.125 2023-06-18 22:24:59,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318210.0, ans=0.1 2023-06-18 22:25:10,115 INFO [train.py:996] (1/4) Epoch 2, batch 22550, loss[loss=0.2653, simple_loss=0.3007, pruned_loss=0.115, over 20310.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3498, pruned_loss=0.1216, over 4274167.47 frames. 
], batch size: 703, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:25:10,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=318270.0, ans=0.125 2023-06-18 22:25:26,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.552e+02 4.294e+02 6.006e+02 1.237e+03, threshold=8.588e+02, percent-clipped=14.0 2023-06-18 22:25:27,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.87 vs. limit=15.0 2023-06-18 22:25:58,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=318390.0, ans=0.0 2023-06-18 22:26:10,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=318390.0, ans=0.125 2023-06-18 22:26:40,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-18 22:26:51,924 INFO [train.py:996] (1/4) Epoch 2, batch 22600, loss[loss=0.3035, simple_loss=0.368, pruned_loss=0.1195, over 21738.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3534, pruned_loss=0.1215, over 4278066.56 frames. ], batch size: 351, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:27:44,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=318690.0, ans=0.125 2023-06-18 22:28:29,428 INFO [train.py:996] (1/4) Epoch 2, batch 22650, loss[loss=0.2662, simple_loss=0.3082, pruned_loss=0.1121, over 21496.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3491, pruned_loss=0.1204, over 4280610.96 frames. ], batch size: 212, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:28:31,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318870.0, ans=0.125 2023-06-18 22:28:37,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318870.0, ans=0.1 2023-06-18 22:28:41,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 3.125e+02 3.820e+02 4.472e+02 8.562e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-18 22:29:42,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319050.0, ans=0.1 2023-06-18 22:29:49,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=319110.0, ans=0.0 2023-06-18 22:30:07,827 INFO [train.py:996] (1/4) Epoch 2, batch 22700, loss[loss=0.2915, simple_loss=0.3352, pruned_loss=0.1239, over 21870.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3428, pruned_loss=0.1194, over 4259570.42 frames. 
], batch size: 107, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:30:23,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=319230.0, ans=0.0 2023-06-18 22:31:16,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=319350.0, ans=0.2 2023-06-18 22:31:40,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=319410.0, ans=0.125 2023-06-18 22:31:46,685 INFO [train.py:996] (1/4) Epoch 2, batch 22750, loss[loss=0.339, simple_loss=0.3743, pruned_loss=0.1519, over 21805.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3447, pruned_loss=0.1229, over 4260145.09 frames. ], batch size: 247, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:31:59,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.020e+02 3.645e+02 4.350e+02 9.693e+02, threshold=7.290e+02, percent-clipped=3.0 2023-06-18 22:32:33,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=319590.0, ans=0.0 2023-06-18 22:32:42,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=319590.0, ans=0.1 2023-06-18 22:33:08,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=22.5 2023-06-18 22:33:24,560 INFO [train.py:996] (1/4) Epoch 2, batch 22800, loss[loss=0.2843, simple_loss=0.3454, pruned_loss=0.1116, over 21846.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3492, pruned_loss=0.1254, over 4273278.01 frames. ], batch size: 118, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:33:40,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=319770.0, ans=10.0 2023-06-18 22:33:43,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319830.0, ans=0.0 2023-06-18 22:34:13,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=319890.0, ans=0.1 2023-06-18 22:34:19,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=319890.0, ans=0.0 2023-06-18 22:34:26,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=319950.0, ans=0.0 2023-06-18 22:34:31,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=319950.0, ans=0.0 2023-06-18 22:34:50,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-18 22:34:56,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=320010.0, ans=0.125 2023-06-18 22:35:01,228 INFO [train.py:996] (1/4) Epoch 2, batch 22850, loss[loss=0.2974, simple_loss=0.3314, pruned_loss=0.1317, over 21099.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3473, pruned_loss=0.1247, over 4263988.04 frames. 
], batch size: 608, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:35:07,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=320070.0, ans=0.125 2023-06-18 22:35:09,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320070.0, ans=0.0 2023-06-18 22:35:18,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.149e+02 3.783e+02 4.416e+02 9.029e+02, threshold=7.565e+02, percent-clipped=2.0 2023-06-18 22:35:38,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=320130.0, ans=0.0 2023-06-18 22:36:32,428 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:36:35,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=320310.0, ans=0.125 2023-06-18 22:36:38,191 INFO [train.py:996] (1/4) Epoch 2, batch 22900, loss[loss=0.2466, simple_loss=0.3174, pruned_loss=0.0879, over 21291.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3481, pruned_loss=0.1231, over 4255858.85 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:37:20,955 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:37:30,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=320490.0, ans=0.125 2023-06-18 22:37:35,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-18 22:37:38,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=320550.0, ans=0.0 2023-06-18 22:38:16,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320610.0, ans=0.1 2023-06-18 22:38:19,517 INFO [train.py:996] (1/4) Epoch 2, batch 22950, loss[loss=0.2728, simple_loss=0.3869, pruned_loss=0.07935, over 21608.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3617, pruned_loss=0.1219, over 4262606.75 frames. 
], batch size: 389, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:38:31,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=320670.0, ans=0.1 2023-06-18 22:38:32,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 3.018e+02 3.691e+02 4.786e+02 9.826e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 22:38:52,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=320730.0, ans=0.2 2023-06-18 22:39:06,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=320790.0, ans=0.0 2023-06-18 22:39:30,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=320850.0, ans=0.0 2023-06-18 22:39:47,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=320910.0, ans=0.125 2023-06-18 22:39:53,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=320970.0, ans=0.0 2023-06-18 22:39:53,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=320970.0, ans=0.125 2023-06-18 22:39:54,954 INFO [train.py:996] (1/4) Epoch 2, batch 23000, loss[loss=0.256, simple_loss=0.3181, pruned_loss=0.09695, over 21451.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3607, pruned_loss=0.1186, over 4264259.85 frames. ], batch size: 211, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:40:23,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=321030.0, ans=0.0 2023-06-18 22:40:44,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=321090.0, ans=0.0 2023-06-18 22:40:50,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=321090.0, ans=0.125 2023-06-18 22:41:17,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=321210.0, ans=0.0 2023-06-18 22:41:32,456 INFO [train.py:996] (1/4) Epoch 2, batch 23050, loss[loss=0.3358, simple_loss=0.3854, pruned_loss=0.1431, over 21684.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3619, pruned_loss=0.1213, over 4264953.27 frames. ], batch size: 351, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:41:54,414 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-18 22:41:54,748 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.225e+02 4.039e+02 5.218e+02 8.181e+02, threshold=8.078e+02, percent-clipped=3.0 2023-06-18 22:42:25,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=321390.0, ans=0.125 2023-06-18 22:43:08,184 INFO [train.py:996] (1/4) Epoch 2, batch 23100, loss[loss=0.2716, simple_loss=0.3123, pruned_loss=0.1154, over 21298.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3558, pruned_loss=0.1207, over 4262728.13 frames. 
], batch size: 176, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:44:02,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=321690.0, ans=0.0 2023-06-18 22:44:35,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=321810.0, ans=0.0 2023-06-18 22:44:36,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.71 vs. limit=6.0 2023-06-18 22:44:37,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=321810.0, ans=0.0 2023-06-18 22:44:42,549 INFO [train.py:996] (1/4) Epoch 2, batch 23150, loss[loss=0.2775, simple_loss=0.3331, pruned_loss=0.1109, over 21916.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3501, pruned_loss=0.1203, over 4271056.47 frames. ], batch size: 316, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:44:52,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321870.0, ans=0.125 2023-06-18 22:44:57,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-18 22:44:58,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.109e+02 3.658e+02 4.384e+02 7.114e+02, threshold=7.315e+02, percent-clipped=0.0 2023-06-18 22:45:01,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=321930.0, ans=0.1 2023-06-18 22:45:37,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322050.0, ans=0.1 2023-06-18 22:45:45,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-18 22:46:12,260 INFO [train.py:996] (1/4) Epoch 2, batch 23200, loss[loss=0.3073, simple_loss=0.3457, pruned_loss=0.1345, over 21835.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3503, pruned_loss=0.1214, over 4281109.79 frames. ], batch size: 247, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:47:16,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322350.0, ans=0.1 2023-06-18 22:47:16,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=322350.0, ans=0.2 2023-06-18 22:47:47,996 INFO [train.py:996] (1/4) Epoch 2, batch 23250, loss[loss=0.3028, simple_loss=0.3528, pruned_loss=0.1264, over 21805.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3489, pruned_loss=0.1219, over 4280078.64 frames. 
], batch size: 282, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:47:59,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=322470.0, ans=0.2 2023-06-18 22:48:04,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.062e+02 3.496e+02 4.224e+02 8.959e+02, threshold=6.992e+02, percent-clipped=2.0 2023-06-18 22:48:25,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=322530.0, ans=0.015 2023-06-18 22:48:43,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-18 22:49:00,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=322650.0, ans=10.0 2023-06-18 22:49:25,577 INFO [train.py:996] (1/4) Epoch 2, batch 23300, loss[loss=0.3277, simple_loss=0.3933, pruned_loss=0.131, over 21681.00 frames. ], tot_loss[loss=0.3036, simple_loss=0.3585, pruned_loss=0.1243, over 4282287.93 frames. ], batch size: 441, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:49:26,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=322770.0, ans=0.0 2023-06-18 22:49:50,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=322830.0, ans=0.1 2023-06-18 22:51:02,216 INFO [train.py:996] (1/4) Epoch 2, batch 23350, loss[loss=0.307, simple_loss=0.4083, pruned_loss=0.1029, over 20761.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3636, pruned_loss=0.1234, over 4280865.01 frames. ], batch size: 607, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:51:18,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.248e+02 3.923e+02 4.916e+02 7.049e+02, threshold=7.847e+02, percent-clipped=1.0 2023-06-18 22:52:30,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=323310.0, ans=0.125 2023-06-18 22:52:37,479 INFO [train.py:996] (1/4) Epoch 2, batch 23400, loss[loss=0.2294, simple_loss=0.2984, pruned_loss=0.08026, over 21425.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3544, pruned_loss=0.118, over 4275608.76 frames. ], batch size: 211, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:52:42,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=323370.0, ans=0.2 2023-06-18 22:54:20,631 INFO [train.py:996] (1/4) Epoch 2, batch 23450, loss[loss=0.3452, simple_loss=0.3766, pruned_loss=0.1569, over 21591.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3545, pruned_loss=0.1202, over 4276750.40 frames. ], batch size: 507, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:54:33,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.119e+02 3.774e+02 4.736e+02 8.725e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-18 22:55:31,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=323850.0, ans=0.125 2023-06-18 22:55:59,317 INFO [train.py:996] (1/4) Epoch 2, batch 23500, loss[loss=0.2753, simple_loss=0.3327, pruned_loss=0.109, over 21465.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3554, pruned_loss=0.1222, over 4281742.74 frames. 
], batch size: 131, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:56:13,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=324030.0, ans=0.125 2023-06-18 22:56:28,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=324030.0, ans=0.125 2023-06-18 22:56:39,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=324090.0, ans=0.05 2023-06-18 22:56:45,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-18 22:56:51,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=324150.0, ans=0.0 2023-06-18 22:57:17,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-18 22:57:36,240 INFO [train.py:996] (1/4) Epoch 2, batch 23550, loss[loss=0.2978, simple_loss=0.344, pruned_loss=0.1259, over 21416.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.3513, pruned_loss=0.1229, over 4279412.98 frames. ], batch size: 389, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:57:48,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.808e+02 4.439e+02 7.936e+02, threshold=7.617e+02, percent-clipped=1.0 2023-06-18 22:57:58,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=324330.0, ans=0.2 2023-06-18 22:58:22,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=324390.0, ans=0.125 2023-06-18 22:58:41,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=324450.0, ans=0.125 2023-06-18 22:58:48,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=6.0 2023-06-18 22:58:57,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324510.0, ans=0.125 2023-06-18 22:59:05,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324510.0, ans=0.125 2023-06-18 22:59:13,992 INFO [train.py:996] (1/4) Epoch 2, batch 23600, loss[loss=0.3285, simple_loss=0.3856, pruned_loss=0.1357, over 21805.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3522, pruned_loss=0.1237, over 4286192.13 frames. 
], batch size: 124, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:59:35,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=324630.0, ans=0.04949747468305833 2023-06-18 23:00:13,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324690.0, ans=0.125 2023-06-18 23:00:31,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=324810.0, ans=0.0 2023-06-18 23:00:57,796 INFO [train.py:996] (1/4) Epoch 2, batch 23650, loss[loss=0.3112, simple_loss=0.3789, pruned_loss=0.1218, over 21840.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3516, pruned_loss=0.1215, over 4275972.89 frames. ], batch size: 371, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:01:10,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.434e+02 4.143e+02 5.445e+02 9.457e+02, threshold=8.285e+02, percent-clipped=4.0 2023-06-18 23:01:10,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=324870.0, ans=0.125 2023-06-18 23:01:56,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=325050.0, ans=0.125 2023-06-18 23:02:03,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=325050.0, ans=0.125 2023-06-18 23:02:39,090 INFO [train.py:996] (1/4) Epoch 2, batch 23700, loss[loss=0.3176, simple_loss=0.3646, pruned_loss=0.1353, over 21289.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3546, pruned_loss=0.1202, over 4279704.81 frames. ], batch size: 159, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:02:41,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=325170.0, ans=0.0 2023-06-18 23:03:04,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=325230.0, ans=0.125 2023-06-18 23:03:21,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325290.0, ans=0.125 2023-06-18 23:04:20,797 INFO [train.py:996] (1/4) Epoch 2, batch 23750, loss[loss=0.3125, simple_loss=0.3757, pruned_loss=0.1246, over 21691.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3569, pruned_loss=0.1206, over 4277039.00 frames. ], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:04:35,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=325470.0, ans=0.125 2023-06-18 23:04:38,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.156e+02 3.937e+02 4.892e+02 1.167e+03, threshold=7.875e+02, percent-clipped=3.0 2023-06-18 23:05:00,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=325530.0, ans=0.125 2023-06-18 23:05:48,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=12.0 2023-06-18 23:05:53,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325710.0, ans=0.1 2023-06-18 23:06:06,049 INFO [train.py:996] (1/4) Epoch 2, batch 23800, loss[loss=0.3163, simple_loss=0.3883, pruned_loss=0.1222, over 21713.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3548, pruned_loss=0.1178, over 4268851.88 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:07:09,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=325950.0, ans=0.125 2023-06-18 23:07:35,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=326010.0, ans=0.125 2023-06-18 23:07:35,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=326010.0, ans=0.125 2023-06-18 23:07:48,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=326010.0, ans=0.09899494936611666 2023-06-18 23:07:51,517 INFO [train.py:996] (1/4) Epoch 2, batch 23850, loss[loss=0.3289, simple_loss=0.3892, pruned_loss=0.1343, over 21812.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3653, pruned_loss=0.1205, over 4268405.58 frames. ], batch size: 282, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:08:09,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.353e+02 4.244e+02 5.255e+02 8.980e+02, threshold=8.488e+02, percent-clipped=3.0 2023-06-18 23:08:19,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=326130.0, ans=0.125 2023-06-18 23:08:21,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-18 23:08:51,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=326250.0, ans=0.0 2023-06-18 23:09:17,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=326310.0, ans=0.125 2023-06-18 23:09:30,855 INFO [train.py:996] (1/4) Epoch 2, batch 23900, loss[loss=0.3016, simple_loss=0.3719, pruned_loss=0.1157, over 21251.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3736, pruned_loss=0.1247, over 4271234.59 frames. ], batch size: 143, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:10:28,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-18 23:11:09,862 INFO [train.py:996] (1/4) Epoch 2, batch 23950, loss[loss=0.2706, simple_loss=0.3179, pruned_loss=0.1117, over 21855.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3658, pruned_loss=0.124, over 4258309.57 frames. 
], batch size: 107, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:11:14,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=326670.0, ans=0.0 2023-06-18 23:11:28,003 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.116e+02 3.830e+02 4.465e+02 7.558e+02, threshold=7.660e+02, percent-clipped=0.0 2023-06-18 23:12:37,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=326910.0, ans=0.0 2023-06-18 23:12:50,758 INFO [train.py:996] (1/4) Epoch 2, batch 24000, loss[loss=0.3547, simple_loss=0.4207, pruned_loss=0.1444, over 18273.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3657, pruned_loss=0.1265, over 4253786.69 frames. ], batch size: 60, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:12:50,759 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-18 23:13:09,343 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2897, simple_loss=0.3899, pruned_loss=0.09475, over 1796401.00 frames. 2023-06-18 23:13:09,344 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-18 23:13:44,006 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:14:05,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=327090.0, ans=0.125 2023-06-18 23:14:27,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-18 23:14:48,740 INFO [train.py:996] (1/4) Epoch 2, batch 24050, loss[loss=0.2501, simple_loss=0.3303, pruned_loss=0.08501, over 21625.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3671, pruned_loss=0.1267, over 4261046.48 frames. ], batch size: 230, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:14:51,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=327270.0, ans=0.125 2023-06-18 23:14:58,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=327270.0, ans=0.125 2023-06-18 23:15:06,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.470e+02 4.161e+02 4.943e+02 1.064e+03, threshold=8.323e+02, percent-clipped=4.0 2023-06-18 23:15:17,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.01 vs. 
limit=15.0 2023-06-18 23:15:26,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=327330.0, ans=0.0 2023-06-18 23:15:28,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327330.0, ans=0.125 2023-06-18 23:15:53,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327450.0, ans=0.1 2023-06-18 23:15:54,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=327450.0, ans=0.0 2023-06-18 23:16:30,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=327510.0, ans=0.125 2023-06-18 23:16:34,954 INFO [train.py:996] (1/4) Epoch 2, batch 24100, loss[loss=0.408, simple_loss=0.4463, pruned_loss=0.1848, over 21463.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.3664, pruned_loss=0.1241, over 4265087.73 frames. ], batch size: 471, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:17:26,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327690.0, ans=0.125 2023-06-18 23:17:43,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=327750.0, ans=0.125 2023-06-18 23:18:01,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327810.0, ans=0.125 2023-06-18 23:18:14,206 INFO [train.py:996] (1/4) Epoch 2, batch 24150, loss[loss=0.3229, simple_loss=0.3826, pruned_loss=0.1316, over 21875.00 frames. ], tot_loss[loss=0.3094, simple_loss=0.3661, pruned_loss=0.1264, over 4268134.53 frames. ], batch size: 124, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:18:26,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.990e+02 3.404e+02 4.259e+02 8.342e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-18 23:19:10,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-18 23:19:55,233 INFO [train.py:996] (1/4) Epoch 2, batch 24200, loss[loss=0.2876, simple_loss=0.3603, pruned_loss=0.1074, over 21650.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3662, pruned_loss=0.127, over 4268088.91 frames. ], batch size: 247, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:19:58,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=328170.0, ans=0.0 2023-06-18 23:20:41,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328290.0, ans=0.1 2023-06-18 23:21:21,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=328410.0, ans=0.07 2023-06-18 23:21:41,613 INFO [train.py:996] (1/4) Epoch 2, batch 24250, loss[loss=0.2303, simple_loss=0.3248, pruned_loss=0.0679, over 21743.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3613, pruned_loss=0.1173, over 4276337.92 frames. 
], batch size: 298, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:21:59,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 3.026e+02 3.601e+02 5.036e+02 9.709e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-18 23:22:35,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-18 23:22:52,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=328650.0, ans=0.125 2023-06-18 23:23:12,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=328710.0, ans=0.125 2023-06-18 23:23:20,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=328770.0, ans=0.125 2023-06-18 23:23:21,442 INFO [train.py:996] (1/4) Epoch 2, batch 24300, loss[loss=0.1967, simple_loss=0.2824, pruned_loss=0.05551, over 21677.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3564, pruned_loss=0.1116, over 4278855.55 frames. ], batch size: 414, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:23:42,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-18 23:24:13,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=328890.0, ans=0.2 2023-06-18 23:24:31,621 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:25:04,604 INFO [train.py:996] (1/4) Epoch 2, batch 24350, loss[loss=0.3245, simple_loss=0.3706, pruned_loss=0.1392, over 21790.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3531, pruned_loss=0.1124, over 4284186.19 frames. ], batch size: 112, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:25:08,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=329070.0, ans=0.05 2023-06-18 23:25:14,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=329070.0, ans=0.0 2023-06-18 23:25:14,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=329070.0, ans=0.0 2023-06-18 23:25:17,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.896e+02 3.511e+02 4.657e+02 9.016e+02, threshold=7.022e+02, percent-clipped=4.0 2023-06-18 23:25:18,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-18 23:25:42,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=329190.0, ans=0.0 2023-06-18 23:25:52,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=329190.0, ans=0.1 2023-06-18 23:26:35,177 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:26:45,783 INFO [train.py:996] (1/4) Epoch 2, batch 24400, loss[loss=0.333, simple_loss=0.3941, pruned_loss=0.136, over 21590.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3574, pruned_loss=0.1166, over 4286643.55 frames. 
], batch size: 389, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:26:47,876 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:27:17,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=329430.0, ans=0.0 2023-06-18 23:27:39,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=329490.0, ans=0.025 2023-06-18 23:27:52,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=329550.0, ans=0.125 2023-06-18 23:28:00,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=329550.0, ans=0.125 2023-06-18 23:28:25,838 INFO [train.py:996] (1/4) Epoch 2, batch 24450, loss[loss=0.2255, simple_loss=0.3044, pruned_loss=0.07334, over 21417.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3594, pruned_loss=0.1191, over 4287058.09 frames. ], batch size: 194, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:28:37,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=329670.0, ans=0.1 2023-06-18 23:28:38,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.395e+02 4.151e+02 4.993e+02 8.571e+02, threshold=8.301e+02, percent-clipped=4.0 2023-06-18 23:28:40,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=329730.0, ans=0.125 2023-06-18 23:28:47,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=329730.0, ans=0.0 2023-06-18 23:28:47,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-18 23:30:04,423 INFO [train.py:996] (1/4) Epoch 2, batch 24500, loss[loss=0.2591, simple_loss=0.3185, pruned_loss=0.09985, over 21538.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3588, pruned_loss=0.1183, over 4283305.39 frames. ], batch size: 212, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:30:12,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-18 23:30:39,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=330030.0, ans=0.2 2023-06-18 23:30:55,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.81 vs. limit=10.0 2023-06-18 23:31:20,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=12.0 2023-06-18 23:31:20,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-18 23:31:44,652 INFO [train.py:996] (1/4) Epoch 2, batch 24550, loss[loss=0.4307, simple_loss=0.4576, pruned_loss=0.2019, over 21320.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3626, pruned_loss=0.1221, over 4278874.69 frames. 
], batch size: 507, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:31:45,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=330270.0, ans=0.125 2023-06-18 23:31:49,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=330270.0, ans=0.0 2023-06-18 23:32:00,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=12.0 2023-06-18 23:32:01,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.061e+02 3.714e+02 4.494e+02 1.254e+03, threshold=7.429e+02, percent-clipped=1.0 2023-06-18 23:32:10,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=330330.0, ans=0.2 2023-06-18 23:32:15,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=330330.0, ans=0.125 2023-06-18 23:33:20,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=330510.0, ans=0.1 2023-06-18 23:33:22,616 INFO [train.py:996] (1/4) Epoch 2, batch 24600, loss[loss=0.2532, simple_loss=0.3027, pruned_loss=0.1019, over 21495.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3589, pruned_loss=0.1228, over 4277093.21 frames. ], batch size: 195, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:33:25,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-18 23:33:26,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=330570.0, ans=0.125 2023-06-18 23:33:29,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330570.0, ans=0.125 2023-06-18 23:34:19,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=330690.0, ans=0.0 2023-06-18 23:34:19,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=330690.0, ans=0.125 2023-06-18 23:34:33,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=330750.0, ans=0.125 2023-06-18 23:34:38,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=330750.0, ans=0.2 2023-06-18 23:34:42,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=330750.0, ans=0.1 2023-06-18 23:34:49,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=330810.0, ans=0.5 2023-06-18 23:34:50,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=330810.0, ans=0.0 2023-06-18 23:34:51,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-18 23:35:01,691 INFO [train.py:996] (1/4) Epoch 2, batch 24650, loss[loss=0.2511, simple_loss=0.3062, pruned_loss=0.09798, over 21959.00 frames. 
], tot_loss[loss=0.2973, simple_loss=0.3512, pruned_loss=0.1217, over 4271120.52 frames. ], batch size: 113, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:35:11,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=330870.0, ans=0.2 2023-06-18 23:35:11,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=330870.0, ans=0.2 2023-06-18 23:35:19,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.192e+02 3.864e+02 5.203e+02 1.017e+03, threshold=7.727e+02, percent-clipped=5.0 2023-06-18 23:35:21,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=330930.0, ans=0.0 2023-06-18 23:35:51,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 23:36:35,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=331170.0, ans=0.0 2023-06-18 23:36:36,255 INFO [train.py:996] (1/4) Epoch 2, batch 24700, loss[loss=0.3002, simple_loss=0.3498, pruned_loss=0.1252, over 21452.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3502, pruned_loss=0.1199, over 4268675.43 frames. ], batch size: 389, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:36:56,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.48 vs. limit=15.0 2023-06-18 23:37:49,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=331350.0, ans=0.125 2023-06-18 23:37:53,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=331350.0, ans=0.05 2023-06-18 23:38:13,922 INFO [train.py:996] (1/4) Epoch 2, batch 24750, loss[loss=0.3077, simple_loss=0.3287, pruned_loss=0.1433, over 21386.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3415, pruned_loss=0.116, over 4244615.37 frames. ], batch size: 509, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:38:33,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 3.050e+02 3.889e+02 4.934e+02 8.372e+02, threshold=7.777e+02, percent-clipped=3.0 2023-06-18 23:39:23,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=331650.0, ans=0.0 2023-06-18 23:39:46,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=331770.0, ans=0.125 2023-06-18 23:39:47,033 INFO [train.py:996] (1/4) Epoch 2, batch 24800, loss[loss=0.2921, simple_loss=0.3492, pruned_loss=0.1175, over 21875.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3374, pruned_loss=0.1167, over 4255589.74 frames. 
], batch size: 124, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:40:13,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=331830.0, ans=0.125 2023-06-18 23:40:52,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=331950.0, ans=0.125 2023-06-18 23:41:15,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=332010.0, ans=0.125 2023-06-18 23:41:21,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=332010.0, ans=0.0 2023-06-18 23:41:26,055 INFO [train.py:996] (1/4) Epoch 2, batch 24850, loss[loss=0.283, simple_loss=0.3263, pruned_loss=0.1198, over 21227.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3383, pruned_loss=0.1176, over 4260341.85 frames. ], batch size: 608, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:41:30,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332070.0, ans=0.1 2023-06-18 23:41:50,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.323e+02 4.351e+02 5.576e+02 8.938e+02, threshold=8.701e+02, percent-clipped=5.0 2023-06-18 23:42:25,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332190.0, ans=0.1 2023-06-18 23:42:31,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-18 23:42:32,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-18 23:42:35,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332250.0, ans=0.125 2023-06-18 23:43:03,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=332370.0, ans=0.95 2023-06-18 23:43:09,979 INFO [train.py:996] (1/4) Epoch 2, batch 24900, loss[loss=0.4183, simple_loss=0.4455, pruned_loss=0.1955, over 21438.00 frames. ], tot_loss[loss=0.2892, simple_loss=0.3414, pruned_loss=0.1185, over 4260926.56 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:43:15,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=332370.0, ans=0.0 2023-06-18 23:43:37,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=12.0 2023-06-18 23:43:52,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332430.0, ans=0.1 2023-06-18 23:44:31,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=332610.0, ans=0.0 2023-06-18 23:44:36,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=332610.0, ans=0.2 2023-06-18 23:44:46,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=332610.0, ans=0.125 2023-06-18 23:44:55,487 INFO [train.py:996] (1/4) Epoch 2, batch 24950, loss[loss=0.3283, simple_loss=0.3927, pruned_loss=0.1319, over 21377.00 frames. ], tot_loss[loss=0.3, simple_loss=0.3517, pruned_loss=0.1242, over 4261474.82 frames. ], batch size: 131, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:45:07,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=332670.0, ans=0.125 2023-06-18 23:45:15,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.427e+02 4.669e+02 5.544e+02 9.304e+02, threshold=9.338e+02, percent-clipped=1.0 2023-06-18 23:45:42,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2023-06-18 23:46:40,037 INFO [train.py:996] (1/4) Epoch 2, batch 25000, loss[loss=0.2695, simple_loss=0.3243, pruned_loss=0.1074, over 21395.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3598, pruned_loss=0.1267, over 4267540.47 frames. ], batch size: 211, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:46:45,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=332970.0, ans=0.1 2023-06-18 23:47:05,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=333030.0, ans=0.125 2023-06-18 23:47:21,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=333090.0, ans=0.125 2023-06-18 23:47:36,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=333150.0, ans=0.125 2023-06-18 23:47:58,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=333210.0, ans=0.5 2023-06-18 23:48:16,993 INFO [train.py:996] (1/4) Epoch 2, batch 25050, loss[loss=0.2687, simple_loss=0.3138, pruned_loss=0.1118, over 21825.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3508, pruned_loss=0.1233, over 4274015.17 frames. 
], batch size: 98, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:48:33,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=333270.0, ans=0.125 2023-06-18 23:48:36,477 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.031e+02 3.621e+02 4.496e+02 7.145e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 23:49:48,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=333510.0, ans=0.2 2023-06-18 23:49:48,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=333510.0, ans=0.025 2023-06-18 23:49:55,952 INFO [train.py:996] (1/4) Epoch 2, batch 25100, loss[loss=0.2788, simple_loss=0.3534, pruned_loss=0.1021, over 21333.00 frames. ], tot_loss[loss=0.2932, simple_loss=0.3448, pruned_loss=0.1208, over 4269533.26 frames. ], batch size: 176, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:50:37,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=333690.0, ans=0.1 2023-06-18 23:51:33,424 INFO [train.py:996] (1/4) Epoch 2, batch 25150, loss[loss=0.292, simple_loss=0.3609, pruned_loss=0.1116, over 21879.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3469, pruned_loss=0.1181, over 4263520.88 frames. ], batch size: 316, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:51:48,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.003e+02 3.483e+02 4.487e+02 9.549e+02, threshold=6.965e+02, percent-clipped=3.0 2023-06-18 23:52:24,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=334050.0, ans=0.125 2023-06-18 23:52:31,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-18 23:52:38,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334050.0, ans=0.1 2023-06-18 23:53:11,744 INFO [train.py:996] (1/4) Epoch 2, batch 25200, loss[loss=0.3154, simple_loss=0.391, pruned_loss=0.1199, over 21674.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3472, pruned_loss=0.1162, over 4249101.69 frames. ], batch size: 414, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:53:24,794 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:53:28,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-18 23:53:41,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=334230.0, ans=0.035 2023-06-18 23:54:39,776 INFO [train.py:996] (1/4) Epoch 2, batch 25250, loss[loss=0.2457, simple_loss=0.2974, pruned_loss=0.097, over 21320.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3437, pruned_loss=0.1136, over 4240148.83 frames. 
], batch size: 131, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:54:46,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=334470.0, ans=0.0 2023-06-18 23:55:00,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-18 23:55:04,410 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.756e+02 3.618e+02 4.524e+02 8.260e+02, threshold=7.237e+02, percent-clipped=4.0 2023-06-18 23:55:06,439 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:55:22,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=334530.0, ans=0.0 2023-06-18 23:55:38,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=334590.0, ans=0.1 2023-06-18 23:55:49,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=334650.0, ans=0.035 2023-06-18 23:56:15,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-18 23:56:24,649 INFO [train.py:996] (1/4) Epoch 2, batch 25300, loss[loss=0.3645, simple_loss=0.4058, pruned_loss=0.1616, over 21198.00 frames. ], tot_loss[loss=0.284, simple_loss=0.342, pruned_loss=0.113, over 4248759.68 frames. ], batch size: 143, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:56:56,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-18 23:57:00,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=334830.0, ans=0.125 2023-06-18 23:58:10,259 INFO [train.py:996] (1/4) Epoch 2, batch 25350, loss[loss=0.2629, simple_loss=0.3433, pruned_loss=0.09127, over 21647.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3441, pruned_loss=0.1132, over 4238795.10 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:58:29,557 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.861e+02 3.471e+02 4.257e+02 9.448e+02, threshold=6.941e+02, percent-clipped=2.0 2023-06-18 23:58:59,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-18 23:59:02,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=335250.0, ans=0.0 2023-06-18 23:59:40,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335310.0, ans=0.1 2023-06-18 23:59:44,346 INFO [train.py:996] (1/4) Epoch 2, batch 25400, loss[loss=0.2811, simple_loss=0.3224, pruned_loss=0.1199, over 21486.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3382, pruned_loss=0.1115, over 4243872.45 frames. 
], batch size: 230, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:59:55,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=335370.0, ans=0.0 2023-06-19 00:00:09,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-19 00:00:44,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-19 00:01:22,818 INFO [train.py:996] (1/4) Epoch 2, batch 25450, loss[loss=0.2681, simple_loss=0.3371, pruned_loss=0.09956, over 21345.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.3409, pruned_loss=0.1151, over 4244378.41 frames. ], batch size: 159, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:01:47,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.942e+02 3.491e+02 4.451e+02 7.396e+02, threshold=6.982e+02, percent-clipped=1.0 2023-06-19 00:02:09,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=335790.0, ans=10.0 2023-06-19 00:03:05,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=335910.0, ans=0.125 2023-06-19 00:03:09,095 INFO [train.py:996] (1/4) Epoch 2, batch 25500, loss[loss=0.2918, simple_loss=0.367, pruned_loss=0.1083, over 21827.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3417, pruned_loss=0.1105, over 4243682.31 frames. ], batch size: 316, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:04:19,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=336150.0, ans=0.2 2023-06-19 00:04:23,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=336210.0, ans=0.2 2023-06-19 00:04:56,928 INFO [train.py:996] (1/4) Epoch 2, batch 25550, loss[loss=0.2407, simple_loss=0.3379, pruned_loss=0.0718, over 21554.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3485, pruned_loss=0.1106, over 4248556.67 frames. ], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:04:57,485 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:05:12,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.591e+02 3.118e+02 3.638e+02 5.445e+02, threshold=6.236e+02, percent-clipped=0.0 2023-06-19 00:05:38,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=336390.0, ans=0.05 2023-06-19 00:06:26,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-19 00:06:37,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-19 00:06:38,365 INFO [train.py:996] (1/4) Epoch 2, batch 25600, loss[loss=0.3627, simple_loss=0.4086, pruned_loss=0.1584, over 21778.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3543, pruned_loss=0.1129, over 4263816.54 frames. 
], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:06:42,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=336570.0, ans=0.07 2023-06-19 00:06:53,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-19 00:06:55,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=336630.0, ans=0.125 2023-06-19 00:07:03,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=22.5 2023-06-19 00:07:37,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=336750.0, ans=0.1 2023-06-19 00:08:17,760 INFO [train.py:996] (1/4) Epoch 2, batch 25650, loss[loss=0.2664, simple_loss=0.3225, pruned_loss=0.1052, over 21407.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.3549, pruned_loss=0.1157, over 4265451.79 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:08:31,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.010e+02 3.647e+02 4.694e+02 1.135e+03, threshold=7.294e+02, percent-clipped=6.0 2023-06-19 00:09:17,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=337050.0, ans=0.0 2023-06-19 00:09:35,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337110.0, ans=0.1 2023-06-19 00:09:57,155 INFO [train.py:996] (1/4) Epoch 2, batch 25700, loss[loss=0.2877, simple_loss=0.3332, pruned_loss=0.1211, over 21815.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3523, pruned_loss=0.1174, over 4263880.87 frames. ], batch size: 371, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:10:03,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=337170.0, ans=0.0 2023-06-19 00:10:03,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=337170.0, ans=0.125 2023-06-19 00:10:11,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337230.0, ans=0.1 2023-06-19 00:10:23,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-19 00:10:51,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=337290.0, ans=0.125 2023-06-19 00:11:38,888 INFO [train.py:996] (1/4) Epoch 2, batch 25750, loss[loss=0.3048, simple_loss=0.3635, pruned_loss=0.1231, over 21824.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3565, pruned_loss=0.12, over 4266434.71 frames. ], batch size: 282, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:11:54,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.020e+02 3.881e+02 5.422e+02 1.342e+03, threshold=7.762e+02, percent-clipped=9.0 2023-06-19 00:11:57,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. 
limit=6.0 2023-06-19 00:12:06,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=337530.0, ans=0.04949747468305833 2023-06-19 00:13:05,171 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:13:26,833 INFO [train.py:996] (1/4) Epoch 2, batch 25800, loss[loss=0.3406, simple_loss=0.3928, pruned_loss=0.1442, over 21322.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3696, pruned_loss=0.126, over 4273693.00 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:13:48,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=337830.0, ans=0.1 2023-06-19 00:14:13,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337890.0, ans=0.1 2023-06-19 00:14:42,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=338010.0, ans=0.125 2023-06-19 00:15:02,330 INFO [train.py:996] (1/4) Epoch 2, batch 25850, loss[loss=0.285, simple_loss=0.3483, pruned_loss=0.1109, over 21889.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3706, pruned_loss=0.1251, over 4273924.29 frames. ], batch size: 332, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:15:06,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.27 vs. limit=10.0 2023-06-19 00:15:26,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.253e+02 3.802e+02 4.832e+02 7.273e+02, threshold=7.603e+02, percent-clipped=0.0 2023-06-19 00:15:47,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-19 00:15:57,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-19 00:16:53,751 INFO [train.py:996] (1/4) Epoch 2, batch 25900, loss[loss=0.4364, simple_loss=0.4867, pruned_loss=0.1931, over 21673.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3723, pruned_loss=0.1265, over 4280591.93 frames. ], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:17:25,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338430.0, ans=0.1 2023-06-19 00:18:39,796 INFO [train.py:996] (1/4) Epoch 2, batch 25950, loss[loss=0.3147, simple_loss=0.3714, pruned_loss=0.129, over 21318.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3799, pruned_loss=0.1295, over 4278179.85 frames. ], batch size: 176, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:18:43,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=338670.0, ans=0.1 2023-06-19 00:18:54,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.185e+02 3.771e+02 4.566e+02 7.877e+02, threshold=7.541e+02, percent-clipped=2.0 2023-06-19 00:20:13,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. 
limit=6.0 2023-06-19 00:20:20,770 INFO [train.py:996] (1/4) Epoch 2, batch 26000, loss[loss=0.233, simple_loss=0.3064, pruned_loss=0.07984, over 16270.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3789, pruned_loss=0.1282, over 4271291.99 frames. ], batch size: 61, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:20:30,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=338970.0, ans=0.125 2023-06-19 00:22:00,574 INFO [train.py:996] (1/4) Epoch 2, batch 26050, loss[loss=0.374, simple_loss=0.396, pruned_loss=0.1761, over 21851.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3793, pruned_loss=0.1304, over 4275881.17 frames. ], batch size: 510, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:22:14,502 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.293e+02 3.871e+02 4.573e+02 8.054e+02, threshold=7.741e+02, percent-clipped=1.0 2023-06-19 00:22:34,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=339390.0, ans=0.2 2023-06-19 00:23:04,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=339450.0, ans=0.125 2023-06-19 00:23:28,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=339510.0, ans=0.125 2023-06-19 00:23:38,901 INFO [train.py:996] (1/4) Epoch 2, batch 26100, loss[loss=0.268, simple_loss=0.315, pruned_loss=0.1105, over 21349.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3723, pruned_loss=0.1291, over 4286257.57 frames. ], batch size: 176, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:24:37,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=339750.0, ans=0.125 2023-06-19 00:24:55,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=339750.0, ans=0.125 2023-06-19 00:25:05,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=339810.0, ans=0.0 2023-06-19 00:25:17,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=339810.0, ans=0.0 2023-06-19 00:25:20,008 INFO [train.py:996] (1/4) Epoch 2, batch 26150, loss[loss=0.2904, simple_loss=0.3457, pruned_loss=0.1175, over 21593.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3696, pruned_loss=0.1294, over 4290607.20 frames. 
], batch size: 263, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:25:21,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339870.0, ans=0.1 2023-06-19 00:25:34,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.250e+02 3.966e+02 5.249e+02 8.349e+02, threshold=7.932e+02, percent-clipped=3.0 2023-06-19 00:25:43,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=339930.0, ans=0.2 2023-06-19 00:25:56,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=339990.0, ans=0.04949747468305833 2023-06-19 00:26:20,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=339990.0, ans=0.1 2023-06-19 00:26:31,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=340050.0, ans=0.1 2023-06-19 00:27:00,528 INFO [train.py:996] (1/4) Epoch 2, batch 26200, loss[loss=0.3347, simple_loss=0.4185, pruned_loss=0.1255, over 21674.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3694, pruned_loss=0.1262, over 4287940.43 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:27:12,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=340170.0, ans=0.125 2023-06-19 00:27:13,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-19 00:28:19,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-19 00:28:28,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=340410.0, ans=0.2 2023-06-19 00:28:39,144 INFO [train.py:996] (1/4) Epoch 2, batch 26250, loss[loss=0.3308, simple_loss=0.4048, pruned_loss=0.1284, over 21419.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3725, pruned_loss=0.1248, over 4293332.76 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:28:49,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-19 00:28:51,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340470.0, ans=0.1 2023-06-19 00:28:53,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=340470.0, ans=0.125 2023-06-19 00:28:54,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.946e+02 3.629e+02 4.371e+02 7.049e+02, threshold=7.257e+02, percent-clipped=0.0 2023-06-19 00:30:08,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.57 vs. 
limit=6.0 2023-06-19 00:30:11,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=340710.0, ans=0.0 2023-06-19 00:30:14,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=340710.0, ans=0.0 2023-06-19 00:30:20,173 INFO [train.py:996] (1/4) Epoch 2, batch 26300, loss[loss=0.2926, simple_loss=0.3492, pruned_loss=0.118, over 21530.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3687, pruned_loss=0.1249, over 4288755.34 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:30:45,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340830.0, ans=0.1 2023-06-19 00:31:14,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340890.0, ans=0.125 2023-06-19 00:31:17,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=340890.0, ans=0.125 2023-06-19 00:31:23,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=340890.0, ans=0.125 2023-06-19 00:31:41,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=340950.0, ans=0.125 2023-06-19 00:31:54,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=341010.0, ans=0.0 2023-06-19 00:32:00,708 INFO [train.py:996] (1/4) Epoch 2, batch 26350, loss[loss=0.3033, simple_loss=0.356, pruned_loss=0.1253, over 21840.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3681, pruned_loss=0.1266, over 4297734.78 frames. ], batch size: 282, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:32:29,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 3.077e+02 3.703e+02 4.775e+02 7.605e+02, threshold=7.406e+02, percent-clipped=1.0 2023-06-19 00:33:05,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=341250.0, ans=0.125 2023-06-19 00:33:24,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=341310.0, ans=0.125 2023-06-19 00:33:38,847 INFO [train.py:996] (1/4) Epoch 2, batch 26400, loss[loss=0.3225, simple_loss=0.3385, pruned_loss=0.1532, over 21502.00 frames. ], tot_loss[loss=0.3085, simple_loss=0.3629, pruned_loss=0.127, over 4274879.21 frames. ], batch size: 512, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:33:41,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=341370.0, ans=0.0 2023-06-19 00:33:43,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-19 00:33:45,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=341370.0, ans=0.125 2023-06-19 00:34:02,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=15.0 2023-06-19 00:34:48,005 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:35:04,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-19 00:35:33,176 INFO [train.py:996] (1/4) Epoch 2, batch 26450, loss[loss=0.2873, simple_loss=0.3718, pruned_loss=0.1014, over 21663.00 frames. ], tot_loss[loss=0.3082, simple_loss=0.363, pruned_loss=0.1267, over 4262437.12 frames. ], batch size: 247, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:35:57,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=341670.0, ans=0.125 2023-06-19 00:35:58,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.117e+02 3.740e+02 5.003e+02 1.177e+03, threshold=7.481e+02, percent-clipped=6.0 2023-06-19 00:36:11,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 00:36:22,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-19 00:36:25,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=341790.0, ans=0.125 2023-06-19 00:36:26,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=341790.0, ans=0.125 2023-06-19 00:36:28,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=341790.0, ans=0.125 2023-06-19 00:37:14,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=341910.0, ans=0.125 2023-06-19 00:37:20,181 INFO [train.py:996] (1/4) Epoch 2, batch 26500, loss[loss=0.3586, simple_loss=0.4218, pruned_loss=0.1477, over 21689.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3644, pruned_loss=0.125, over 4268047.62 frames. ], batch size: 414, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:37:27,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=341970.0, ans=0.1 2023-06-19 00:38:04,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=342090.0, ans=0.0 2023-06-19 00:39:03,668 INFO [train.py:996] (1/4) Epoch 2, batch 26550, loss[loss=0.2664, simple_loss=0.3546, pruned_loss=0.08906, over 21614.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3605, pruned_loss=0.1205, over 4262361.83 frames. ], batch size: 389, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:39:10,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=342270.0, ans=0.1 2023-06-19 00:39:19,864 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.448e+02 4.337e+02 5.433e+02 9.319e+02, threshold=8.673e+02, percent-clipped=7.0 2023-06-19 00:39:33,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. 
limit=15.0 2023-06-19 00:40:42,948 INFO [train.py:996] (1/4) Epoch 2, batch 26600, loss[loss=0.2619, simple_loss=0.3184, pruned_loss=0.1027, over 21529.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3585, pruned_loss=0.1163, over 4264401.14 frames. ], batch size: 230, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:40:45,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.45 vs. limit=15.0 2023-06-19 00:40:48,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-19 00:42:18,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=342810.0, ans=0.0 2023-06-19 00:42:22,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=342870.0, ans=12.0 2023-06-19 00:42:22,716 INFO [train.py:996] (1/4) Epoch 2, batch 26650, loss[loss=0.2201, simple_loss=0.3089, pruned_loss=0.06569, over 20793.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3517, pruned_loss=0.1159, over 4249837.17 frames. ], batch size: 609, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:42:29,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=342870.0, ans=0.0 2023-06-19 00:42:29,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=342870.0, ans=0.1 2023-06-19 00:42:38,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 3.189e+02 3.874e+02 5.287e+02 9.951e+02, threshold=7.747e+02, percent-clipped=1.0 2023-06-19 00:42:49,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-19 00:43:04,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=342990.0, ans=10.0 2023-06-19 00:43:10,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=342990.0, ans=0.125 2023-06-19 00:43:37,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=343050.0, ans=0.0 2023-06-19 00:43:44,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-19 00:43:54,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-19 00:43:56,117 INFO [train.py:996] (1/4) Epoch 2, batch 26700, loss[loss=0.2737, simple_loss=0.3234, pruned_loss=0.1121, over 21760.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3441, pruned_loss=0.1115, over 4258477.85 frames. 
], batch size: 247, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:44:05,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=343170.0, ans=0.125 2023-06-19 00:44:10,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=343170.0, ans=0.0 2023-06-19 00:44:13,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=343230.0, ans=0.2 2023-06-19 00:45:05,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=343350.0, ans=0.2 2023-06-19 00:45:30,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=343410.0, ans=0.125 2023-06-19 00:45:37,236 INFO [train.py:996] (1/4) Epoch 2, batch 26750, loss[loss=0.3272, simple_loss=0.3816, pruned_loss=0.1364, over 21307.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3425, pruned_loss=0.1094, over 4265401.56 frames. ], batch size: 159, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:45:39,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=343470.0, ans=0.0 2023-06-19 00:45:58,481 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.751e+02 3.226e+02 3.870e+02 9.468e+02, threshold=6.452e+02, percent-clipped=0.0 2023-06-19 00:46:39,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=343590.0, ans=0.0 2023-06-19 00:47:06,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-19 00:47:18,768 INFO [train.py:996] (1/4) Epoch 2, batch 26800, loss[loss=0.2779, simple_loss=0.3282, pruned_loss=0.1138, over 20050.00 frames. ], tot_loss[loss=0.292, simple_loss=0.352, pruned_loss=0.1161, over 4270756.54 frames. ], batch size: 703, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:47:36,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-19 00:47:57,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-19 00:48:17,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343890.0, ans=0.1 2023-06-19 00:48:38,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=344010.0, ans=0.125 2023-06-19 00:48:54,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=344010.0, ans=0.0 2023-06-19 00:48:58,641 INFO [train.py:996] (1/4) Epoch 2, batch 26850, loss[loss=0.2864, simple_loss=0.3187, pruned_loss=0.127, over 21405.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3544, pruned_loss=0.1197, over 4273932.03 frames. 
], batch size: 177, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:49:20,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=344070.0, ans=0.125 2023-06-19 00:49:23,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=344070.0, ans=0.125 2023-06-19 00:49:26,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=344070.0, ans=0.0 2023-06-19 00:49:29,084 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.578e+02 3.525e+02 4.180e+02 5.123e+02 1.126e+03, threshold=8.361e+02, percent-clipped=11.0 2023-06-19 00:50:05,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=344250.0, ans=0.0 2023-06-19 00:50:37,543 INFO [train.py:996] (1/4) Epoch 2, batch 26900, loss[loss=0.233, simple_loss=0.2835, pruned_loss=0.0913, over 21566.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3452, pruned_loss=0.1179, over 4272426.27 frames. ], batch size: 247, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:51:05,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=344430.0, ans=0.1 2023-06-19 00:51:11,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=344430.0, ans=0.125 2023-06-19 00:51:21,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=344490.0, ans=0.125 2023-06-19 00:51:56,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-19 00:52:12,716 INFO [train.py:996] (1/4) Epoch 2, batch 26950, loss[loss=0.3313, simple_loss=0.4009, pruned_loss=0.1308, over 21706.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3441, pruned_loss=0.1179, over 4265900.34 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:52:43,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.137e+02 3.675e+02 4.908e+02 1.022e+03, threshold=7.351e+02, percent-clipped=1.0 2023-06-19 00:53:52,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-19 00:54:08,190 INFO [train.py:996] (1/4) Epoch 2, batch 27000, loss[loss=0.2575, simple_loss=0.3181, pruned_loss=0.09846, over 21225.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3432, pruned_loss=0.1141, over 4259600.08 frames. ], batch size: 159, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:54:08,190 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 00:54:25,469 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2623, simple_loss=0.361, pruned_loss=0.08186, over 1796401.00 frames. 
2023-06-19 00:54:25,470 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 00:54:26,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=344970.0, ans=0.125 2023-06-19 00:54:32,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=344970.0, ans=0.125 2023-06-19 00:54:56,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=345030.0, ans=0.025 2023-06-19 00:55:15,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=345150.0, ans=0.125 2023-06-19 00:55:15,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345150.0, ans=0.1 2023-06-19 00:55:49,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=345210.0, ans=0.125 2023-06-19 00:56:06,864 INFO [train.py:996] (1/4) Epoch 2, batch 27050, loss[loss=0.2711, simple_loss=0.3379, pruned_loss=0.1021, over 21222.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3452, pruned_loss=0.1103, over 4264886.19 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:56:17,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=345270.0, ans=0.0 2023-06-19 00:56:23,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.868e+02 3.454e+02 4.544e+02 1.088e+03, threshold=6.909e+02, percent-clipped=2.0 2023-06-19 00:56:28,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=345330.0, ans=0.125 2023-06-19 00:56:30,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345330.0, ans=0.1 2023-06-19 00:56:31,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345330.0, ans=0.1 2023-06-19 00:56:36,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=345330.0, ans=0.125 2023-06-19 00:56:54,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=345450.0, ans=0.0 2023-06-19 00:57:40,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=345510.0, ans=0.0 2023-06-19 00:57:43,850 INFO [train.py:996] (1/4) Epoch 2, batch 27100, loss[loss=0.3056, simple_loss=0.3699, pruned_loss=0.1207, over 21739.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3484, pruned_loss=0.113, over 4273292.95 frames. ], batch size: 389, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:57:48,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345570.0, ans=0.1 2023-06-19 00:57:58,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=345570.0, ans=0.0 2023-06-19 00:59:20,174 INFO [train.py:996] (1/4) Epoch 2, batch 27150, loss[loss=0.3353, simple_loss=0.41, pruned_loss=0.1303, over 21749.00 frames. 
], tot_loss[loss=0.2974, simple_loss=0.3616, pruned_loss=0.1166, over 4269401.49 frames. ], batch size: 351, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 00:59:28,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=345870.0, ans=0.125 2023-06-19 00:59:35,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=345930.0, ans=0.0 2023-06-19 00:59:36,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.217e+02 5.291e+02 1.062e+03, threshold=8.433e+02, percent-clipped=9.0 2023-06-19 01:00:04,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=345990.0, ans=0.125 2023-06-19 01:00:55,724 INFO [train.py:996] (1/4) Epoch 2, batch 27200, loss[loss=0.3771, simple_loss=0.4243, pruned_loss=0.165, over 21648.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.368, pruned_loss=0.1185, over 4276801.72 frames. ], batch size: 389, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:00:59,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=346170.0, ans=0.0 2023-06-19 01:01:04,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=346170.0, ans=0.125 2023-06-19 01:02:02,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=346290.0, ans=0.0 2023-06-19 01:02:13,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=346350.0, ans=0.1 2023-06-19 01:02:19,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=346350.0, ans=0.0 2023-06-19 01:02:21,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=346410.0, ans=0.125 2023-06-19 01:02:23,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=346410.0, ans=0.0 2023-06-19 01:02:39,310 INFO [train.py:996] (1/4) Epoch 2, batch 27250, loss[loss=0.3396, simple_loss=0.3956, pruned_loss=0.1418, over 21802.00 frames. ], tot_loss[loss=0.3087, simple_loss=0.3707, pruned_loss=0.1234, over 4272224.39 frames. ], batch size: 124, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:02:59,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.070e+02 3.624e+02 4.371e+02 7.633e+02, threshold=7.247e+02, percent-clipped=0.0 2023-06-19 01:03:17,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=346530.0, ans=0.125 2023-06-19 01:03:35,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=346590.0, ans=0.2 2023-06-19 01:03:41,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-19 01:04:12,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=346710.0, ans=0.0 2023-06-19 01:04:21,665 INFO [train.py:996] (1/4) Epoch 2, batch 27300, loss[loss=0.2845, simple_loss=0.3689, pruned_loss=0.1001, over 20737.00 frames. 
], tot_loss[loss=0.3147, simple_loss=0.3765, pruned_loss=0.1265, over 4275973.16 frames. ], batch size: 607, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:04:40,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=346770.0, ans=0.0 2023-06-19 01:06:09,251 INFO [train.py:996] (1/4) Epoch 2, batch 27350, loss[loss=0.314, simple_loss=0.3894, pruned_loss=0.1193, over 20797.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3781, pruned_loss=0.1274, over 4273598.38 frames. ], batch size: 607, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:06:25,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-19 01:06:35,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 3.372e+02 3.927e+02 4.716e+02 8.245e+02, threshold=7.854e+02, percent-clipped=1.0 2023-06-19 01:06:42,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-19 01:06:49,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=347190.0, ans=0.125 2023-06-19 01:06:51,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=347190.0, ans=0.125 2023-06-19 01:07:41,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=347310.0, ans=0.125 2023-06-19 01:07:48,532 INFO [train.py:996] (1/4) Epoch 2, batch 27400, loss[loss=0.2941, simple_loss=0.3376, pruned_loss=0.1253, over 21273.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3733, pruned_loss=0.127, over 4278509.61 frames. ], batch size: 176, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:08:05,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=347370.0, ans=0.0 2023-06-19 01:08:33,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=347490.0, ans=0.125 2023-06-19 01:08:36,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-19 01:08:40,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-19 01:09:06,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=22.5 2023-06-19 01:09:10,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=347610.0, ans=0.0 2023-06-19 01:09:29,478 INFO [train.py:996] (1/4) Epoch 2, batch 27450, loss[loss=0.2998, simple_loss=0.3656, pruned_loss=0.117, over 21914.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3672, pruned_loss=0.1252, over 4272690.50 frames. 
], batch size: 317, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:09:45,218 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.415e+02 3.960e+02 5.232e+02 1.053e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-19 01:09:45,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=347730.0, ans=0.125 2023-06-19 01:10:40,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-19 01:10:45,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=347910.0, ans=0.2 2023-06-19 01:11:08,110 INFO [train.py:996] (1/4) Epoch 2, batch 27500, loss[loss=0.3198, simple_loss=0.3797, pruned_loss=0.13, over 21772.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3663, pruned_loss=0.1261, over 4282799.02 frames. ], batch size: 112, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:11:08,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=347970.0, ans=0.125 2023-06-19 01:11:32,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=348030.0, ans=0.04949747468305833 2023-06-19 01:12:00,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=348150.0, ans=0.2 2023-06-19 01:12:04,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=348150.0, ans=0.0 2023-06-19 01:12:47,547 INFO [train.py:996] (1/4) Epoch 2, batch 27550, loss[loss=0.238, simple_loss=0.3069, pruned_loss=0.08457, over 21752.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3599, pruned_loss=0.1217, over 4272545.43 frames. ], batch size: 316, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 01:12:48,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348270.0, ans=0.125 2023-06-19 01:12:55,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. 
limit=15.0 2023-06-19 01:13:00,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=348270.0, ans=0.09899494936611666 2023-06-19 01:13:05,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.210e+02 3.897e+02 4.749e+02 7.014e+02, threshold=7.795e+02, percent-clipped=0.0 2023-06-19 01:13:08,592 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:13:13,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=348330.0, ans=0.0 2023-06-19 01:13:17,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=348330.0, ans=0.125 2023-06-19 01:13:48,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=348450.0, ans=0.2 2023-06-19 01:13:55,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=348510.0, ans=0.125 2023-06-19 01:14:24,901 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:14:29,346 INFO [train.py:996] (1/4) Epoch 2, batch 27600, loss[loss=0.2818, simple_loss=0.3345, pruned_loss=0.1145, over 15018.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3525, pruned_loss=0.1204, over 4268431.61 frames. ], batch size: 60, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:15:14,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=348690.0, ans=0.0 2023-06-19 01:15:20,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=348750.0, ans=0.125 2023-06-19 01:15:51,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-19 01:16:03,369 INFO [train.py:996] (1/4) Epoch 2, batch 27650, loss[loss=0.2637, simple_loss=0.3341, pruned_loss=0.0967, over 21721.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3461, pruned_loss=0.1191, over 4265210.55 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:16:05,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=348870.0, ans=0.125 2023-06-19 01:16:25,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.734e+02 3.535e+02 4.526e+02 5.756e+02 1.207e+03, threshold=9.051e+02, percent-clipped=8.0 2023-06-19 01:16:37,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=348930.0, ans=0.125 2023-06-19 01:17:00,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=349050.0, ans=0.09899494936611666 2023-06-19 01:17:37,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=349110.0, ans=0.0 2023-06-19 01:17:38,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=349110.0, ans=0.0 2023-06-19 01:17:48,108 INFO [train.py:996] (1/4) Epoch 2, batch 27700, loss[loss=0.3009, simple_loss=0.3612, pruned_loss=0.1203, over 21634.00 frames. 
], tot_loss[loss=0.2895, simple_loss=0.3457, pruned_loss=0.1166, over 4270054.42 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:18:06,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=349230.0, ans=0.05 2023-06-19 01:18:23,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=349290.0, ans=0.0 2023-06-19 01:18:36,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-19 01:18:59,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=349350.0, ans=0.125 2023-06-19 01:19:27,716 INFO [train.py:996] (1/4) Epoch 2, batch 27750, loss[loss=0.255, simple_loss=0.3368, pruned_loss=0.08664, over 21692.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3489, pruned_loss=0.1151, over 4268311.56 frames. ], batch size: 298, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:19:43,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-19 01:19:45,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.361e+02 3.964e+02 5.094e+02 9.268e+02, threshold=7.928e+02, percent-clipped=1.0 2023-06-19 01:20:47,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=349710.0, ans=0.0 2023-06-19 01:21:06,408 INFO [train.py:996] (1/4) Epoch 2, batch 27800, loss[loss=0.3096, simple_loss=0.3531, pruned_loss=0.133, over 21472.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3482, pruned_loss=0.1158, over 4281371.40 frames. ], batch size: 194, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:21:33,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=349830.0, ans=0.1 2023-06-19 01:21:37,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.55 vs. limit=22.5 2023-06-19 01:22:47,538 INFO [train.py:996] (1/4) Epoch 2, batch 27850, loss[loss=0.2943, simple_loss=0.3451, pruned_loss=0.1218, over 21792.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.349, pruned_loss=0.1182, over 4289610.32 frames. ], batch size: 441, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:22:56,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-19 01:23:01,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=350070.0, ans=0.125 2023-06-19 01:23:06,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.389e+02 4.361e+02 6.049e+02 9.596e+02, threshold=8.723e+02, percent-clipped=7.0 2023-06-19 01:23:28,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-19 01:24:23,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. 
limit=6.0 2023-06-19 01:24:30,942 INFO [train.py:996] (1/4) Epoch 2, batch 27900, loss[loss=0.3832, simple_loss=0.479, pruned_loss=0.1437, over 20811.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3588, pruned_loss=0.1195, over 4287052.38 frames. ], batch size: 607, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:24:33,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-19 01:24:46,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=350430.0, ans=0.125 2023-06-19 01:25:09,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=350490.0, ans=0.125 2023-06-19 01:26:10,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=350610.0, ans=15.0 2023-06-19 01:26:13,829 INFO [train.py:996] (1/4) Epoch 2, batch 27950, loss[loss=0.2699, simple_loss=0.3427, pruned_loss=0.09859, over 21640.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3567, pruned_loss=0.1146, over 4280593.22 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:26:22,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=350670.0, ans=0.1 2023-06-19 01:26:32,394 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.236e+02 3.896e+02 4.908e+02 8.483e+02, threshold=7.791e+02, percent-clipped=0.0 2023-06-19 01:26:45,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=350730.0, ans=0.0 2023-06-19 01:27:18,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-19 01:27:53,645 INFO [train.py:996] (1/4) Epoch 2, batch 28000, loss[loss=0.3083, simple_loss=0.3598, pruned_loss=0.1283, over 21845.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3543, pruned_loss=0.1119, over 4287654.95 frames. ], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:28:33,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=351090.0, ans=0.0 2023-06-19 01:29:06,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-19 01:29:35,431 INFO [train.py:996] (1/4) Epoch 2, batch 28050, loss[loss=0.2854, simple_loss=0.3247, pruned_loss=0.1231, over 21800.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3527, pruned_loss=0.1141, over 4294620.72 frames. 
], batch size: 118, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:29:57,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.799e+02 3.165e+02 3.817e+02 7.021e+02, threshold=6.330e+02, percent-clipped=0.0 2023-06-19 01:29:59,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=351330.0, ans=0.125 2023-06-19 01:30:54,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=351450.0, ans=0.0 2023-06-19 01:31:06,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=351510.0, ans=0.125 2023-06-19 01:31:15,427 INFO [train.py:996] (1/4) Epoch 2, batch 28100, loss[loss=0.2993, simple_loss=0.3797, pruned_loss=0.1094, over 20800.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3527, pruned_loss=0.1152, over 4290079.20 frames. ], batch size: 608, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:32:37,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=351810.0, ans=0.125 2023-06-19 01:32:41,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351810.0, ans=0.1 2023-06-19 01:32:42,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=15.0 2023-06-19 01:32:54,279 INFO [train.py:996] (1/4) Epoch 2, batch 28150, loss[loss=0.267, simple_loss=0.3073, pruned_loss=0.1134, over 21464.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3451, pruned_loss=0.1153, over 4279930.46 frames. ], batch size: 212, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:33:11,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.356e+02 3.949e+02 5.361e+02 1.113e+03, threshold=7.898e+02, percent-clipped=11.0 2023-06-19 01:33:12,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=351930.0, ans=0.125 2023-06-19 01:33:13,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351930.0, ans=0.1 2023-06-19 01:33:21,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=351930.0, ans=0.125 2023-06-19 01:33:21,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=351930.0, ans=0.1 2023-06-19 01:33:39,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=351990.0, ans=0.0 2023-06-19 01:34:29,888 INFO [train.py:996] (1/4) Epoch 2, batch 28200, loss[loss=0.3286, simple_loss=0.3757, pruned_loss=0.1408, over 21689.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3464, pruned_loss=0.117, over 4267257.70 frames. 
], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:34:41,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=352170.0, ans=0.125 2023-06-19 01:35:17,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=352290.0, ans=0.125 2023-06-19 01:36:10,823 INFO [train.py:996] (1/4) Epoch 2, batch 28250, loss[loss=0.3641, simple_loss=0.4688, pruned_loss=0.1297, over 19708.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3501, pruned_loss=0.1203, over 4264434.82 frames. ], batch size: 702, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:36:16,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.77 vs. limit=22.5 2023-06-19 01:36:18,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-19 01:36:38,453 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.660e+02 4.283e+02 5.277e+02 9.711e+02, threshold=8.566e+02, percent-clipped=2.0 2023-06-19 01:37:39,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=352710.0, ans=0.125 2023-06-19 01:37:51,576 INFO [train.py:996] (1/4) Epoch 2, batch 28300, loss[loss=0.2057, simple_loss=0.2902, pruned_loss=0.06057, over 21329.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3475, pruned_loss=0.1174, over 4267208.52 frames. ], batch size: 194, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:38:18,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=352770.0, ans=0.1 2023-06-19 01:39:44,099 INFO [train.py:996] (1/4) Epoch 2, batch 28350, loss[loss=0.3144, simple_loss=0.35, pruned_loss=0.1394, over 21320.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3422, pruned_loss=0.1096, over 4266043.27 frames. ], batch size: 507, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:40:07,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-19 01:40:07,599 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.865e+02 3.652e+02 5.364e+02 1.153e+03, threshold=7.304e+02, percent-clipped=2.0 2023-06-19 01:40:12,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=353130.0, ans=0.125 2023-06-19 01:40:52,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=353250.0, ans=0.125 2023-06-19 01:40:54,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=353250.0, ans=0.125 2023-06-19 01:41:30,006 INFO [train.py:996] (1/4) Epoch 2, batch 28400, loss[loss=0.2837, simple_loss=0.3929, pruned_loss=0.08722, over 19814.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3388, pruned_loss=0.1101, over 4263626.09 frames. 
], batch size: 704, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:42:38,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=353610.0, ans=0.0 2023-06-19 01:43:09,879 INFO [train.py:996] (1/4) Epoch 2, batch 28450, loss[loss=0.372, simple_loss=0.395, pruned_loss=0.1745, over 21683.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3476, pruned_loss=0.1164, over 4264147.80 frames. ], batch size: 508, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:43:12,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-19 01:43:20,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=353670.0, ans=0.0 2023-06-19 01:43:27,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.368e+02 4.115e+02 5.202e+02 1.060e+03, threshold=8.231e+02, percent-clipped=7.0 2023-06-19 01:44:46,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353910.0, ans=0.1 2023-06-19 01:44:50,993 INFO [train.py:996] (1/4) Epoch 2, batch 28500, loss[loss=0.3082, simple_loss=0.3621, pruned_loss=0.1272, over 21349.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3513, pruned_loss=0.1203, over 4274841.65 frames. ], batch size: 159, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:44:58,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=353970.0, ans=0.0 2023-06-19 01:45:44,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=354150.0, ans=0.125 2023-06-19 01:46:15,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=354150.0, ans=0.125 2023-06-19 01:46:20,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=354210.0, ans=0.125 2023-06-19 01:46:34,615 INFO [train.py:996] (1/4) Epoch 2, batch 28550, loss[loss=0.2885, simple_loss=0.3475, pruned_loss=0.1148, over 20762.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.361, pruned_loss=0.1246, over 4278127.10 frames. ], batch size: 607, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:46:38,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=354270.0, ans=0.0 2023-06-19 01:46:51,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=354330.0, ans=0.125 2023-06-19 01:46:52,896 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.021e+02 3.809e+02 4.877e+02 1.502e+03, threshold=7.617e+02, percent-clipped=6.0 2023-06-19 01:48:17,806 INFO [train.py:996] (1/4) Epoch 2, batch 28600, loss[loss=0.3791, simple_loss=0.419, pruned_loss=0.1696, over 21800.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3687, pruned_loss=0.1269, over 4282096.32 frames. 
], batch size: 441, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:48:18,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=354570.0, ans=0.2 2023-06-19 01:48:31,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=354570.0, ans=0.0 2023-06-19 01:48:51,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-19 01:49:18,733 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:49:58,540 INFO [train.py:996] (1/4) Epoch 2, batch 28650, loss[loss=0.2361, simple_loss=0.2891, pruned_loss=0.09161, over 21151.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3605, pruned_loss=0.1248, over 4284280.95 frames. ], batch size: 159, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:50:15,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-19 01:50:21,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.376e+02 3.990e+02 4.916e+02 8.510e+02, threshold=7.981e+02, percent-clipped=3.0 2023-06-19 01:50:41,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=354930.0, ans=0.2 2023-06-19 01:50:51,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-19 01:51:14,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=355050.0, ans=0.125 2023-06-19 01:51:20,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-19 01:51:29,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=355110.0, ans=0.125 2023-06-19 01:51:38,415 INFO [train.py:996] (1/4) Epoch 2, batch 28700, loss[loss=0.25, simple_loss=0.2876, pruned_loss=0.1062, over 20109.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3567, pruned_loss=0.1241, over 4276698.76 frames. ], batch size: 704, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:52:00,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=355230.0, ans=0.2 2023-06-19 01:52:04,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=355230.0, ans=0.125 2023-06-19 01:53:18,219 INFO [train.py:996] (1/4) Epoch 2, batch 28750, loss[loss=0.2995, simple_loss=0.378, pruned_loss=0.1105, over 21740.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3564, pruned_loss=0.125, over 4282602.07 frames. ], batch size: 414, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:53:46,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.065e+02 3.653e+02 4.286e+02 6.736e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 01:54:14,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.42 vs. 
limit=6.0 2023-06-19 01:54:36,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=355710.0, ans=0.125 2023-06-19 01:54:54,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=355710.0, ans=0.0 2023-06-19 01:54:58,828 INFO [train.py:996] (1/4) Epoch 2, batch 28800, loss[loss=0.3768, simple_loss=0.4109, pruned_loss=0.1714, over 21354.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3601, pruned_loss=0.1253, over 4276740.05 frames. ], batch size: 507, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:56:01,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=355950.0, ans=0.125 2023-06-19 01:56:25,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=356010.0, ans=0.2 2023-06-19 01:56:40,221 INFO [train.py:996] (1/4) Epoch 2, batch 28850, loss[loss=0.308, simple_loss=0.3618, pruned_loss=0.127, over 21683.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3615, pruned_loss=0.1271, over 4285678.28 frames. ], batch size: 389, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:57:01,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=356070.0, ans=0.125 2023-06-19 01:57:08,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=356070.0, ans=0.1 2023-06-19 01:57:09,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=356130.0, ans=0.125 2023-06-19 01:57:10,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=356130.0, ans=10.0 2023-06-19 01:57:12,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.103e+02 3.762e+02 4.499e+02 8.286e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 01:57:19,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=356130.0, ans=0.125 2023-06-19 01:57:50,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=356250.0, ans=0.125 2023-06-19 01:58:26,581 INFO [train.py:996] (1/4) Epoch 2, batch 28900, loss[loss=0.3413, simple_loss=0.3941, pruned_loss=0.1442, over 21331.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3645, pruned_loss=0.1294, over 4289101.25 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:58:41,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-19 01:59:03,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=356430.0, ans=0.125 2023-06-19 01:59:38,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356550.0, ans=0.1 2023-06-19 01:59:54,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.30 vs. 
limit=15.0 2023-06-19 02:00:05,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-19 02:00:19,027 INFO [train.py:996] (1/4) Epoch 2, batch 28950, loss[loss=0.219, simple_loss=0.277, pruned_loss=0.08044, over 21398.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3651, pruned_loss=0.1281, over 4281482.54 frames. ], batch size: 131, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:00:26,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=356670.0, ans=0.125 2023-06-19 02:00:28,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-19 02:00:37,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.281e+02 4.028e+02 5.318e+02 1.006e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-19 02:01:10,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=356790.0, ans=0.2 2023-06-19 02:01:10,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=356790.0, ans=0.2 2023-06-19 02:01:57,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=356910.0, ans=0.0 2023-06-19 02:02:00,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356970.0, ans=0.125 2023-06-19 02:02:01,654 INFO [train.py:996] (1/4) Epoch 2, batch 29000, loss[loss=0.3573, simple_loss=0.4147, pruned_loss=0.15, over 17738.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3694, pruned_loss=0.1273, over 4276313.77 frames. ], batch size: 60, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:03:31,570 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.535e-03 2023-06-19 02:03:36,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=357210.0, ans=0.125 2023-06-19 02:03:44,477 INFO [train.py:996] (1/4) Epoch 2, batch 29050, loss[loss=0.2938, simple_loss=0.3401, pruned_loss=0.1237, over 21235.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.368, pruned_loss=0.1283, over 4285881.97 frames. ], batch size: 159, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:04:02,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.182e+02 3.653e+02 4.375e+02 6.472e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 02:04:04,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=357330.0, ans=0.125 2023-06-19 02:04:22,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=357390.0, ans=0.125 2023-06-19 02:05:25,352 INFO [train.py:996] (1/4) Epoch 2, batch 29100, loss[loss=0.2563, simple_loss=0.3002, pruned_loss=0.1061, over 21781.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3581, pruned_loss=0.1247, over 4285070.16 frames. 
], batch size: 102, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:05:27,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=357570.0, ans=0.025 2023-06-19 02:05:32,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=357570.0, ans=0.125 2023-06-19 02:05:42,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-06-19 02:06:04,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=357690.0, ans=0.0 2023-06-19 02:07:04,286 INFO [train.py:996] (1/4) Epoch 2, batch 29150, loss[loss=0.3051, simple_loss=0.3592, pruned_loss=0.1255, over 21176.00 frames. ], tot_loss[loss=0.301, simple_loss=0.3573, pruned_loss=0.1223, over 4274666.73 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:07:06,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=357870.0, ans=0.1 2023-06-19 02:07:14,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=357870.0, ans=0.125 2023-06-19 02:07:21,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 3.617e+02 4.258e+02 5.180e+02 9.047e+02, threshold=8.516e+02, percent-clipped=9.0 2023-06-19 02:07:37,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-19 02:08:10,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-19 02:08:44,195 INFO [train.py:996] (1/4) Epoch 2, batch 29200, loss[loss=0.2697, simple_loss=0.317, pruned_loss=0.1112, over 21644.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.353, pruned_loss=0.1213, over 4270767.00 frames. ], batch size: 298, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:09:34,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-19 02:09:41,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=358290.0, ans=0.0 2023-06-19 02:09:41,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358290.0, ans=0.1 2023-06-19 02:09:42,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=358290.0, ans=0.0 2023-06-19 02:09:50,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=358350.0, ans=0.0 2023-06-19 02:10:06,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=358410.0, ans=0.125 2023-06-19 02:10:23,990 INFO [train.py:996] (1/4) Epoch 2, batch 29250, loss[loss=0.2494, simple_loss=0.3316, pruned_loss=0.08354, over 21637.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3475, pruned_loss=0.1165, over 4271610.06 frames. 
], batch size: 263, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:10:46,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.694e+02 3.473e+02 5.021e+02 8.866e+02, threshold=6.946e+02, percent-clipped=1.0 2023-06-19 02:11:01,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=358530.0, ans=0.04949747468305833 2023-06-19 02:11:56,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=358710.0, ans=0.2 2023-06-19 02:12:04,056 INFO [train.py:996] (1/4) Epoch 2, batch 29300, loss[loss=0.2972, simple_loss=0.3349, pruned_loss=0.1297, over 20179.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3508, pruned_loss=0.1165, over 4274591.34 frames. ], batch size: 703, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:13:19,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=358950.0, ans=0.04949747468305833 2023-06-19 02:13:27,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=359010.0, ans=0.0 2023-06-19 02:13:44,859 INFO [train.py:996] (1/4) Epoch 2, batch 29350, loss[loss=0.2593, simple_loss=0.3151, pruned_loss=0.1018, over 21808.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.347, pruned_loss=0.116, over 4274302.69 frames. ], batch size: 118, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:14:13,782 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 2.964e+02 3.404e+02 4.114e+02 7.296e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-19 02:14:14,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=359130.0, ans=0.2 2023-06-19 02:14:22,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=359130.0, ans=0.125 2023-06-19 02:15:11,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=359310.0, ans=0.0 2023-06-19 02:15:14,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359310.0, ans=0.1 2023-06-19 02:15:14,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=359310.0, ans=0.2 2023-06-19 02:15:21,989 INFO [train.py:996] (1/4) Epoch 2, batch 29400, loss[loss=0.2571, simple_loss=0.3231, pruned_loss=0.09553, over 21707.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3443, pruned_loss=0.1117, over 4258222.76 frames. ], batch size: 332, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:15:37,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=359370.0, ans=0.05 2023-06-19 02:17:03,073 INFO [train.py:996] (1/4) Epoch 2, batch 29450, loss[loss=0.3612, simple_loss=0.4145, pruned_loss=0.1539, over 21410.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3411, pruned_loss=0.1104, over 4256206.87 frames. ], batch size: 131, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:17:16,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=359670.0, ans=0.125 2023-06-19 02:17:23,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-06-19 02:17:25,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.078e+02 3.690e+02 4.615e+02 7.103e+02, threshold=7.380e+02, percent-clipped=1.0 2023-06-19 02:18:18,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=359910.0, ans=0.1 2023-06-19 02:18:37,417 INFO [train.py:996] (1/4) Epoch 2, batch 29500, loss[loss=0.308, simple_loss=0.3582, pruned_loss=0.1289, over 21849.00 frames. ], tot_loss[loss=0.291, simple_loss=0.35, pruned_loss=0.116, over 4259929.92 frames. ], batch size: 391, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:18:54,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=359970.0, ans=0.07 2023-06-19 02:19:03,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=360030.0, ans=0.125 2023-06-19 02:19:17,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=360030.0, ans=0.125 2023-06-19 02:19:28,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=360090.0, ans=0.125 2023-06-19 02:19:32,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=360090.0, ans=0.0 2023-06-19 02:19:54,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=360210.0, ans=0.1 2023-06-19 02:20:08,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=360210.0, ans=0.125 2023-06-19 02:20:17,260 INFO [train.py:996] (1/4) Epoch 2, batch 29550, loss[loss=0.2653, simple_loss=0.3214, pruned_loss=0.1046, over 21842.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3498, pruned_loss=0.1185, over 4275782.10 frames. ], batch size: 282, lr: 1.45e-02, grad_scale: 64.0 2023-06-19 02:20:49,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.148e+02 3.536e+02 4.853e+02 9.360e+02, threshold=7.072e+02, percent-clipped=2.0 2023-06-19 02:21:11,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=360390.0, ans=0.125 2023-06-19 02:21:21,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=360450.0, ans=0.125 2023-06-19 02:21:31,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-19 02:21:40,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=360510.0, ans=0.0 2023-06-19 02:22:10,003 INFO [train.py:996] (1/4) Epoch 2, batch 29600, loss[loss=0.3945, simple_loss=0.449, pruned_loss=0.17, over 21713.00 frames. ], tot_loss[loss=0.3018, simple_loss=0.3582, pruned_loss=0.1227, over 4280714.02 frames. ], batch size: 414, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 02:22:14,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. 
limit=15.0 2023-06-19 02:22:39,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=360630.0, ans=0.125 2023-06-19 02:22:43,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=360630.0, ans=0.125 2023-06-19 02:22:45,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 02:22:57,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=360690.0, ans=0.125 2023-06-19 02:23:10,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=360750.0, ans=0.125 2023-06-19 02:23:15,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=360750.0, ans=0.0 2023-06-19 02:23:34,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.56 vs. limit=6.0 2023-06-19 02:23:43,622 INFO [train.py:996] (1/4) Epoch 2, batch 29650, loss[loss=0.3617, simple_loss=0.388, pruned_loss=0.1676, over 21715.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3555, pruned_loss=0.1176, over 4276885.20 frames. ], batch size: 508, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:24:07,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.035e+02 3.587e+02 4.924e+02 8.544e+02, threshold=7.175e+02, percent-clipped=8.0 2023-06-19 02:24:07,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=360930.0, ans=0.125 2023-06-19 02:24:13,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-19 02:24:27,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=360990.0, ans=0.125 2023-06-19 02:25:09,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=361110.0, ans=0.0 2023-06-19 02:25:23,743 INFO [train.py:996] (1/4) Epoch 2, batch 29700, loss[loss=0.2798, simple_loss=0.3589, pruned_loss=0.1004, over 21191.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3564, pruned_loss=0.1176, over 4277981.28 frames. ], batch size: 143, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:26:57,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=361410.0, ans=0.1 2023-06-19 02:27:03,531 INFO [train.py:996] (1/4) Epoch 2, batch 29750, loss[loss=0.2872, simple_loss=0.3684, pruned_loss=0.103, over 21882.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.36, pruned_loss=0.1168, over 4280684.83 frames. 
], batch size: 316, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:27:09,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=361470.0, ans=0.125 2023-06-19 02:27:27,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 3.188e+02 3.972e+02 5.342e+02 1.059e+03, threshold=7.944e+02, percent-clipped=5.0 2023-06-19 02:28:42,369 INFO [train.py:996] (1/4) Epoch 2, batch 29800, loss[loss=0.2438, simple_loss=0.3159, pruned_loss=0.08581, over 21414.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3615, pruned_loss=0.118, over 4291620.57 frames. ], batch size: 194, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:28:55,841 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:30:06,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-19 02:30:19,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=362010.0, ans=0.125 2023-06-19 02:30:22,308 INFO [train.py:996] (1/4) Epoch 2, batch 29850, loss[loss=0.2792, simple_loss=0.3405, pruned_loss=0.109, over 21458.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3554, pruned_loss=0.1152, over 4284353.27 frames. ], batch size: 131, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:30:37,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=362070.0, ans=0.1 2023-06-19 02:30:46,099 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.912e+02 3.664e+02 4.469e+02 7.842e+02, threshold=7.327e+02, percent-clipped=0.0 2023-06-19 02:30:53,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=22.5 2023-06-19 02:31:00,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=362190.0, ans=0.0 2023-06-19 02:31:08,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=362190.0, ans=0.125 2023-06-19 02:31:12,210 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.722e-03 2023-06-19 02:31:14,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.50 vs. limit=10.0 2023-06-19 02:31:55,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=362310.0, ans=0.0 2023-06-19 02:32:06,364 INFO [train.py:996] (1/4) Epoch 2, batch 29900, loss[loss=0.3234, simple_loss=0.3692, pruned_loss=0.1388, over 21661.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3544, pruned_loss=0.1161, over 4287224.56 frames. 
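[editor's note] The tot_loss records in this excerpt are consistent with the total being a weighted sum of the two components, loss ≈ 0.5 · simple_loss + pruned_loss: for the record just above, 0.5 × 0.3544 + 0.1161 = 0.2933, and the batch-30000 training and validation records further down give 0.5 × 0.3606 + 0.1212 = 0.3015 and 0.5 × 0.3684 + 0.08513 ≈ 0.2693. The 0.5 weight here is inferred from the logged numbers, not read from the training code; a small check of that relationship:

```python
# Illustrative check that the logged totals match 0.5 * simple_loss + pruned_loss.
records = [
    # (loss, simple_loss, pruned_loss) taken from tot_loss entries in this log
    (0.2933, 0.3544, 0.1161),   # epoch 2, batch 29900
    (0.3015, 0.3606, 0.1212),   # epoch 2, batch 30000
    (0.2693, 0.3684, 0.08513),  # epoch 2, batch 30000 validation
]

SIMPLE_LOSS_WEIGHT = 0.5  # inferred from the numbers above, not from the recipe itself

for loss, simple_loss, pruned_loss in records:
    reconstructed = SIMPLE_LOSS_WEIGHT * simple_loss + pruned_loss
    print(f"logged={loss:.4f}  reconstructed={reconstructed:.4f}")
    assert abs(loss - reconstructed) < 5e-4
```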
], batch size: 263, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:32:17,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=362370.0, ans=0.1 2023-06-19 02:33:11,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=362550.0, ans=0.125 2023-06-19 02:33:49,213 INFO [train.py:996] (1/4) Epoch 2, batch 29950, loss[loss=0.3252, simple_loss=0.3519, pruned_loss=0.1492, over 20135.00 frames. ], tot_loss[loss=0.2999, simple_loss=0.3577, pruned_loss=0.121, over 4285487.70 frames. ], batch size: 702, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:33:59,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=362670.0, ans=0.125 2023-06-19 02:34:09,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.188e+02 4.013e+02 5.057e+02 1.029e+03, threshold=8.025e+02, percent-clipped=2.0 2023-06-19 02:34:48,581 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:35:00,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-19 02:35:19,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.50 vs. limit=22.5 2023-06-19 02:35:22,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=362910.0, ans=0.0 2023-06-19 02:35:29,983 INFO [train.py:996] (1/4) Epoch 2, batch 30000, loss[loss=0.275, simple_loss=0.3629, pruned_loss=0.09354, over 21679.00 frames. ], tot_loss[loss=0.3015, simple_loss=0.3606, pruned_loss=0.1212, over 4290378.52 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:35:29,984 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 02:35:45,235 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.4621, 1.0761, 1.7977, 1.2748, 1.1888, 1.8099, 1.7913, 0.9944], device='cuda:1') 2023-06-19 02:35:47,463 INFO [train.py:1028] (1/4) Epoch 2, validation: loss=0.2693, simple_loss=0.3684, pruned_loss=0.08513, over 1796401.00 frames. 2023-06-19 02:35:47,464 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 02:36:41,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363090.0, ans=0.125 2023-06-19 02:36:43,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=363090.0, ans=0.035 2023-06-19 02:37:30,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=363210.0, ans=0.035 2023-06-19 02:37:36,948 INFO [train.py:996] (1/4) Epoch 2, batch 30050, loss[loss=0.2505, simple_loss=0.3585, pruned_loss=0.07125, over 19769.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.3628, pruned_loss=0.1169, over 4279161.02 frames. 
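[editor's note] The "Maximum memory allocated so far is 24415MB" line emitted after each validation pass reports the kind of figure PyTorch exposes via torch.cuda.max_memory_allocated(). A one-line sketch of producing such a message is below; the helper name, logger setup, and MB rounding are assumptions, not the train.py code.

```python
import logging

import torch

logging.basicConfig(level=logging.INFO)


def log_peak_cuda_memory(device: torch.device) -> None:
    # max_memory_allocated() reports the peak allocated bytes on the device
    # since program start (or the last reset_peak_memory_stats call).
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {peak_mb}MB")


# Example (assumes a CUDA device is available):
# log_peak_cuda_memory(torch.device("cuda:1"))
```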
], batch size: 702, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:37:40,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=363270.0, ans=0.125 2023-06-19 02:37:59,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-19 02:38:06,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.812e+02 3.422e+02 4.683e+02 8.613e+02, threshold=6.845e+02, percent-clipped=2.0 2023-06-19 02:38:24,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-19 02:38:49,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=363450.0, ans=0.125 2023-06-19 02:39:04,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 02:39:07,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=363510.0, ans=0.0 2023-06-19 02:39:15,581 INFO [train.py:996] (1/4) Epoch 2, batch 30100, loss[loss=0.2674, simple_loss=0.3181, pruned_loss=0.1083, over 21643.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3622, pruned_loss=0.1167, over 4270979.59 frames. ], batch size: 282, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:39:34,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-19 02:39:45,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-19 02:39:46,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=363630.0, ans=0.125 2023-06-19 02:39:48,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=363630.0, ans=0.025 2023-06-19 02:40:44,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=363810.0, ans=0.125 2023-06-19 02:41:02,498 INFO [train.py:996] (1/4) Epoch 2, batch 30150, loss[loss=0.3776, simple_loss=0.411, pruned_loss=0.1721, over 21604.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3593, pruned_loss=0.1202, over 4270426.63 frames. ], batch size: 415, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:41:05,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.13 vs. 
limit=22.5 2023-06-19 02:41:17,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=363870.0, ans=0.125 2023-06-19 02:41:32,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.181e+02 3.774e+02 4.610e+02 8.129e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-19 02:42:14,161 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:42:25,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-19 02:42:47,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=364110.0, ans=0.015 2023-06-19 02:42:56,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-19 02:42:56,627 INFO [train.py:996] (1/4) Epoch 2, batch 30200, loss[loss=0.3099, simple_loss=0.3933, pruned_loss=0.1133, over 21314.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3604, pruned_loss=0.1177, over 4266815.28 frames. ], batch size: 549, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:43:00,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=364170.0, ans=0.125 2023-06-19 02:43:08,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=364170.0, ans=0.0 2023-06-19 02:43:14,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-19 02:43:28,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=364290.0, ans=0.0 2023-06-19 02:43:47,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=364290.0, ans=0.1 2023-06-19 02:44:11,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=364350.0, ans=0.05 2023-06-19 02:44:39,225 INFO [train.py:996] (1/4) Epoch 2, batch 30250, loss[loss=0.4128, simple_loss=0.4833, pruned_loss=0.1711, over 21599.00 frames. ], tot_loss[loss=0.3062, simple_loss=0.3693, pruned_loss=0.1216, over 4263727.95 frames. ], batch size: 414, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:44:41,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.15 vs. 
limit=10.0 2023-06-19 02:44:58,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 3.107e+02 3.710e+02 5.079e+02 9.516e+02, threshold=7.420e+02, percent-clipped=5.0 2023-06-19 02:45:05,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=364530.0, ans=0.125 2023-06-19 02:45:10,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=364530.0, ans=0.125 2023-06-19 02:45:17,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=364590.0, ans=0.125 2023-06-19 02:45:48,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=364650.0, ans=0.125 2023-06-19 02:46:19,606 INFO [train.py:996] (1/4) Epoch 2, batch 30300, loss[loss=0.2419, simple_loss=0.2952, pruned_loss=0.09433, over 21499.00 frames. ], tot_loss[loss=0.304, simple_loss=0.366, pruned_loss=0.121, over 4260790.61 frames. ], batch size: 230, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:46:37,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=22.5 2023-06-19 02:46:52,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=364830.0, ans=0.0 2023-06-19 02:47:21,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=364890.0, ans=0.0 2023-06-19 02:47:23,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=364890.0, ans=0.125 2023-06-19 02:47:24,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=364950.0, ans=0.125 2023-06-19 02:47:41,178 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:48:03,253 INFO [train.py:996] (1/4) Epoch 2, batch 30350, loss[loss=0.4407, simple_loss=0.4859, pruned_loss=0.1977, over 21453.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.366, pruned_loss=0.1228, over 4251641.77 frames. ], batch size: 471, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:48:21,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=365130.0, ans=0.1 2023-06-19 02:48:25,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.339e+02 3.934e+02 4.976e+02 9.196e+02, threshold=7.868e+02, percent-clipped=1.0 2023-06-19 02:48:49,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=365190.0, ans=0.035 2023-06-19 02:48:57,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-19 02:49:25,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-19 02:49:30,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-19 02:49:31,182 INFO [train.py:996] (1/4) Epoch 2, batch 30400, loss[loss=0.3158, simple_loss=0.3365, pruned_loss=0.1475, over 20308.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3604, pruned_loss=0.1205, over 4234190.59 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:50:31,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=365550.0, ans=10.0 2023-06-19 02:50:31,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-19 02:50:43,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=365610.0, ans=0.125 2023-06-19 02:50:50,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=365610.0, ans=0.95 2023-06-19 02:50:53,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=365610.0, ans=0.0 2023-06-19 02:50:56,367 INFO [train.py:996] (1/4) Epoch 2, batch 30450, loss[loss=0.3689, simple_loss=0.4781, pruned_loss=0.1299, over 19791.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3634, pruned_loss=0.1214, over 4180769.13 frames. ], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 02:50:56,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=365670.0, ans=0.09899494936611666 2023-06-19 02:51:01,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=365670.0, ans=0.1 2023-06-19 02:51:15,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 4.343e+02 5.750e+02 8.532e+02 2.294e+03, threshold=1.150e+03, percent-clipped=29.0 2023-06-19 02:51:33,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=365790.0, ans=0.025 2023-06-19 02:51:56,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=365850.0, ans=0.125 2023-06-19 02:53:41,474 INFO [train.py:996] (1/4) Epoch 3, batch 0, loss[loss=0.3149, simple_loss=0.3482, pruned_loss=0.1408, over 21779.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.3482, pruned_loss=0.1408, over 21779.00 frames. ], batch size: 102, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:53:41,474 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 02:53:57,713 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2735, simple_loss=0.3782, pruned_loss=0.08435, over 1796401.00 frames. 2023-06-19 02:53:57,714 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 02:54:24,352 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:54:29,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=15.0 2023-06-19 02:54:33,365 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:55:11,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=366174.0, ans=0.2 2023-06-19 02:55:18,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-19 02:55:36,568 INFO [train.py:996] (1/4) Epoch 3, batch 50, loss[loss=0.2996, simple_loss=0.3468, pruned_loss=0.1262, over 21643.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3745, pruned_loss=0.1267, over 965523.06 frames. ], batch size: 112, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:55:37,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=366234.0, ans=0.1 2023-06-19 02:56:10,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.611e+02 4.559e+02 6.599e+02 1.492e+03, threshold=9.117e+02, percent-clipped=9.0 2023-06-19 02:56:56,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-19 02:56:56,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=366474.0, ans=0.125 2023-06-19 02:57:09,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=366474.0, ans=0.0 2023-06-19 02:57:15,626 INFO [train.py:996] (1/4) Epoch 3, batch 100, loss[loss=0.4141, simple_loss=0.4529, pruned_loss=0.1877, over 21796.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3858, pruned_loss=0.1258, over 1702722.68 frames. ], batch size: 118, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:57:51,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=366654.0, ans=0.125 2023-06-19 02:58:06,682 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:58:26,575 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:58:30,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.69 vs. limit=15.0 2023-06-19 02:58:51,812 INFO [train.py:996] (1/4) Epoch 3, batch 150, loss[loss=0.2771, simple_loss=0.3486, pruned_loss=0.1028, over 21567.00 frames. ], tot_loss[loss=0.3149, simple_loss=0.384, pruned_loss=0.1229, over 2281529.38 frames. 
], batch size: 230, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:58:52,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=366834.0, ans=0.1 2023-06-19 02:58:55,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=366834.0, ans=0.0 2023-06-19 02:59:25,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.063e+02 3.532e+02 4.732e+02 9.517e+02, threshold=7.065e+02, percent-clipped=1.0 2023-06-19 03:00:05,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=367074.0, ans=0.0 2023-06-19 03:00:08,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=367074.0, ans=0.125 2023-06-19 03:00:30,801 INFO [train.py:996] (1/4) Epoch 3, batch 200, loss[loss=0.2816, simple_loss=0.3525, pruned_loss=0.1054, over 21737.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3781, pruned_loss=0.1212, over 2718686.80 frames. ], batch size: 298, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:00:51,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=367194.0, ans=0.125 2023-06-19 03:00:57,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=367194.0, ans=0.125 2023-06-19 03:01:06,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-19 03:01:56,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=367374.0, ans=0.1 2023-06-19 03:01:59,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=367374.0, ans=0.125 2023-06-19 03:02:09,300 INFO [train.py:996] (1/4) Epoch 3, batch 250, loss[loss=0.3612, simple_loss=0.3985, pruned_loss=0.162, over 21799.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3741, pruned_loss=0.1212, over 3061490.21 frames. ], batch size: 441, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:02:14,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-19 03:02:42,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.832e+02 3.615e+02 5.126e+02 8.493e+02, threshold=7.230e+02, percent-clipped=8.0 2023-06-19 03:02:51,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-19 03:03:49,699 INFO [train.py:996] (1/4) Epoch 3, batch 300, loss[loss=0.2957, simple_loss=0.3816, pruned_loss=0.1049, over 21792.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3691, pruned_loss=0.1211, over 3339430.78 frames. ], batch size: 332, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:03:51,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=367734.0, ans=0.125 2023-06-19 03:04:18,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. 
limit=22.5 2023-06-19 03:05:30,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-19 03:05:31,305 INFO [train.py:996] (1/4) Epoch 3, batch 350, loss[loss=0.2754, simple_loss=0.3258, pruned_loss=0.1126, over 21347.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.361, pruned_loss=0.1184, over 3546716.44 frames. ], batch size: 131, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:06:01,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=368094.0, ans=0.0 2023-06-19 03:06:02,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=368094.0, ans=0.0 2023-06-19 03:06:05,955 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.954e+02 3.445e+02 4.197e+02 6.448e+02, threshold=6.891e+02, percent-clipped=0.0 2023-06-19 03:06:13,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=368154.0, ans=0.125 2023-06-19 03:06:43,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=368214.0, ans=0.0 2023-06-19 03:06:51,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=368274.0, ans=0.2 2023-06-19 03:07:01,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-19 03:07:12,356 INFO [train.py:996] (1/4) Epoch 3, batch 400, loss[loss=0.3006, simple_loss=0.3303, pruned_loss=0.1355, over 21320.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3493, pruned_loss=0.115, over 3713168.03 frames. ], batch size: 473, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:07:20,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=368334.0, ans=0.125 2023-06-19 03:07:25,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368334.0, ans=0.1 2023-06-19 03:07:38,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=368394.0, ans=0.125 2023-06-19 03:07:45,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=368394.0, ans=0.125 2023-06-19 03:07:45,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=368394.0, ans=0.1 2023-06-19 03:08:22,518 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:08:53,079 INFO [train.py:996] (1/4) Epoch 3, batch 450, loss[loss=0.2842, simple_loss=0.3788, pruned_loss=0.09482, over 20896.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3443, pruned_loss=0.1121, over 3833906.64 frames. 
], batch size: 608, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:09:03,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=368634.0, ans=0.2 2023-06-19 03:09:03,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=368634.0, ans=0.5 2023-06-19 03:09:27,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.891e+02 3.614e+02 4.402e+02 7.378e+02, threshold=7.228e+02, percent-clipped=3.0 2023-06-19 03:09:58,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=368814.0, ans=0.2 2023-06-19 03:10:28,812 INFO [train.py:996] (1/4) Epoch 3, batch 500, loss[loss=0.1965, simple_loss=0.2696, pruned_loss=0.06174, over 21482.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3476, pruned_loss=0.1098, over 3938566.83 frames. ], batch size: 195, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:10:44,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=368934.0, ans=0.125 2023-06-19 03:10:55,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=368994.0, ans=0.125 2023-06-19 03:11:17,506 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:12:08,123 INFO [train.py:996] (1/4) Epoch 3, batch 550, loss[loss=0.438, simple_loss=0.499, pruned_loss=0.1885, over 21453.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3527, pruned_loss=0.11, over 3999795.07 frames. ], batch size: 507, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:12:46,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.117e+02 3.637e+02 4.984e+02 1.103e+03, threshold=7.274e+02, percent-clipped=1.0 2023-06-19 03:12:52,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=369354.0, ans=0.125 2023-06-19 03:13:09,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=369414.0, ans=0.0 2023-06-19 03:13:47,825 INFO [train.py:996] (1/4) Epoch 3, batch 600, loss[loss=0.2439, simple_loss=0.3066, pruned_loss=0.09059, over 21730.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3537, pruned_loss=0.1117, over 4060764.47 frames. ], batch size: 124, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:14:01,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=369534.0, ans=0.2 2023-06-19 03:14:11,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=369594.0, ans=0.2 2023-06-19 03:14:15,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=369594.0, ans=0.125 2023-06-19 03:15:18,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.02 vs. limit=22.5 2023-06-19 03:15:28,817 INFO [train.py:996] (1/4) Epoch 3, batch 650, loss[loss=0.3713, simple_loss=0.4415, pruned_loss=0.1506, over 21731.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3528, pruned_loss=0.1106, over 4099367.52 frames. 
], batch size: 414, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:15:44,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=369834.0, ans=0.125 2023-06-19 03:16:02,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.260e+02 4.172e+02 5.495e+02 8.347e+02, threshold=8.343e+02, percent-clipped=4.0 2023-06-19 03:16:25,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=370014.0, ans=0.0 2023-06-19 03:16:33,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=370014.0, ans=0.0 2023-06-19 03:17:03,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=370074.0, ans=0.125 2023-06-19 03:17:09,749 INFO [train.py:996] (1/4) Epoch 3, batch 700, loss[loss=0.2487, simple_loss=0.311, pruned_loss=0.09314, over 21883.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.353, pruned_loss=0.1129, over 4147356.98 frames. ], batch size: 124, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:17:10,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-19 03:17:28,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=370194.0, ans=0.2 2023-06-19 03:17:31,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=370194.0, ans=0.0 2023-06-19 03:17:56,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=370254.0, ans=0.125 2023-06-19 03:18:04,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=370254.0, ans=0.2 2023-06-19 03:18:08,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-19 03:18:33,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:18:43,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=370374.0, ans=0.2 2023-06-19 03:18:45,025 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:18:49,213 INFO [train.py:996] (1/4) Epoch 3, batch 750, loss[loss=0.269, simple_loss=0.3198, pruned_loss=0.1091, over 21602.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3543, pruned_loss=0.1133, over 4178947.84 frames. 
], batch size: 391, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:19:16,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=370494.0, ans=0.2 2023-06-19 03:19:28,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.022e+02 3.507e+02 4.070e+02 7.167e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-19 03:19:37,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=370554.0, ans=0.2 2023-06-19 03:19:55,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=370614.0, ans=0.0 2023-06-19 03:19:58,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-19 03:20:15,882 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:20:23,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370674.0, ans=0.1 2023-06-19 03:20:31,100 INFO [train.py:996] (1/4) Epoch 3, batch 800, loss[loss=0.2784, simple_loss=0.3244, pruned_loss=0.1162, over 21729.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3515, pruned_loss=0.1137, over 4195472.09 frames. ], batch size: 124, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:20:49,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=370734.0, ans=0.125 2023-06-19 03:21:25,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370854.0, ans=0.1 2023-06-19 03:22:06,201 INFO [train.py:996] (1/4) Epoch 3, batch 850, loss[loss=0.3293, simple_loss=0.3719, pruned_loss=0.1433, over 21624.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3474, pruned_loss=0.1134, over 4209146.03 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:22:20,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=371034.0, ans=0.125 2023-06-19 03:22:46,294 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.110e+02 3.682e+02 5.059e+02 8.553e+02, threshold=7.364e+02, percent-clipped=4.0 2023-06-19 03:23:14,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=371214.0, ans=0.1 2023-06-19 03:23:23,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=371214.0, ans=0.125 2023-06-19 03:23:43,031 INFO [train.py:996] (1/4) Epoch 3, batch 900, loss[loss=0.2677, simple_loss=0.3393, pruned_loss=0.09805, over 21505.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3436, pruned_loss=0.1129, over 4226087.01 frames. 
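[editor's note] The scaling.py ScheduledFloat records throughout this excerpt report named hyperparameters (dropout probabilities, skip rates, scale minima) whose current value "ans" is a function of batch_count, i.e. these values are scheduled over training rather than fixed. The exact schedule lives in scaling.py; purely as an illustration of the idea, a piecewise-linear schedule keyed on batch count could look like the hypothetical class below (the class name and breakpoints are made up, not the icefall implementation).

```python
from bisect import bisect_right
from typing import List, Tuple


class PiecewiseLinearSchedule:
    """Hypothetical stand-in for a value scheduled on batch_count.

    points: (batch_count, value) pairs; values are interpolated linearly
    between breakpoints and held constant outside them.
    """

    def __init__(self, name: str, points: List[Tuple[float, float]]):
        self.name = name
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)


# Example: a skip rate that decays from 0.5 at the start of training to 0.0
# by batch 20000 (breakpoints chosen arbitrarily for illustration).
skip_rate = PiecewiseLinearSchedule("attention_skip_rate", [(0.0, 0.5), (20000.0, 0.0)])
print(f"name={skip_rate.name}, batch_count=370494.0, ans={skip_rate.value(370494.0)}")
```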
], batch size: 195, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:23:50,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371334.0, ans=0.125 2023-06-19 03:24:41,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=371514.0, ans=0.125 2023-06-19 03:25:01,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-19 03:25:02,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=371574.0, ans=0.125 2023-06-19 03:25:23,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=371634.0, ans=0.125 2023-06-19 03:25:24,166 INFO [train.py:996] (1/4) Epoch 3, batch 950, loss[loss=0.2835, simple_loss=0.331, pruned_loss=0.118, over 21326.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3421, pruned_loss=0.1122, over 4246128.58 frames. ], batch size: 143, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:25:31,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=371634.0, ans=0.2 2023-06-19 03:25:51,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=371694.0, ans=0.09899494936611666 2023-06-19 03:25:59,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.854e+02 3.566e+02 4.630e+02 9.213e+02, threshold=7.133e+02, percent-clipped=4.0 2023-06-19 03:26:04,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=371754.0, ans=0.2 2023-06-19 03:26:34,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5 2023-06-19 03:27:03,896 INFO [train.py:996] (1/4) Epoch 3, batch 1000, loss[loss=0.3326, simple_loss=0.3835, pruned_loss=0.1408, over 21827.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3451, pruned_loss=0.1125, over 4258820.47 frames. ], batch size: 282, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:27:26,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=371994.0, ans=0.015 2023-06-19 03:27:32,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=371994.0, ans=0.125 2023-06-19 03:27:35,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=371994.0, ans=0.07 2023-06-19 03:27:37,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=371994.0, ans=0.2 2023-06-19 03:27:52,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=372054.0, ans=15.0 2023-06-19 03:28:04,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=372114.0, ans=0.2 2023-06-19 03:28:32,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2023-06-19 03:28:41,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=372174.0, ans=0.125 2023-06-19 03:28:48,330 INFO [train.py:996] (1/4) Epoch 3, batch 1050, loss[loss=0.2881, simple_loss=0.3491, pruned_loss=0.1136, over 21672.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3435, pruned_loss=0.1118, over 4262147.82 frames. ], batch size: 230, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:28:54,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=372234.0, ans=0.125 2023-06-19 03:29:13,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=372294.0, ans=0.125 2023-06-19 03:29:15,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=12.0 2023-06-19 03:29:24,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.077e+02 3.761e+02 4.435e+02 8.515e+02, threshold=7.523e+02, percent-clipped=2.0 2023-06-19 03:29:28,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=372354.0, ans=0.125 2023-06-19 03:29:54,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=372414.0, ans=0.125 2023-06-19 03:30:31,843 INFO [train.py:996] (1/4) Epoch 3, batch 1100, loss[loss=0.2232, simple_loss=0.2592, pruned_loss=0.09359, over 19956.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3435, pruned_loss=0.1113, over 4267080.18 frames. ], batch size: 703, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:30:49,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=372534.0, ans=0.2 2023-06-19 03:30:54,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=372594.0, ans=0.125 2023-06-19 03:31:38,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=372714.0, ans=0.0 2023-06-19 03:31:58,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=372774.0, ans=0.04949747468305833 2023-06-19 03:32:11,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=372774.0, ans=0.2 2023-06-19 03:32:17,140 INFO [train.py:996] (1/4) Epoch 3, batch 1150, loss[loss=0.315, simple_loss=0.3832, pruned_loss=0.1234, over 21760.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3444, pruned_loss=0.1115, over 4274085.78 frames. ], batch size: 351, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:32:26,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.28 vs. 
limit=15.0 2023-06-19 03:32:32,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=372834.0, ans=0.2 2023-06-19 03:33:03,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.941e+02 3.564e+02 4.361e+02 9.852e+02, threshold=7.128e+02, percent-clipped=2.0 2023-06-19 03:33:27,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=373014.0, ans=0.125 2023-06-19 03:34:05,597 INFO [train.py:996] (1/4) Epoch 3, batch 1200, loss[loss=0.2911, simple_loss=0.3559, pruned_loss=0.1132, over 21296.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.345, pruned_loss=0.1113, over 4276626.34 frames. ], batch size: 159, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:34:12,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=373134.0, ans=0.125 2023-06-19 03:34:59,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=373254.0, ans=0.2 2023-06-19 03:35:27,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=373374.0, ans=0.125 2023-06-19 03:35:49,147 INFO [train.py:996] (1/4) Epoch 3, batch 1250, loss[loss=0.3147, simple_loss=0.3859, pruned_loss=0.1218, over 21626.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3492, pruned_loss=0.1127, over 4276982.81 frames. ], batch size: 414, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:36:07,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=373494.0, ans=0.0 2023-06-19 03:36:30,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 3.067e+02 3.657e+02 4.609e+02 8.051e+02, threshold=7.314e+02, percent-clipped=2.0 2023-06-19 03:37:03,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=373614.0, ans=0.0 2023-06-19 03:37:12,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=373674.0, ans=0.125 2023-06-19 03:37:14,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=373674.0, ans=0.125 2023-06-19 03:37:15,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=373674.0, ans=0.0 2023-06-19 03:37:29,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-19 03:37:33,704 INFO [train.py:996] (1/4) Epoch 3, batch 1300, loss[loss=0.2599, simple_loss=0.3285, pruned_loss=0.09568, over 21287.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3534, pruned_loss=0.1153, over 4284117.63 frames. ], batch size: 176, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:37:43,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. 
limit=15.0 2023-06-19 03:38:34,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=373854.0, ans=10.0 2023-06-19 03:39:09,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=373974.0, ans=0.125 2023-06-19 03:39:17,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-19 03:39:18,826 INFO [train.py:996] (1/4) Epoch 3, batch 1350, loss[loss=0.2862, simple_loss=0.3377, pruned_loss=0.1173, over 21311.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3529, pruned_loss=0.1157, over 4279150.31 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:39:19,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.47 vs. limit=6.0 2023-06-19 03:39:26,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-19 03:39:28,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=22.5 2023-06-19 03:40:01,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.516e+02 4.679e+02 5.899e+02 9.616e+02, threshold=9.359e+02, percent-clipped=8.0 2023-06-19 03:40:16,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=374154.0, ans=0.025 2023-06-19 03:40:26,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=374214.0, ans=0.0 2023-06-19 03:40:59,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=374274.0, ans=0.2 2023-06-19 03:41:03,052 INFO [train.py:996] (1/4) Epoch 3, batch 1400, loss[loss=0.2643, simple_loss=0.3195, pruned_loss=0.1045, over 21981.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3509, pruned_loss=0.1141, over 4276199.10 frames. ], batch size: 103, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:41:11,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=374334.0, ans=0.2 2023-06-19 03:41:30,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=374394.0, ans=0.0 2023-06-19 03:41:33,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=374394.0, ans=0.125 2023-06-19 03:42:34,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=374574.0, ans=0.0 2023-06-19 03:42:47,124 INFO [train.py:996] (1/4) Epoch 3, batch 1450, loss[loss=0.3276, simple_loss=0.3953, pruned_loss=0.1299, over 21632.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3507, pruned_loss=0.1152, over 4286928.77 frames. 
], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:43:28,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.105e+02 3.604e+02 4.454e+02 7.120e+02, threshold=7.209e+02, percent-clipped=0.0 2023-06-19 03:43:53,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=374814.0, ans=0.05 2023-06-19 03:44:17,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=374874.0, ans=0.125 2023-06-19 03:44:27,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=374874.0, ans=0.0 2023-06-19 03:44:32,075 INFO [train.py:996] (1/4) Epoch 3, batch 1500, loss[loss=0.2686, simple_loss=0.3188, pruned_loss=0.1092, over 21753.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3514, pruned_loss=0.116, over 4284298.78 frames. ], batch size: 334, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:45:14,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=374994.0, ans=0.2 2023-06-19 03:45:21,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=375054.0, ans=0.125 2023-06-19 03:45:25,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-19 03:45:57,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-19 03:45:58,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-19 03:46:04,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=22.5 2023-06-19 03:46:17,210 INFO [train.py:996] (1/4) Epoch 3, batch 1550, loss[loss=0.2263, simple_loss=0.2885, pruned_loss=0.0821, over 21233.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3492, pruned_loss=0.1148, over 4282801.65 frames. ], batch size: 159, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:46:39,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2023-06-19 03:46:42,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=375234.0, ans=0.0 2023-06-19 03:47:05,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.746e+02 3.313e+02 3.948e+02 6.762e+02, threshold=6.626e+02, percent-clipped=0.0 2023-06-19 03:47:36,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=375414.0, ans=0.0 2023-06-19 03:47:46,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=375474.0, ans=0.0 2023-06-19 03:47:55,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=375474.0, ans=0.0 2023-06-19 03:48:06,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-19 03:48:14,551 INFO [train.py:996] (1/4) Epoch 3, batch 1600, loss[loss=0.2362, simple_loss=0.2805, pruned_loss=0.09594, over 21343.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3481, pruned_loss=0.114, over 4290666.97 frames. ], batch size: 131, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:48:25,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=375534.0, ans=0.125 2023-06-19 03:48:49,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=375594.0, ans=0.125 2023-06-19 03:49:47,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-19 03:50:00,853 INFO [train.py:996] (1/4) Epoch 3, batch 1650, loss[loss=0.3217, simple_loss=0.3987, pruned_loss=0.1223, over 21389.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3503, pruned_loss=0.1147, over 4283819.39 frames. ], batch size: 548, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:50:38,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.775e+02 3.357e+02 4.211e+02 7.088e+02, threshold=6.714e+02, percent-clipped=2.0 2023-06-19 03:51:10,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. limit=8.0 2023-06-19 03:51:46,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=376074.0, ans=0.025 2023-06-19 03:51:49,328 INFO [train.py:996] (1/4) Epoch 3, batch 1700, loss[loss=0.1771, simple_loss=0.2184, pruned_loss=0.06786, over 17777.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3538, pruned_loss=0.1153, over 4273526.22 frames. ], batch size: 64, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:51:57,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-19 03:52:36,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-19 03:53:22,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376374.0, ans=0.125 2023-06-19 03:53:40,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=22.5 2023-06-19 03:53:43,160 INFO [train.py:996] (1/4) Epoch 3, batch 1750, loss[loss=0.234, simple_loss=0.3057, pruned_loss=0.08115, over 21615.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3534, pruned_loss=0.1129, over 4274340.99 frames. 
], batch size: 263, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:54:04,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=376494.0, ans=0.07 2023-06-19 03:54:23,383 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.144e+02 4.448e+02 5.330e+02 9.147e+02, threshold=8.897e+02, percent-clipped=12.0 2023-06-19 03:55:02,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376614.0, ans=0.125 2023-06-19 03:55:13,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=376674.0, ans=0.125 2023-06-19 03:55:31,678 INFO [train.py:996] (1/4) Epoch 3, batch 1800, loss[loss=0.328, simple_loss=0.398, pruned_loss=0.129, over 21461.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3502, pruned_loss=0.1098, over 4275592.30 frames. ], batch size: 508, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:56:17,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=376854.0, ans=0.2 2023-06-19 03:56:25,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-19 03:57:00,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376974.0, ans=0.125 2023-06-19 03:57:11,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-19 03:57:11,725 INFO [train.py:996] (1/4) Epoch 3, batch 1850, loss[loss=0.2541, simple_loss=0.3203, pruned_loss=0.09394, over 21397.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3491, pruned_loss=0.1078, over 4273542.88 frames. ], batch size: 194, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:57:22,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=377034.0, ans=0.2 2023-06-19 03:57:38,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 03:58:00,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.940e+02 3.521e+02 4.849e+02 8.658e+02, threshold=7.043e+02, percent-clipped=0.0 2023-06-19 03:58:08,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-19 03:59:02,445 INFO [train.py:996] (1/4) Epoch 3, batch 1900, loss[loss=0.1854, simple_loss=0.2657, pruned_loss=0.05253, over 21402.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3489, pruned_loss=0.1081, over 4273933.91 frames. ], batch size: 211, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:59:31,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377394.0, ans=0.1 2023-06-19 03:59:32,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. 
limit=15.0 2023-06-19 04:00:03,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=377454.0, ans=0.125 2023-06-19 04:00:30,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=377574.0, ans=0.125 2023-06-19 04:00:46,506 INFO [train.py:996] (1/4) Epoch 3, batch 1950, loss[loss=0.2177, simple_loss=0.3072, pruned_loss=0.06416, over 21619.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3452, pruned_loss=0.1082, over 4270583.03 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:01:30,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 3.084e+02 3.765e+02 4.629e+02 7.601e+02, threshold=7.530e+02, percent-clipped=2.0 2023-06-19 04:02:29,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=377874.0, ans=0.125 2023-06-19 04:02:32,666 INFO [train.py:996] (1/4) Epoch 3, batch 2000, loss[loss=0.3143, simple_loss=0.4033, pruned_loss=0.1127, over 21747.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3419, pruned_loss=0.1069, over 4266478.89 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:02:54,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=377994.0, ans=0.125 2023-06-19 04:03:01,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377994.0, ans=0.1 2023-06-19 04:03:29,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378054.0, ans=0.1 2023-06-19 04:03:41,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=378114.0, ans=0.125 2023-06-19 04:04:11,026 INFO [train.py:996] (1/4) Epoch 3, batch 2050, loss[loss=0.3206, simple_loss=0.3678, pruned_loss=0.1367, over 21623.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3419, pruned_loss=0.1068, over 4265934.29 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:04:24,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=378234.0, ans=0.125 2023-06-19 04:04:24,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=378234.0, ans=0.125 2023-06-19 04:04:52,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=378354.0, ans=0.1 2023-06-19 04:04:54,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.993e+02 3.653e+02 4.561e+02 8.702e+02, threshold=7.306e+02, percent-clipped=1.0 2023-06-19 04:05:11,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378354.0, ans=0.1 2023-06-19 04:05:29,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=378414.0, ans=0.95 2023-06-19 04:05:54,036 INFO [train.py:996] (1/4) Epoch 3, batch 2100, loss[loss=0.3016, simple_loss=0.3564, pruned_loss=0.1234, over 21749.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3459, pruned_loss=0.1093, over 4274342.90 frames. 
], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:06:50,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=378654.0, ans=0.125 2023-06-19 04:07:35,119 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:07:39,929 INFO [train.py:996] (1/4) Epoch 3, batch 2150, loss[loss=0.2665, simple_loss=0.3201, pruned_loss=0.1065, over 21366.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3443, pruned_loss=0.1092, over 4266898.40 frames. ], batch size: 176, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:07:40,292 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:08:03,165 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:08:06,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378894.0, ans=0.1 2023-06-19 04:08:10,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=378894.0, ans=0.0 2023-06-19 04:08:25,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=378894.0, ans=0.04949747468305833 2023-06-19 04:08:30,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.207e+02 3.919e+02 5.012e+02 8.780e+02, threshold=7.837e+02, percent-clipped=4.0 2023-06-19 04:09:03,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=379074.0, ans=0.0 2023-06-19 04:09:13,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=379074.0, ans=0.0 2023-06-19 04:09:24,828 INFO [train.py:996] (1/4) Epoch 3, batch 2200, loss[loss=0.3541, simple_loss=0.4309, pruned_loss=0.1386, over 21586.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3478, pruned_loss=0.109, over 4271847.70 frames. ], batch size: 389, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:09:50,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=379134.0, ans=0.125 2023-06-19 04:10:13,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=379254.0, ans=0.125 2023-06-19 04:10:18,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=379254.0, ans=0.125 2023-06-19 04:10:38,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=379314.0, ans=0.125 2023-06-19 04:11:01,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=379374.0, ans=0.1 2023-06-19 04:11:09,100 INFO [train.py:996] (1/4) Epoch 3, batch 2250, loss[loss=0.2501, simple_loss=0.2996, pruned_loss=0.1003, over 21822.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.347, pruned_loss=0.1082, over 4268174.89 frames. 
], batch size: 98, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:11:36,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=379494.0, ans=0.2 2023-06-19 04:11:56,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.971e+02 3.785e+02 4.786e+02 8.748e+02, threshold=7.570e+02, percent-clipped=4.0 2023-06-19 04:12:25,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=379614.0, ans=0.125 2023-06-19 04:12:33,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=379614.0, ans=0.2 2023-06-19 04:12:52,009 INFO [train.py:996] (1/4) Epoch 3, batch 2300, loss[loss=0.2577, simple_loss=0.3312, pruned_loss=0.09208, over 21469.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3407, pruned_loss=0.107, over 4268117.42 frames. ], batch size: 211, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:13:30,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=379794.0, ans=0.2 2023-06-19 04:13:35,239 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:14:42,066 INFO [train.py:996] (1/4) Epoch 3, batch 2350, loss[loss=0.2966, simple_loss=0.3535, pruned_loss=0.1198, over 21231.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3398, pruned_loss=0.1093, over 4257903.71 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:14:44,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.39 vs. limit=22.5 2023-06-19 04:15:26,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=380154.0, ans=0.125 2023-06-19 04:15:27,979 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.227e+02 3.663e+02 5.028e+02 9.666e+02, threshold=7.327e+02, percent-clipped=5.0 2023-06-19 04:16:35,279 INFO [train.py:996] (1/4) Epoch 3, batch 2400, loss[loss=0.3057, simple_loss=0.373, pruned_loss=0.1192, over 21661.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3439, pruned_loss=0.1115, over 4253425.95 frames. ], batch size: 351, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:16:37,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=380334.0, ans=0.0 2023-06-19 04:17:18,224 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:17:23,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=380454.0, ans=0.0 2023-06-19 04:17:36,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-19 04:17:47,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=380514.0, ans=0.125 2023-06-19 04:18:21,202 INFO [train.py:996] (1/4) Epoch 3, batch 2450, loss[loss=0.2772, simple_loss=0.3357, pruned_loss=0.1093, over 21514.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.351, pruned_loss=0.1153, over 4259699.20 frames. 
], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:18:38,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=380634.0, ans=0.125 2023-06-19 04:18:39,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-19 04:18:41,762 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:18:51,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=380694.0, ans=0.125 2023-06-19 04:19:00,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.007e+02 3.779e+02 4.461e+02 8.893e+02, threshold=7.558e+02, percent-clipped=3.0 2023-06-19 04:19:12,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=380754.0, ans=0.1 2023-06-19 04:19:36,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=380874.0, ans=0.2 2023-06-19 04:20:04,650 INFO [train.py:996] (1/4) Epoch 3, batch 2500, loss[loss=0.2793, simple_loss=0.3521, pruned_loss=0.1032, over 21378.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3461, pruned_loss=0.1132, over 4269604.81 frames. ], batch size: 131, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:20:36,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=380994.0, ans=0.2 2023-06-19 04:20:38,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=380994.0, ans=0.0 2023-06-19 04:21:38,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=381174.0, ans=0.0 2023-06-19 04:21:39,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-19 04:21:50,559 INFO [train.py:996] (1/4) Epoch 3, batch 2550, loss[loss=0.3282, simple_loss=0.3721, pruned_loss=0.1421, over 21734.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3417, pruned_loss=0.1117, over 4267966.00 frames. ], batch size: 124, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:22:31,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.967e+02 3.517e+02 4.789e+02 7.584e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-19 04:23:36,601 INFO [train.py:996] (1/4) Epoch 3, batch 2600, loss[loss=0.2547, simple_loss=0.3683, pruned_loss=0.07059, over 19796.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3456, pruned_loss=0.1139, over 4273447.87 frames. ], batch size: 703, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:23:37,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-19 04:24:16,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=381654.0, ans=0.125 2023-06-19 04:24:30,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.89 vs. 
limit=15.0 2023-06-19 04:24:45,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=381714.0, ans=0.0 2023-06-19 04:25:24,291 INFO [train.py:996] (1/4) Epoch 3, batch 2650, loss[loss=0.2672, simple_loss=0.3775, pruned_loss=0.07844, over 19772.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3485, pruned_loss=0.1154, over 4279213.08 frames. ], batch size: 704, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:26:05,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.151e+02 3.898e+02 4.845e+02 8.708e+02, threshold=7.796e+02, percent-clipped=4.0 2023-06-19 04:27:09,885 INFO [train.py:996] (1/4) Epoch 3, batch 2700, loss[loss=0.2293, simple_loss=0.3, pruned_loss=0.07935, over 21775.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3464, pruned_loss=0.1131, over 4282108.25 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:27:10,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=382134.0, ans=0.07 2023-06-19 04:27:29,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=382134.0, ans=0.0 2023-06-19 04:28:17,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.17 vs. limit=10.0 2023-06-19 04:28:55,299 INFO [train.py:996] (1/4) Epoch 3, batch 2750, loss[loss=0.2939, simple_loss=0.351, pruned_loss=0.1184, over 21846.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3462, pruned_loss=0.113, over 4283010.92 frames. ], batch size: 391, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:29:19,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=382494.0, ans=0.125 2023-06-19 04:29:26,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=382494.0, ans=0.125 2023-06-19 04:29:34,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=382554.0, ans=0.125 2023-06-19 04:29:37,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.438e+02 4.314e+02 5.827e+02 1.229e+03, threshold=8.627e+02, percent-clipped=3.0 2023-06-19 04:30:07,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.93 vs. limit=12.0 2023-06-19 04:30:22,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=382674.0, ans=0.0 2023-06-19 04:30:22,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=382674.0, ans=0.0 2023-06-19 04:30:29,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=382674.0, ans=0.125 2023-06-19 04:30:45,967 INFO [train.py:996] (1/4) Epoch 3, batch 2800, loss[loss=0.2698, simple_loss=0.3317, pruned_loss=0.104, over 21215.00 frames. ], tot_loss[loss=0.2905, simple_loss=0.3505, pruned_loss=0.1153, over 4271902.47 frames. 
], batch size: 176, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:31:30,993 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:32:12,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=382974.0, ans=0.95 2023-06-19 04:32:32,059 INFO [train.py:996] (1/4) Epoch 3, batch 2850, loss[loss=0.1901, simple_loss=0.2127, pruned_loss=0.08375, over 16761.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3492, pruned_loss=0.115, over 4271069.49 frames. ], batch size: 60, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:32:40,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383034.0, ans=0.1 2023-06-19 04:33:06,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=383094.0, ans=0.125 2023-06-19 04:33:19,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.317e+02 3.934e+02 4.710e+02 8.134e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-19 04:33:26,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=383154.0, ans=0.5 2023-06-19 04:33:50,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=383214.0, ans=0.1 2023-06-19 04:34:14,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=383274.0, ans=0.05 2023-06-19 04:34:17,096 INFO [train.py:996] (1/4) Epoch 3, batch 2900, loss[loss=0.2516, simple_loss=0.3095, pruned_loss=0.09683, over 21672.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3476, pruned_loss=0.1143, over 4271769.79 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:34:38,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=383394.0, ans=0.2 2023-06-19 04:36:02,554 INFO [train.py:996] (1/4) Epoch 3, batch 2950, loss[loss=0.352, simple_loss=0.3928, pruned_loss=0.1556, over 21483.00 frames. ], tot_loss[loss=0.289, simple_loss=0.349, pruned_loss=0.1145, over 4277059.43 frames. ], batch size: 548, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:36:13,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=383634.0, ans=0.125 2023-06-19 04:36:45,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=383754.0, ans=0.07 2023-06-19 04:36:50,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.976e+02 3.392e+02 4.326e+02 8.351e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-19 04:37:04,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=383754.0, ans=0.125 2023-06-19 04:37:20,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=383814.0, ans=0.0 2023-06-19 04:37:48,106 INFO [train.py:996] (1/4) Epoch 3, batch 3000, loss[loss=0.3397, simple_loss=0.3931, pruned_loss=0.1431, over 21594.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3526, pruned_loss=0.1149, over 4276809.49 frames. 
], batch size: 230, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:37:48,106 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 04:38:05,902 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2668, simple_loss=0.3633, pruned_loss=0.08521, over 1796401.00 frames. 2023-06-19 04:38:05,903 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 04:38:52,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=383994.0, ans=0.0 2023-06-19 04:39:18,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=384114.0, ans=0.125 2023-06-19 04:39:52,610 INFO [train.py:996] (1/4) Epoch 3, batch 3050, loss[loss=0.2684, simple_loss=0.3452, pruned_loss=0.09584, over 21620.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3528, pruned_loss=0.1123, over 4277686.81 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:40:34,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=384294.0, ans=0.95 2023-06-19 04:40:35,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=384294.0, ans=0.2 2023-06-19 04:40:38,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384294.0, ans=0.0 2023-06-19 04:40:41,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384354.0, ans=0.1 2023-06-19 04:40:44,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 3.108e+02 3.737e+02 4.686e+02 8.351e+02, threshold=7.474e+02, percent-clipped=4.0 2023-06-19 04:40:47,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=384354.0, ans=12.0 2023-06-19 04:40:57,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=384354.0, ans=0.0 2023-06-19 04:41:41,699 INFO [train.py:996] (1/4) Epoch 3, batch 3100, loss[loss=0.2817, simple_loss=0.3539, pruned_loss=0.1047, over 21382.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3528, pruned_loss=0.1115, over 4276890.19 frames. ], batch size: 194, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:42:14,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384594.0, ans=0.1 2023-06-19 04:42:15,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=384594.0, ans=0.1 2023-06-19 04:42:36,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=384654.0, ans=0.0 2023-06-19 04:42:45,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-19 04:43:28,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-19 04:43:31,994 INFO [train.py:996] (1/4) Epoch 3, batch 3150, loss[loss=0.2258, simple_loss=0.2973, pruned_loss=0.07713, over 20779.00 frames. 
], tot_loss[loss=0.2923, simple_loss=0.3572, pruned_loss=0.1138, over 4277448.55 frames. ], batch size: 608, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:43:50,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-19 04:44:19,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.177e+02 3.933e+02 4.816e+02 8.908e+02, threshold=7.865e+02, percent-clipped=2.0 2023-06-19 04:44:50,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=385014.0, ans=0.125 2023-06-19 04:45:23,739 INFO [train.py:996] (1/4) Epoch 3, batch 3200, loss[loss=0.2827, simple_loss=0.3421, pruned_loss=0.1117, over 21445.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3556, pruned_loss=0.1121, over 4280303.56 frames. ], batch size: 211, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:46:45,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=385374.0, ans=0.125 2023-06-19 04:46:58,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=385374.0, ans=0.125 2023-06-19 04:47:08,032 INFO [train.py:996] (1/4) Epoch 3, batch 3250, loss[loss=0.2881, simple_loss=0.3315, pruned_loss=0.1224, over 21149.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3554, pruned_loss=0.1146, over 4275676.08 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:47:50,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.327e+02 4.160e+02 5.584e+02 8.725e+02, threshold=8.319e+02, percent-clipped=2.0 2023-06-19 04:48:08,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=385614.0, ans=0.2 2023-06-19 04:48:59,512 INFO [train.py:996] (1/4) Epoch 3, batch 3300, loss[loss=0.271, simple_loss=0.3321, pruned_loss=0.1049, over 21180.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3523, pruned_loss=0.1141, over 4275745.38 frames. ], batch size: 159, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:49:25,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=385794.0, ans=0.2 2023-06-19 04:50:03,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=385914.0, ans=0.0 2023-06-19 04:50:44,347 INFO [train.py:996] (1/4) Epoch 3, batch 3350, loss[loss=0.3251, simple_loss=0.3686, pruned_loss=0.1408, over 21307.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3555, pruned_loss=0.115, over 4282962.57 frames. ], batch size: 159, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:51:11,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-19 04:51:14,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. 
limit=15.0 2023-06-19 04:51:18,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=386154.0, ans=0.0 2023-06-19 04:51:20,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.290e+02 3.790e+02 4.247e+02 7.031e+02, threshold=7.579e+02, percent-clipped=0.0 2023-06-19 04:51:41,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=386154.0, ans=0.2 2023-06-19 04:52:27,716 INFO [train.py:996] (1/4) Epoch 3, batch 3400, loss[loss=0.2825, simple_loss=0.339, pruned_loss=0.113, over 21214.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3552, pruned_loss=0.1154, over 4285015.27 frames. ], batch size: 159, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:52:31,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=386334.0, ans=0.2 2023-06-19 04:52:44,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=386394.0, ans=0.125 2023-06-19 04:53:20,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-19 04:53:46,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-19 04:53:58,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=386574.0, ans=0.125 2023-06-19 04:54:13,031 INFO [train.py:996] (1/4) Epoch 3, batch 3450, loss[loss=0.2451, simple_loss=0.3023, pruned_loss=0.09394, over 21565.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3495, pruned_loss=0.1141, over 4287811.89 frames. ], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:54:17,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=386634.0, ans=0.125 2023-06-19 04:54:24,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=386634.0, ans=0.025 2023-06-19 04:55:06,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.192e+02 3.930e+02 4.779e+02 8.558e+02, threshold=7.861e+02, percent-clipped=2.0 2023-06-19 04:55:13,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=386754.0, ans=0.125 2023-06-19 04:55:22,882 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:55:34,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=386814.0, ans=0.125 2023-06-19 04:55:57,916 INFO [train.py:996] (1/4) Epoch 3, batch 3500, loss[loss=0.4353, simple_loss=0.4917, pruned_loss=0.1895, over 21445.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3606, pruned_loss=0.1192, over 4291023.81 frames. ], batch size: 471, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:56:26,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=10.0 2023-06-19 04:57:19,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-19 04:57:44,031 INFO [train.py:996] (1/4) Epoch 3, batch 3550, loss[loss=0.2851, simple_loss=0.3392, pruned_loss=0.1155, over 21503.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3614, pruned_loss=0.1205, over 4280868.51 frames. ], batch size: 441, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:57:53,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=387234.0, ans=0.125 2023-06-19 04:58:35,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-19 04:58:38,183 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.279e+02 3.936e+02 4.776e+02 8.299e+02, threshold=7.873e+02, percent-clipped=2.0 2023-06-19 04:59:31,869 INFO [train.py:996] (1/4) Epoch 3, batch 3600, loss[loss=0.3109, simple_loss=0.3617, pruned_loss=0.13, over 21457.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.358, pruned_loss=0.1203, over 4282657.29 frames. ], batch size: 194, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:00:05,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=387594.0, ans=0.1 2023-06-19 05:00:07,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=387594.0, ans=0.125 2023-06-19 05:00:15,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=387594.0, ans=0.0 2023-06-19 05:00:34,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=12.0 2023-06-19 05:01:16,062 INFO [train.py:996] (1/4) Epoch 3, batch 3650, loss[loss=0.3473, simple_loss=0.3952, pruned_loss=0.1497, over 21314.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3592, pruned_loss=0.1198, over 4281633.66 frames. ], batch size: 143, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:01:31,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=387834.0, ans=0.125 2023-06-19 05:01:49,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=387894.0, ans=0.125 2023-06-19 05:02:02,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=387894.0, ans=0.07 2023-06-19 05:02:08,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.313e+02 3.848e+02 4.708e+02 1.033e+03, threshold=7.696e+02, percent-clipped=4.0 2023-06-19 05:02:18,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=387954.0, ans=0.2 2023-06-19 05:02:59,720 INFO [train.py:996] (1/4) Epoch 3, batch 3700, loss[loss=0.3045, simple_loss=0.3606, pruned_loss=0.1243, over 21747.00 frames. ], tot_loss[loss=0.2994, simple_loss=0.3585, pruned_loss=0.1201, over 4288957.16 frames. 
], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:03:00,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=12.0 2023-06-19 05:04:11,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=388314.0, ans=0.1 2023-06-19 05:04:11,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=388314.0, ans=0.125 2023-06-19 05:04:21,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=388314.0, ans=0.125 2023-06-19 05:04:36,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=388374.0, ans=0.2 2023-06-19 05:04:56,326 INFO [train.py:996] (1/4) Epoch 3, batch 3750, loss[loss=0.2912, simple_loss=0.3324, pruned_loss=0.125, over 20168.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3556, pruned_loss=0.1184, over 4288569.14 frames. ], batch size: 703, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:05:39,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-19 05:05:43,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 3.137e+02 4.357e+02 5.330e+02 7.776e+02, threshold=8.713e+02, percent-clipped=1.0 2023-06-19 05:06:46,815 INFO [train.py:996] (1/4) Epoch 3, batch 3800, loss[loss=0.3085, simple_loss=0.3701, pruned_loss=0.1234, over 21754.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3526, pruned_loss=0.1169, over 4280835.77 frames. ], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:06:50,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388734.0, ans=0.125 2023-06-19 05:07:19,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=388794.0, ans=0.125 2023-06-19 05:08:23,793 INFO [train.py:996] (1/4) Epoch 3, batch 3850, loss[loss=0.2317, simple_loss=0.2861, pruned_loss=0.08865, over 21827.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3496, pruned_loss=0.1171, over 4271391.89 frames. ], batch size: 107, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:08:30,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389034.0, ans=0.1 2023-06-19 05:08:32,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=389034.0, ans=0.025 2023-06-19 05:08:35,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-19 05:08:37,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389034.0, ans=0.1 2023-06-19 05:09:10,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.065e+02 3.544e+02 4.567e+02 7.617e+02, threshold=7.087e+02, percent-clipped=0.0 2023-06-19 05:10:03,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.42 vs. 
limit=15.0 2023-06-19 05:10:07,256 INFO [train.py:996] (1/4) Epoch 3, batch 3900, loss[loss=0.2958, simple_loss=0.3733, pruned_loss=0.1091, over 20757.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3463, pruned_loss=0.1159, over 4280266.26 frames. ], batch size: 607, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:11:02,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=389454.0, ans=0.1 2023-06-19 05:11:06,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=389514.0, ans=0.125 2023-06-19 05:11:10,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=389514.0, ans=0.1 2023-06-19 05:11:30,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=389574.0, ans=0.0 2023-06-19 05:11:51,736 INFO [train.py:996] (1/4) Epoch 3, batch 3950, loss[loss=0.2423, simple_loss=0.3211, pruned_loss=0.08178, over 21661.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3443, pruned_loss=0.1131, over 4285521.62 frames. ], batch size: 414, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:11:53,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=389634.0, ans=0.125 2023-06-19 05:12:02,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=389634.0, ans=0.125 2023-06-19 05:12:38,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.040e+02 3.554e+02 4.206e+02 5.675e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-19 05:13:36,452 INFO [train.py:996] (1/4) Epoch 3, batch 4000, loss[loss=0.2195, simple_loss=0.2735, pruned_loss=0.08275, over 21590.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3373, pruned_loss=0.1095, over 4276579.42 frames. ], batch size: 231, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:14:53,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=390114.0, ans=0.125 2023-06-19 05:15:11,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=390174.0, ans=0.125 2023-06-19 05:15:12,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=390174.0, ans=0.2 2023-06-19 05:15:23,365 INFO [train.py:996] (1/4) Epoch 3, batch 4050, loss[loss=0.333, simple_loss=0.3786, pruned_loss=0.1438, over 21894.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3373, pruned_loss=0.1086, over 4279597.60 frames. 
], batch size: 107, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:15:54,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390294.0, ans=0.125 2023-06-19 05:15:58,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=390294.0, ans=0.0 2023-06-19 05:16:09,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=390354.0, ans=0.0 2023-06-19 05:16:10,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.094e+02 3.976e+02 4.759e+02 9.787e+02, threshold=7.952e+02, percent-clipped=5.0 2023-06-19 05:16:11,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=390354.0, ans=0.125 2023-06-19 05:16:55,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=390474.0, ans=0.0 2023-06-19 05:17:03,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=390474.0, ans=0.125 2023-06-19 05:17:13,588 INFO [train.py:996] (1/4) Epoch 3, batch 4100, loss[loss=0.3054, simple_loss=0.3598, pruned_loss=0.1256, over 21776.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3402, pruned_loss=0.1098, over 4281067.59 frames. ], batch size: 112, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:17:17,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=390534.0, ans=0.125 2023-06-19 05:17:37,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=390594.0, ans=0.0 2023-06-19 05:17:38,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-19 05:17:59,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=390654.0, ans=0.125 2023-06-19 05:18:54,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=390774.0, ans=10.0 2023-06-19 05:18:58,802 INFO [train.py:996] (1/4) Epoch 3, batch 4150, loss[loss=0.2206, simple_loss=0.3013, pruned_loss=0.06995, over 21279.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3421, pruned_loss=0.1077, over 4284440.02 frames. 
], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:19:16,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390834.0, ans=0.1 2023-06-19 05:19:35,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=390894.0, ans=0.125 2023-06-19 05:19:36,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=390954.0, ans=0.125 2023-06-19 05:19:41,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 3.003e+02 3.732e+02 5.110e+02 9.922e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-19 05:19:49,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=390954.0, ans=0.0 2023-06-19 05:20:46,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=391074.0, ans=0.2 2023-06-19 05:20:51,276 INFO [train.py:996] (1/4) Epoch 3, batch 4200, loss[loss=0.3158, simple_loss=0.3985, pruned_loss=0.1166, over 21871.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3431, pruned_loss=0.1084, over 4275592.46 frames. ], batch size: 372, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:21:07,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=391194.0, ans=0.125 2023-06-19 05:22:11,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-19 05:22:14,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=391314.0, ans=0.1 2023-06-19 05:22:40,214 INFO [train.py:996] (1/4) Epoch 3, batch 4250, loss[loss=0.3198, simple_loss=0.3815, pruned_loss=0.129, over 21962.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3478, pruned_loss=0.1099, over 4271828.64 frames. ], batch size: 317, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:22:49,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=391434.0, ans=0.0 2023-06-19 05:22:57,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=391494.0, ans=0.125 2023-06-19 05:23:30,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.309e+02 4.046e+02 4.889e+02 9.500e+02, threshold=8.092e+02, percent-clipped=4.0 2023-06-19 05:23:49,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=391614.0, ans=0.2 2023-06-19 05:24:11,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=391674.0, ans=0.125 2023-06-19 05:24:27,975 INFO [train.py:996] (1/4) Epoch 3, batch 4300, loss[loss=0.2462, simple_loss=0.3171, pruned_loss=0.08766, over 21421.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3514, pruned_loss=0.1103, over 4273081.65 frames. 
], batch size: 194, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:24:36,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=391734.0, ans=0.125 2023-06-19 05:25:24,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=391854.0, ans=0.125 2023-06-19 05:26:01,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391974.0, ans=0.1 2023-06-19 05:26:13,014 INFO [train.py:996] (1/4) Epoch 3, batch 4350, loss[loss=0.3112, simple_loss=0.4332, pruned_loss=0.0946, over 19819.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3517, pruned_loss=0.1104, over 4275277.51 frames. ], batch size: 702, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:26:46,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=392094.0, ans=0.2 2023-06-19 05:27:08,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.136e+02 3.673e+02 4.293e+02 1.094e+03, threshold=7.346e+02, percent-clipped=4.0 2023-06-19 05:27:59,229 INFO [train.py:996] (1/4) Epoch 3, batch 4400, loss[loss=0.2278, simple_loss=0.3093, pruned_loss=0.07318, over 21355.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3474, pruned_loss=0.109, over 4274073.95 frames. ], batch size: 160, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:28:24,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-06-19 05:29:16,865 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:29:39,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392574.0, ans=0.1 2023-06-19 05:29:49,755 INFO [train.py:996] (1/4) Epoch 3, batch 4450, loss[loss=0.3509, simple_loss=0.4159, pruned_loss=0.1429, over 21770.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3571, pruned_loss=0.1113, over 4271126.77 frames. ], batch size: 414, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:30:13,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=392694.0, ans=0.125 2023-06-19 05:30:40,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.016e+02 3.680e+02 4.427e+02 7.679e+02, threshold=7.360e+02, percent-clipped=2.0 2023-06-19 05:30:44,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-19 05:30:52,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=392814.0, ans=0.0 2023-06-19 05:31:23,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=392874.0, ans=0.0 2023-06-19 05:31:36,783 INFO [train.py:996] (1/4) Epoch 3, batch 4500, loss[loss=0.3168, simple_loss=0.3794, pruned_loss=0.1271, over 21859.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3584, pruned_loss=0.1136, over 4272472.25 frames. 
], batch size: 371, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:32:02,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=392994.0, ans=0.015 2023-06-19 05:32:02,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-19 05:32:12,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=392994.0, ans=0.95 2023-06-19 05:32:35,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=393054.0, ans=0.0 2023-06-19 05:32:44,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-19 05:33:01,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-06-19 05:33:16,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=393174.0, ans=0.125 2023-06-19 05:33:19,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.09 vs. limit=6.0 2023-06-19 05:33:28,827 INFO [train.py:996] (1/4) Epoch 3, batch 4550, loss[loss=0.3189, simple_loss=0.3822, pruned_loss=0.1278, over 21327.00 frames. ], tot_loss[loss=0.2933, simple_loss=0.3599, pruned_loss=0.1133, over 4280127.83 frames. ], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:33:53,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=393294.0, ans=0.125 2023-06-19 05:34:13,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.526e+02 4.619e+02 6.028e+02 1.155e+03, threshold=9.238e+02, percent-clipped=14.0 2023-06-19 05:34:22,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=393354.0, ans=0.0 2023-06-19 05:34:22,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-19 05:34:56,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=393474.0, ans=0.025 2023-06-19 05:35:14,599 INFO [train.py:996] (1/4) Epoch 3, batch 4600, loss[loss=0.2891, simple_loss=0.358, pruned_loss=0.1101, over 21693.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3624, pruned_loss=0.1152, over 4282404.16 frames. ], batch size: 414, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:35:58,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=393654.0, ans=0.125 2023-06-19 05:36:28,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=393714.0, ans=0.2 2023-06-19 05:37:01,736 INFO [train.py:996] (1/4) Epoch 3, batch 4650, loss[loss=0.2606, simple_loss=0.3245, pruned_loss=0.09835, over 21808.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3558, pruned_loss=0.1132, over 4283695.00 frames. 
], batch size: 112, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:37:15,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=393834.0, ans=0.0 2023-06-19 05:37:20,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=393834.0, ans=0.0 2023-06-19 05:37:31,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=393894.0, ans=0.125 2023-06-19 05:37:44,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.775e+02 3.310e+02 3.723e+02 7.638e+02, threshold=6.620e+02, percent-clipped=0.0 2023-06-19 05:37:47,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. limit=10.0 2023-06-19 05:37:54,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=393954.0, ans=0.125 2023-06-19 05:38:52,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=394134.0, ans=0.125 2023-06-19 05:38:53,950 INFO [train.py:996] (1/4) Epoch 3, batch 4700, loss[loss=0.2516, simple_loss=0.3108, pruned_loss=0.09618, over 21914.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3447, pruned_loss=0.1094, over 4281757.23 frames. ], batch size: 107, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:39:09,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=394194.0, ans=0.0 2023-06-19 05:39:40,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-19 05:40:08,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=394314.0, ans=0.0 2023-06-19 05:40:08,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.06 vs. limit=22.5 2023-06-19 05:40:32,932 INFO [train.py:996] (1/4) Epoch 3, batch 4750, loss[loss=0.3469, simple_loss=0.3866, pruned_loss=0.1536, over 21788.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3398, pruned_loss=0.1087, over 4266951.03 frames. ], batch size: 107, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:41:11,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=394554.0, ans=0.04949747468305833 2023-06-19 05:41:21,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.052e+02 3.874e+02 5.001e+02 1.083e+03, threshold=7.748e+02, percent-clipped=9.0 2023-06-19 05:41:34,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=394554.0, ans=0.125 2023-06-19 05:41:54,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=394674.0, ans=10.0 2023-06-19 05:42:25,159 INFO [train.py:996] (1/4) Epoch 3, batch 4800, loss[loss=0.296, simple_loss=0.3546, pruned_loss=0.1187, over 21567.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3405, pruned_loss=0.11, over 4273144.12 frames. 
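The recurring [scaling.py:182] ScheduledFloat entries log the current value (ans=) of a parameter schedule evaluated at the current batch_count: dropout probabilities, skip rates, balancer probabilities and so on. The sketch below shows one plausible piecewise-linear schedule of this kind; the helper and its breakpoints are assumptions for illustration, not the recipe's actual scaling.py implementation.

```python
def scheduled_float(batch_count, breakpoints):
    """Piecewise-linear value as a function of batch_count.
    breakpoints: list of (batch_count, value) pairs in increasing order."""
    x0, y0 = breakpoints[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in breakpoints[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
        x0, y0 = x1, y1
    return y0  # past the last breakpoint, hold the final value

# Hypothetical schedule: a dropout that decays from 0.3 to 0.1 over the first
# 20k batches and then stays flat, queried at a batch_count like those above.
print(scheduled_float(390894.0, [(0.0, 0.3), (20000.0, 0.1)]))  # -> 0.1
```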
], batch size: 131, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:42:27,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=394734.0, ans=0.125 2023-06-19 05:42:45,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-19 05:43:40,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=394974.0, ans=0.125 2023-06-19 05:43:59,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=394974.0, ans=0.1 2023-06-19 05:44:10,683 INFO [train.py:996] (1/4) Epoch 3, batch 4850, loss[loss=0.2605, simple_loss=0.3426, pruned_loss=0.08923, over 19990.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3403, pruned_loss=0.1093, over 4260156.41 frames. ], batch size: 703, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:44:36,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=395094.0, ans=0.0 2023-06-19 05:44:38,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=395094.0, ans=0.125 2023-06-19 05:44:54,933 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.516e+02 4.493e+02 6.112e+02 1.101e+03, threshold=8.986e+02, percent-clipped=11.0 2023-06-19 05:45:02,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395154.0, ans=0.1 2023-06-19 05:45:33,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=395274.0, ans=0.2 2023-06-19 05:45:55,754 INFO [train.py:996] (1/4) Epoch 3, batch 4900, loss[loss=0.2965, simple_loss=0.3649, pruned_loss=0.1141, over 21411.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3438, pruned_loss=0.1119, over 4265965.21 frames. ], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:46:51,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-19 05:47:09,603 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:47:11,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=395514.0, ans=0.125 2023-06-19 05:47:30,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-19 05:47:38,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=395574.0, ans=0.125 2023-06-19 05:47:41,178 INFO [train.py:996] (1/4) Epoch 3, batch 4950, loss[loss=0.2216, simple_loss=0.3021, pruned_loss=0.07058, over 21275.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3461, pruned_loss=0.1105, over 4266804.87 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:48:30,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.03 vs. 
limit=12.0 2023-06-19 05:48:31,573 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.805e+02 3.354e+02 4.068e+02 9.306e+02, threshold=6.708e+02, percent-clipped=1.0 2023-06-19 05:48:35,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-19 05:48:52,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=395814.0, ans=0.125 2023-06-19 05:49:05,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=395874.0, ans=0.125 2023-06-19 05:49:27,290 INFO [train.py:996] (1/4) Epoch 3, batch 5000, loss[loss=0.2482, simple_loss=0.3244, pruned_loss=0.086, over 21489.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3446, pruned_loss=0.1064, over 4266493.98 frames. ], batch size: 212, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:49:34,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=395934.0, ans=0.0 2023-06-19 05:49:48,157 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:49:50,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=395994.0, ans=0.0 2023-06-19 05:50:09,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=396054.0, ans=0.0 2023-06-19 05:51:12,152 INFO [train.py:996] (1/4) Epoch 3, batch 5050, loss[loss=0.3032, simple_loss=0.3571, pruned_loss=0.1246, over 21903.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3455, pruned_loss=0.1078, over 4276753.19 frames. ], batch size: 107, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:51:29,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=396294.0, ans=0.125 2023-06-19 05:51:56,269 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.428e+02 4.062e+02 4.972e+02 8.550e+02, threshold=8.125e+02, percent-clipped=7.0 2023-06-19 05:52:16,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=396414.0, ans=0.05 2023-06-19 05:52:51,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=396534.0, ans=0.125 2023-06-19 05:52:51,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=396534.0, ans=0.1 2023-06-19 05:52:52,317 INFO [train.py:996] (1/4) Epoch 3, batch 5100, loss[loss=0.2408, simple_loss=0.3468, pruned_loss=0.06735, over 21173.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3444, pruned_loss=0.108, over 4283706.22 frames. 
], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:52:52,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=396534.0, ans=0.125 2023-06-19 05:54:20,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396774.0, ans=0.1 2023-06-19 05:54:37,003 INFO [train.py:996] (1/4) Epoch 3, batch 5150, loss[loss=0.2601, simple_loss=0.3193, pruned_loss=0.1004, over 21344.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3448, pruned_loss=0.1089, over 4283085.63 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:54:39,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=396834.0, ans=0.125 2023-06-19 05:54:55,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=396894.0, ans=0.125 2023-06-19 05:55:27,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.247e+02 3.957e+02 4.711e+02 9.896e+02, threshold=7.915e+02, percent-clipped=1.0 2023-06-19 05:55:36,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=396954.0, ans=10.0 2023-06-19 05:56:23,665 INFO [train.py:996] (1/4) Epoch 3, batch 5200, loss[loss=0.2934, simple_loss=0.3603, pruned_loss=0.1132, over 21848.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3443, pruned_loss=0.1085, over 4276189.33 frames. ], batch size: 371, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:56:44,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397194.0, ans=0.125 2023-06-19 05:57:15,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=397254.0, ans=0.125 2023-06-19 05:57:37,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=397314.0, ans=0.125 2023-06-19 05:57:37,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-19 05:57:54,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397374.0, ans=0.1 2023-06-19 05:58:09,820 INFO [train.py:996] (1/4) Epoch 3, batch 5250, loss[loss=0.2731, simple_loss=0.3574, pruned_loss=0.0944, over 21637.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3492, pruned_loss=0.1079, over 4269278.16 frames. ], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:58:59,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.349e+02 3.883e+02 5.144e+02 8.715e+02, threshold=7.765e+02, percent-clipped=1.0 2023-06-19 05:59:45,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=397674.0, ans=0.2 2023-06-19 05:59:45,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=397674.0, ans=0.125 2023-06-19 05:59:53,200 INFO [train.py:996] (1/4) Epoch 3, batch 5300, loss[loss=0.3281, simple_loss=0.3716, pruned_loss=0.1423, over 21640.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3484, pruned_loss=0.1095, over 4279194.02 frames. 
], batch size: 471, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:01:33,537 INFO [train.py:996] (1/4) Epoch 3, batch 5350, loss[loss=0.2683, simple_loss=0.3274, pruned_loss=0.1046, over 21264.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3476, pruned_loss=0.1119, over 4290765.45 frames. ], batch size: 143, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:01:39,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=398034.0, ans=0.125 2023-06-19 06:01:41,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398034.0, ans=0.125 2023-06-19 06:01:47,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-19 06:01:57,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=398094.0, ans=0.125 2023-06-19 06:02:12,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=398094.0, ans=0.125 2023-06-19 06:02:23,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.150e+02 3.600e+02 4.539e+02 9.021e+02, threshold=7.200e+02, percent-clipped=2.0 2023-06-19 06:02:27,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=398154.0, ans=0.125 2023-06-19 06:02:44,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=398214.0, ans=0.2 2023-06-19 06:03:14,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-19 06:03:18,164 INFO [train.py:996] (1/4) Epoch 3, batch 5400, loss[loss=0.2477, simple_loss=0.3162, pruned_loss=0.08957, over 21817.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3468, pruned_loss=0.1128, over 4289125.11 frames. ], batch size: 282, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:04:12,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-19 06:04:29,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=398514.0, ans=0.125 2023-06-19 06:05:02,973 INFO [train.py:996] (1/4) Epoch 3, batch 5450, loss[loss=0.2878, simple_loss=0.4058, pruned_loss=0.08488, over 19747.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3476, pruned_loss=0.1091, over 4282786.58 frames. 
], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:05:03,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398634.0, ans=0.1 2023-06-19 06:05:08,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=398634.0, ans=0.2 2023-06-19 06:06:00,066 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.869e+02 3.395e+02 4.566e+02 8.866e+02, threshold=6.789e+02, percent-clipped=3.0 2023-06-19 06:06:20,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=398814.0, ans=0.125 2023-06-19 06:06:24,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.25 vs. limit=22.5 2023-06-19 06:06:42,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398874.0, ans=0.1 2023-06-19 06:06:50,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=398874.0, ans=0.04949747468305833 2023-06-19 06:07:02,459 INFO [train.py:996] (1/4) Epoch 3, batch 5500, loss[loss=0.2678, simple_loss=0.36, pruned_loss=0.08778, over 21654.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3495, pruned_loss=0.1037, over 4276107.24 frames. ], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:07:45,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=399054.0, ans=0.035 2023-06-19 06:08:33,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=399174.0, ans=0.125 2023-06-19 06:08:46,531 INFO [train.py:996] (1/4) Epoch 3, batch 5550, loss[loss=0.2324, simple_loss=0.3244, pruned_loss=0.07019, over 21763.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3474, pruned_loss=0.1002, over 4283048.76 frames. ], batch size: 371, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:09:16,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=399294.0, ans=0.0 2023-06-19 06:09:38,645 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.755e+02 3.259e+02 4.197e+02 7.319e+02, threshold=6.518e+02, percent-clipped=2.0 2023-06-19 06:09:45,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-19 06:10:15,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=399474.0, ans=0.0 2023-06-19 06:10:17,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=399474.0, ans=0.2 2023-06-19 06:10:34,000 INFO [train.py:996] (1/4) Epoch 3, batch 5600, loss[loss=0.4329, simple_loss=0.4887, pruned_loss=0.1886, over 21411.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3445, pruned_loss=0.09768, over 4281421.48 frames. 
], batch size: 507, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:11:02,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=399594.0, ans=0.1 2023-06-19 06:11:51,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-19 06:12:01,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=399774.0, ans=0.2 2023-06-19 06:12:17,373 INFO [train.py:996] (1/4) Epoch 3, batch 5650, loss[loss=0.2929, simple_loss=0.3472, pruned_loss=0.1193, over 21310.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3499, pruned_loss=0.1003, over 4278880.43 frames. ], batch size: 159, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:12:54,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=399894.0, ans=0.125 2023-06-19 06:13:13,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.102e+02 3.958e+02 5.147e+02 8.863e+02, threshold=7.916e+02, percent-clipped=12.0 2023-06-19 06:13:31,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=400014.0, ans=0.0 2023-06-19 06:13:48,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=400074.0, ans=0.0 2023-06-19 06:14:16,668 INFO [train.py:996] (1/4) Epoch 3, batch 5700, loss[loss=0.2767, simple_loss=0.3721, pruned_loss=0.09062, over 20007.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3502, pruned_loss=0.1025, over 4278273.46 frames. ], batch size: 702, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:14:24,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=400134.0, ans=15.0 2023-06-19 06:14:50,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=400194.0, ans=0.125 2023-06-19 06:15:39,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=400374.0, ans=0.0 2023-06-19 06:15:59,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=400374.0, ans=0.125 2023-06-19 06:16:01,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=400374.0, ans=0.125 2023-06-19 06:16:03,934 INFO [train.py:996] (1/4) Epoch 3, batch 5750, loss[loss=0.2114, simple_loss=0.3015, pruned_loss=0.06069, over 21819.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3435, pruned_loss=0.0994, over 4274891.02 frames. 
], batch size: 282, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:16:20,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=400434.0, ans=0.0 2023-06-19 06:16:47,828 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:16:54,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.901e+02 3.353e+02 4.192e+02 8.562e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-19 06:17:48,969 INFO [train.py:996] (1/4) Epoch 3, batch 5800, loss[loss=0.2578, simple_loss=0.3321, pruned_loss=0.09174, over 20154.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3422, pruned_loss=0.09793, over 4275476.71 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:18:20,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-19 06:18:20,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.81 vs. limit=22.5 2023-06-19 06:18:54,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=400854.0, ans=0.125 2023-06-19 06:18:59,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-19 06:19:41,062 INFO [train.py:996] (1/4) Epoch 3, batch 5850, loss[loss=0.2256, simple_loss=0.3319, pruned_loss=0.0596, over 21597.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3371, pruned_loss=0.09208, over 4275844.45 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:20:32,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.443e+02 2.875e+02 3.533e+02 5.012e+02, threshold=5.751e+02, percent-clipped=0.0 2023-06-19 06:21:21,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=401274.0, ans=0.2 2023-06-19 06:21:21,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=401274.0, ans=0.1 2023-06-19 06:21:31,763 INFO [train.py:996] (1/4) Epoch 3, batch 5900, loss[loss=0.2251, simple_loss=0.2975, pruned_loss=0.07628, over 21520.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3281, pruned_loss=0.08504, over 4280130.63 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:21:51,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=401394.0, ans=0.125 2023-06-19 06:22:22,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=401454.0, ans=0.0 2023-06-19 06:22:49,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=401574.0, ans=0.025 2023-06-19 06:23:13,897 INFO [train.py:996] (1/4) Epoch 3, batch 5950, loss[loss=0.2584, simple_loss=0.31, pruned_loss=0.1034, over 21657.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.33, pruned_loss=0.09112, over 4290304.03 frames. 
], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:23:58,570 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.851e+02 3.351e+02 4.142e+02 6.067e+02, threshold=6.702e+02, percent-clipped=3.0 2023-06-19 06:24:24,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.44 vs. limit=15.0 2023-06-19 06:25:00,252 INFO [train.py:996] (1/4) Epoch 3, batch 6000, loss[loss=0.2745, simple_loss=0.3253, pruned_loss=0.1119, over 21859.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3282, pruned_loss=0.0966, over 4293509.11 frames. ], batch size: 98, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:25:00,253 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 06:25:17,462 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2818, simple_loss=0.374, pruned_loss=0.0948, over 1796401.00 frames. 2023-06-19 06:25:17,463 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 06:25:31,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=401934.0, ans=0.2 2023-06-19 06:26:32,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=402114.0, ans=0.0 2023-06-19 06:26:36,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=402114.0, ans=0.125 2023-06-19 06:26:51,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-19 06:26:59,504 INFO [train.py:996] (1/4) Epoch 3, batch 6050, loss[loss=0.2203, simple_loss=0.2912, pruned_loss=0.07469, over 21673.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3234, pruned_loss=0.09718, over 4291450.21 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:27:23,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-19 06:27:44,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=402354.0, ans=0.125 2023-06-19 06:27:49,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.918e+02 3.547e+02 4.372e+02 9.416e+02, threshold=7.093e+02, percent-clipped=6.0 2023-06-19 06:28:18,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=402414.0, ans=0.2 2023-06-19 06:28:43,526 INFO [train.py:996] (1/4) Epoch 3, batch 6100, loss[loss=0.2661, simple_loss=0.3272, pruned_loss=0.1025, over 21548.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3226, pruned_loss=0.09634, over 4287328.83 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:28:59,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=402534.0, ans=0.0 2023-06-19 06:29:36,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. 
limit=15.0 2023-06-19 06:29:37,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=402654.0, ans=0.125 2023-06-19 06:29:58,495 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:30:10,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=402714.0, ans=0.125 2023-06-19 06:30:24,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=402774.0, ans=0.125 2023-06-19 06:30:30,035 INFO [train.py:996] (1/4) Epoch 3, batch 6150, loss[loss=0.2724, simple_loss=0.3782, pruned_loss=0.08335, over 19811.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3278, pruned_loss=0.1004, over 4287725.46 frames. ], batch size: 703, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:30:44,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=402834.0, ans=0.125 2023-06-19 06:31:15,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=402954.0, ans=0.0 2023-06-19 06:31:21,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 3.076e+02 3.570e+02 4.379e+02 8.300e+02, threshold=7.140e+02, percent-clipped=3.0 2023-06-19 06:31:21,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=402954.0, ans=0.0 2023-06-19 06:32:15,744 INFO [train.py:996] (1/4) Epoch 3, batch 6200, loss[loss=0.3229, simple_loss=0.3711, pruned_loss=0.1374, over 21449.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3327, pruned_loss=0.1013, over 4277440.29 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:32:23,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=403134.0, ans=0.125 2023-06-19 06:33:04,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=403254.0, ans=0.015 2023-06-19 06:33:27,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=403314.0, ans=0.125 2023-06-19 06:33:33,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-19 06:33:45,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.68 vs. limit=15.0 2023-06-19 06:34:08,010 INFO [train.py:996] (1/4) Epoch 3, batch 6250, loss[loss=0.2355, simple_loss=0.3233, pruned_loss=0.07386, over 21371.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3383, pruned_loss=0.1016, over 4280466.03 frames. 
], batch size: 194, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:35:08,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.015e+02 3.767e+02 4.898e+02 1.129e+03, threshold=7.534e+02, percent-clipped=8.0 2023-06-19 06:35:12,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=403554.0, ans=0.5 2023-06-19 06:35:24,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=403614.0, ans=0.0 2023-06-19 06:35:47,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=403674.0, ans=0.05 2023-06-19 06:36:01,198 INFO [train.py:996] (1/4) Epoch 3, batch 6300, loss[loss=0.2608, simple_loss=0.3461, pruned_loss=0.08773, over 21761.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3417, pruned_loss=0.1005, over 4282924.89 frames. ], batch size: 298, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:36:21,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=403794.0, ans=0.125 2023-06-19 06:36:34,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=403794.0, ans=0.0 2023-06-19 06:36:37,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.61 vs. limit=10.0 2023-06-19 06:36:55,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=403854.0, ans=0.125 2023-06-19 06:36:59,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=403854.0, ans=0.0 2023-06-19 06:37:17,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=403914.0, ans=0.0 2023-06-19 06:37:22,900 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:37:41,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=403974.0, ans=0.0 2023-06-19 06:37:45,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-19 06:37:45,960 INFO [train.py:996] (1/4) Epoch 3, batch 6350, loss[loss=0.34, simple_loss=0.3954, pruned_loss=0.1423, over 21350.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.3463, pruned_loss=0.1061, over 4290933.37 frames. ], batch size: 159, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:38:38,185 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.096e+02 3.646e+02 4.304e+02 8.936e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-19 06:38:44,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=404154.0, ans=10.0 2023-06-19 06:39:19,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=404274.0, ans=0.0 2023-06-19 06:39:31,456 INFO [train.py:996] (1/4) Epoch 3, batch 6400, loss[loss=0.3048, simple_loss=0.3678, pruned_loss=0.1209, over 21560.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3537, pruned_loss=0.1115, over 4289966.03 frames. 
], batch size: 131, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:39:44,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-19 06:40:15,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=404454.0, ans=0.0 2023-06-19 06:41:17,044 INFO [train.py:996] (1/4) Epoch 3, batch 6450, loss[loss=0.227, simple_loss=0.3046, pruned_loss=0.07468, over 21245.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3546, pruned_loss=0.1103, over 4291297.38 frames. ], batch size: 159, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:41:19,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=404634.0, ans=0.5 2023-06-19 06:42:09,370 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.087e+02 4.205e+02 5.976e+02 1.329e+03, threshold=8.410e+02, percent-clipped=11.0 2023-06-19 06:43:02,057 INFO [train.py:996] (1/4) Epoch 3, batch 6500, loss[loss=0.2864, simple_loss=0.3539, pruned_loss=0.1094, over 21579.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3484, pruned_loss=0.109, over 4287522.37 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:43:15,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2023-06-19 06:43:21,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404934.0, ans=0.1 2023-06-19 06:43:37,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=404994.0, ans=12.0 2023-06-19 06:43:44,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-19 06:43:50,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=405054.0, ans=0.1 2023-06-19 06:44:22,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=405114.0, ans=0.2 2023-06-19 06:44:22,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=405114.0, ans=0.125 2023-06-19 06:44:24,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-19 06:44:54,411 INFO [train.py:996] (1/4) Epoch 3, batch 6550, loss[loss=0.2607, simple_loss=0.3294, pruned_loss=0.09598, over 21832.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3458, pruned_loss=0.1071, over 4290590.05 frames. ], batch size: 391, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:45:15,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.94 vs. 
limit=15.0 2023-06-19 06:45:41,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.916e+02 3.526e+02 4.372e+02 9.339e+02, threshold=7.052e+02, percent-clipped=1.0 2023-06-19 06:45:53,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=405414.0, ans=0.125 2023-06-19 06:46:13,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=12.0 2023-06-19 06:46:38,341 INFO [train.py:996] (1/4) Epoch 3, batch 6600, loss[loss=0.2691, simple_loss=0.3204, pruned_loss=0.1089, over 21563.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3387, pruned_loss=0.1063, over 4277401.53 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:48:14,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=405774.0, ans=0.125 2023-06-19 06:48:23,538 INFO [train.py:996] (1/4) Epoch 3, batch 6650, loss[loss=0.2719, simple_loss=0.3285, pruned_loss=0.1076, over 21546.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.331, pruned_loss=0.1038, over 4278894.71 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:48:27,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=405834.0, ans=0.125 2023-06-19 06:49:17,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.970e+02 3.451e+02 4.323e+02 7.420e+02, threshold=6.902e+02, percent-clipped=1.0 2023-06-19 06:49:35,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=406014.0, ans=0.0 2023-06-19 06:49:35,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406014.0, ans=0.1 2023-06-19 06:50:07,766 INFO [train.py:996] (1/4) Epoch 3, batch 6700, loss[loss=0.2771, simple_loss=0.3303, pruned_loss=0.1119, over 21497.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3253, pruned_loss=0.1028, over 4283155.51 frames. ], batch size: 195, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:50:12,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-19 06:50:25,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-19 06:50:47,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406254.0, ans=0.125 2023-06-19 06:50:52,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=406254.0, ans=0.125 2023-06-19 06:51:00,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406254.0, ans=0.1 2023-06-19 06:51:26,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=406374.0, ans=0.2 2023-06-19 06:51:52,461 INFO [train.py:996] (1/4) Epoch 3, batch 6750, loss[loss=0.3368, simple_loss=0.3591, pruned_loss=0.1572, over 21606.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3243, pruned_loss=0.1035, over 4286839.95 frames. 
], batch size: 508, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:51:54,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=406434.0, ans=0.0 2023-06-19 06:52:40,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.875e+02 3.278e+02 4.228e+02 8.254e+02, threshold=6.556e+02, percent-clipped=2.0 2023-06-19 06:52:59,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=406614.0, ans=0.125 2023-06-19 06:52:59,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-19 06:53:04,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=406614.0, ans=0.125 2023-06-19 06:53:34,989 INFO [train.py:996] (1/4) Epoch 3, batch 6800, loss[loss=0.2599, simple_loss=0.3202, pruned_loss=0.09979, over 21436.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3262, pruned_loss=0.1062, over 4294836.84 frames. ], batch size: 131, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:54:17,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406854.0, ans=0.125 2023-06-19 06:54:19,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=406854.0, ans=0.0 2023-06-19 06:54:19,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-19 06:54:40,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=12.0 2023-06-19 06:54:45,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=406914.0, ans=15.0 2023-06-19 06:55:19,145 INFO [train.py:996] (1/4) Epoch 3, batch 6850, loss[loss=0.2533, simple_loss=0.3066, pruned_loss=0.09997, over 21863.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3256, pruned_loss=0.1075, over 4281849.03 frames. ], batch size: 373, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:55:29,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=407034.0, ans=0.0 2023-06-19 06:55:31,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407034.0, ans=0.1 2023-06-19 06:56:08,179 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.177e+02 3.620e+02 4.749e+02 9.271e+02, threshold=7.240e+02, percent-clipped=3.0 2023-06-19 06:56:10,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=407154.0, ans=0.1 2023-06-19 06:56:30,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=407214.0, ans=0.0 2023-06-19 06:56:58,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-19 06:57:00,788 INFO [train.py:996] (1/4) Epoch 3, batch 6900, loss[loss=0.2624, simple_loss=0.3489, pruned_loss=0.08795, over 21685.00 frames. 
], tot_loss[loss=0.2711, simple_loss=0.3285, pruned_loss=0.1069, over 4280559.52 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:57:01,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=407334.0, ans=0.125 2023-06-19 06:57:43,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=407454.0, ans=10.0 2023-06-19 06:57:52,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-19 06:58:18,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-19 06:58:19,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=407514.0, ans=0.125 2023-06-19 06:58:46,754 INFO [train.py:996] (1/4) Epoch 3, batch 6950, loss[loss=0.3145, simple_loss=0.369, pruned_loss=0.13, over 21316.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.331, pruned_loss=0.1037, over 4279394.06 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 06:59:40,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=407754.0, ans=15.0 2023-06-19 06:59:42,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.993e+02 3.659e+02 4.526e+02 7.412e+02, threshold=7.319e+02, percent-clipped=1.0 2023-06-19 07:00:13,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=407874.0, ans=0.04949747468305833 2023-06-19 07:00:27,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407874.0, ans=0.1 2023-06-19 07:00:30,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=407934.0, ans=0.1 2023-06-19 07:00:32,082 INFO [train.py:996] (1/4) Epoch 3, batch 7000, loss[loss=0.2728, simple_loss=0.3195, pruned_loss=0.1131, over 21838.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3369, pruned_loss=0.107, over 4279852.37 frames. ], batch size: 107, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:01:23,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=408054.0, ans=0.125 2023-06-19 07:01:40,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=408114.0, ans=0.04949747468305833 2023-06-19 07:02:04,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=408174.0, ans=0.125 2023-06-19 07:02:19,505 INFO [train.py:996] (1/4) Epoch 3, batch 7050, loss[loss=0.245, simple_loss=0.3108, pruned_loss=0.08958, over 21501.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3362, pruned_loss=0.1068, over 4271470.82 frames. ], batch size: 194, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:02:20,667 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. 
limit=22.5 2023-06-19 07:02:29,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=408234.0, ans=0.035 2023-06-19 07:02:33,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-19 07:02:38,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=408234.0, ans=0.125 2023-06-19 07:02:44,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-19 07:03:19,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.081e+02 3.762e+02 4.566e+02 1.137e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 07:03:36,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-19 07:03:50,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=408474.0, ans=0.07 2023-06-19 07:04:06,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=408474.0, ans=0.0 2023-06-19 07:04:10,020 INFO [train.py:996] (1/4) Epoch 3, batch 7100, loss[loss=0.2236, simple_loss=0.2946, pruned_loss=0.07629, over 21403.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3395, pruned_loss=0.1077, over 4276798.99 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:04:38,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=408594.0, ans=0.125 2023-06-19 07:05:38,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=8.0 2023-06-19 07:05:47,749 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.785e-03 2023-06-19 07:05:51,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=408774.0, ans=0.125 2023-06-19 07:05:55,530 INFO [train.py:996] (1/4) Epoch 3, batch 7150, loss[loss=0.1884, simple_loss=0.2703, pruned_loss=0.05324, over 21692.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3343, pruned_loss=0.104, over 4274920.19 frames. ], batch size: 298, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:06:57,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.041e+02 3.413e+02 3.887e+02 5.883e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-19 07:07:40,870 INFO [train.py:996] (1/4) Epoch 3, batch 7200, loss[loss=0.265, simple_loss=0.3216, pruned_loss=0.1042, over 21774.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3376, pruned_loss=0.1073, over 4280555.30 frames. ], batch size: 112, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:07:58,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=409134.0, ans=0.125 2023-06-19 07:08:02,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.88 vs. 
limit=15.0 2023-06-19 07:08:37,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-19 07:08:47,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=409314.0, ans=0.0 2023-06-19 07:09:07,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409374.0, ans=0.1 2023-06-19 07:09:23,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=409374.0, ans=0.2 2023-06-19 07:09:32,310 INFO [train.py:996] (1/4) Epoch 3, batch 7250, loss[loss=0.2594, simple_loss=0.3123, pruned_loss=0.1032, over 21555.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3354, pruned_loss=0.1087, over 4277231.24 frames. ], batch size: 414, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:09:48,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409494.0, ans=0.1 2023-06-19 07:10:01,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=409494.0, ans=0.125 2023-06-19 07:10:07,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=409494.0, ans=0.04949747468305833 2023-06-19 07:10:24,985 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.049e+02 3.903e+02 5.201e+02 1.242e+03, threshold=7.806e+02, percent-clipped=6.0 2023-06-19 07:10:30,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=409614.0, ans=10.0 2023-06-19 07:10:52,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=409674.0, ans=0.0 2023-06-19 07:11:07,682 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-19 07:11:08,783 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:11:13,543 INFO [train.py:996] (1/4) Epoch 3, batch 7300, loss[loss=0.2016, simple_loss=0.2609, pruned_loss=0.07119, over 21825.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3298, pruned_loss=0.107, over 4263892.32 frames. ], batch size: 98, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:11:18,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409734.0, ans=0.1 2023-06-19 07:11:45,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.51 vs. 
limit=6.0 2023-06-19 07:12:17,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=409914.0, ans=0.0 2023-06-19 07:12:27,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409914.0, ans=0.1 2023-06-19 07:12:51,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=409974.0, ans=0.0 2023-06-19 07:12:59,458 INFO [train.py:996] (1/4) Epoch 3, batch 7350, loss[loss=0.3297, simple_loss=0.3719, pruned_loss=0.1437, over 21305.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3267, pruned_loss=0.1078, over 4265118.07 frames. ], batch size: 549, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:13:33,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=410094.0, ans=0.0 2023-06-19 07:13:38,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=410094.0, ans=0.125 2023-06-19 07:13:55,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=410154.0, ans=0.0 2023-06-19 07:13:59,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.138e+02 3.671e+02 4.789e+02 1.075e+03, threshold=7.343e+02, percent-clipped=3.0 2023-06-19 07:14:06,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=410214.0, ans=0.125 2023-06-19 07:14:55,693 INFO [train.py:996] (1/4) Epoch 3, batch 7400, loss[loss=0.3139, simple_loss=0.393, pruned_loss=0.1174, over 21471.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3323, pruned_loss=0.1099, over 4270450.07 frames. ], batch size: 471, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:16:48,336 INFO [train.py:996] (1/4) Epoch 3, batch 7450, loss[loss=0.249, simple_loss=0.3072, pruned_loss=0.09543, over 21818.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3291, pruned_loss=0.108, over 4255170.87 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:17:02,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=410634.0, ans=0.09899494936611666 2023-06-19 07:17:04,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=410694.0, ans=0.0 2023-06-19 07:17:13,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=410694.0, ans=0.2 2023-06-19 07:17:13,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=410694.0, ans=0.0 2023-06-19 07:17:15,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. 
limit=6.0 2023-06-19 07:17:30,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=410754.0, ans=0.95 2023-06-19 07:17:41,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.990e+02 3.597e+02 4.537e+02 7.540e+02, threshold=7.195e+02, percent-clipped=1.0 2023-06-19 07:17:52,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-19 07:17:52,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-19 07:18:35,617 INFO [train.py:996] (1/4) Epoch 3, batch 7500, loss[loss=0.2936, simple_loss=0.3408, pruned_loss=0.1232, over 21613.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3324, pruned_loss=0.1085, over 4260144.67 frames. ], batch size: 415, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:18:39,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-19 07:18:48,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=410934.0, ans=0.125 2023-06-19 07:18:55,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-19 07:19:02,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410994.0, ans=0.1 2023-06-19 07:19:09,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=411054.0, ans=0.125 2023-06-19 07:19:32,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=411054.0, ans=0.1 2023-06-19 07:20:24,686 INFO [train.py:996] (1/4) Epoch 3, batch 7550, loss[loss=0.389, simple_loss=0.4486, pruned_loss=0.1647, over 21440.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3419, pruned_loss=0.1076, over 4263391.58 frames. ], batch size: 507, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:20:41,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=411294.0, ans=0.125 2023-06-19 07:20:49,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=411294.0, ans=0.1 2023-06-19 07:21:22,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.181e+02 3.735e+02 4.565e+02 8.412e+02, threshold=7.470e+02, percent-clipped=4.0 2023-06-19 07:21:41,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=411414.0, ans=0.125 2023-06-19 07:22:03,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=411474.0, ans=0.0 2023-06-19 07:22:09,820 INFO [train.py:996] (1/4) Epoch 3, batch 7600, loss[loss=0.3664, simple_loss=0.3932, pruned_loss=0.1698, over 21816.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3424, pruned_loss=0.1069, over 4272679.75 frames. 
], batch size: 508, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:22:10,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=411534.0, ans=0.125 2023-06-19 07:23:51,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=411774.0, ans=0.2 2023-06-19 07:23:55,463 INFO [train.py:996] (1/4) Epoch 3, batch 7650, loss[loss=0.2871, simple_loss=0.3442, pruned_loss=0.115, over 21822.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.342, pruned_loss=0.1083, over 4284123.55 frames. ], batch size: 441, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:24:54,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.811e+02 3.352e+02 3.855e+02 5.541e+02, threshold=6.704e+02, percent-clipped=0.0 2023-06-19 07:25:44,614 INFO [train.py:996] (1/4) Epoch 3, batch 7700, loss[loss=0.2881, simple_loss=0.3462, pruned_loss=0.115, over 21798.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3458, pruned_loss=0.1119, over 4281322.79 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:25:47,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-19 07:27:01,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=412314.0, ans=0.125 2023-06-19 07:27:01,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=412314.0, ans=0.0 2023-06-19 07:27:32,040 INFO [train.py:996] (1/4) Epoch 3, batch 7750, loss[loss=0.2383, simple_loss=0.2817, pruned_loss=0.09742, over 20783.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3506, pruned_loss=0.1124, over 4279533.54 frames. ], batch size: 609, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:28:06,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=412494.0, ans=0.125 2023-06-19 07:28:15,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=412494.0, ans=0.125 2023-06-19 07:28:33,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=412554.0, ans=0.05 2023-06-19 07:28:46,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.582e+02 4.541e+02 5.903e+02 1.038e+03, threshold=9.082e+02, percent-clipped=9.0 2023-06-19 07:28:59,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=412614.0, ans=0.125 2023-06-19 07:29:10,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.92 vs. limit=22.5 2023-06-19 07:29:24,307 INFO [train.py:996] (1/4) Epoch 3, batch 7800, loss[loss=0.255, simple_loss=0.2889, pruned_loss=0.1106, over 20742.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3505, pruned_loss=0.1121, over 4278586.62 frames. 
], batch size: 609, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:30:18,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=412854.0, ans=0.125 2023-06-19 07:30:53,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=412914.0, ans=0.0 2023-06-19 07:31:12,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=413034.0, ans=0.0 2023-06-19 07:31:13,605 INFO [train.py:996] (1/4) Epoch 3, batch 7850, loss[loss=0.2672, simple_loss=0.317, pruned_loss=0.1087, over 21869.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3441, pruned_loss=0.1113, over 4261859.44 frames. ], batch size: 373, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:31:29,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-19 07:31:59,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=413154.0, ans=0.125 2023-06-19 07:32:23,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.181e+02 3.685e+02 4.397e+02 7.326e+02, threshold=7.370e+02, percent-clipped=0.0 2023-06-19 07:32:34,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=413214.0, ans=0.125 2023-06-19 07:32:51,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.25 vs. limit=15.0 2023-06-19 07:33:08,238 INFO [train.py:996] (1/4) Epoch 3, batch 7900, loss[loss=0.244, simple_loss=0.299, pruned_loss=0.09447, over 21221.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3405, pruned_loss=0.1114, over 4267424.71 frames. ], batch size: 143, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:33:32,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=413394.0, ans=0.0 2023-06-19 07:33:39,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=413394.0, ans=0.125 2023-06-19 07:33:41,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413394.0, ans=0.1 2023-06-19 07:33:57,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=413454.0, ans=0.125 2023-06-19 07:33:57,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=413454.0, ans=0.0 2023-06-19 07:34:56,178 INFO [train.py:996] (1/4) Epoch 3, batch 7950, loss[loss=0.2686, simple_loss=0.3238, pruned_loss=0.1067, over 21916.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3438, pruned_loss=0.1096, over 4262698.13 frames. 
], batch size: 118, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:35:01,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=413634.0, ans=0.125 2023-06-19 07:35:14,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413694.0, ans=0.1 2023-06-19 07:35:17,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=413694.0, ans=0.125 2023-06-19 07:35:31,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=413694.0, ans=0.125 2023-06-19 07:35:56,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 2.859e+02 3.738e+02 4.773e+02 1.037e+03, threshold=7.477e+02, percent-clipped=3.0 2023-06-19 07:36:18,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=413814.0, ans=0.125 2023-06-19 07:36:30,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=413874.0, ans=0.02 2023-06-19 07:36:38,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=413874.0, ans=12.0 2023-06-19 07:36:44,338 INFO [train.py:996] (1/4) Epoch 3, batch 8000, loss[loss=0.3111, simple_loss=0.37, pruned_loss=0.1261, over 21574.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3494, pruned_loss=0.1118, over 4264457.02 frames. ], batch size: 263, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:36:58,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413934.0, ans=0.1 2023-06-19 07:37:28,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=413994.0, ans=0.1 2023-06-19 07:37:38,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=414054.0, ans=0.04949747468305833 2023-06-19 07:38:16,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=414114.0, ans=0.05 2023-06-19 07:38:22,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=414174.0, ans=0.0 2023-06-19 07:38:35,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=414174.0, ans=0.0 2023-06-19 07:38:46,681 INFO [train.py:996] (1/4) Epoch 3, batch 8050, loss[loss=0.3429, simple_loss=0.4135, pruned_loss=0.1361, over 21648.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3553, pruned_loss=0.1124, over 4271473.63 frames. 
], batch size: 441, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:39:47,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.438e+02 3.985e+02 5.129e+02 7.856e+02, threshold=7.969e+02, percent-clipped=2.0 2023-06-19 07:40:00,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=414414.0, ans=0.125 2023-06-19 07:40:12,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=414474.0, ans=0.0 2023-06-19 07:40:35,173 INFO [train.py:996] (1/4) Epoch 3, batch 8100, loss[loss=0.2953, simple_loss=0.3521, pruned_loss=0.1193, over 21764.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3528, pruned_loss=0.1131, over 4277779.17 frames. ], batch size: 112, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:41:32,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=414654.0, ans=0.125 2023-06-19 07:42:10,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=414774.0, ans=0.125 2023-06-19 07:42:24,518 INFO [train.py:996] (1/4) Epoch 3, batch 8150, loss[loss=0.3777, simple_loss=0.4586, pruned_loss=0.1484, over 21521.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3586, pruned_loss=0.1143, over 4273534.78 frames. ], batch size: 473, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:43:04,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=414894.0, ans=0.125 2023-06-19 07:43:38,475 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.875e+02 3.410e+02 4.043e+02 9.100e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-19 07:43:47,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415014.0, ans=0.1 2023-06-19 07:44:05,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415074.0, ans=0.1 2023-06-19 07:44:12,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-19 07:44:13,616 INFO [train.py:996] (1/4) Epoch 3, batch 8200, loss[loss=0.2633, simple_loss=0.3064, pruned_loss=0.1101, over 21246.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3511, pruned_loss=0.1117, over 4266846.23 frames. ], batch size: 159, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:44:59,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.34 vs. limit=12.0 2023-06-19 07:45:10,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=415254.0, ans=0.125 2023-06-19 07:45:24,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-19 07:45:37,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=415314.0, ans=0.125 2023-06-19 07:45:58,711 INFO [train.py:996] (1/4) Epoch 3, batch 8250, loss[loss=0.2483, simple_loss=0.3244, pruned_loss=0.08609, over 21329.00 frames. 
], tot_loss[loss=0.2866, simple_loss=0.3492, pruned_loss=0.112, over 4264265.83 frames. ], batch size: 159, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:46:28,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=415494.0, ans=0.0 2023-06-19 07:46:52,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=415554.0, ans=0.07 2023-06-19 07:47:12,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.026e+02 3.791e+02 5.480e+02 8.265e+02, threshold=7.583e+02, percent-clipped=10.0 2023-06-19 07:47:47,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415674.0, ans=0.1 2023-06-19 07:47:52,048 INFO [train.py:996] (1/4) Epoch 3, batch 8300, loss[loss=0.3218, simple_loss=0.3833, pruned_loss=0.1302, over 21790.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3498, pruned_loss=0.1092, over 4257511.62 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:48:35,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415854.0, ans=0.1 2023-06-19 07:48:51,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=415854.0, ans=0.2 2023-06-19 07:49:06,299 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:49:12,836 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:49:38,472 INFO [train.py:996] (1/4) Epoch 3, batch 8350, loss[loss=0.2565, simple_loss=0.3413, pruned_loss=0.08583, over 21687.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3476, pruned_loss=0.1064, over 4260332.76 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:50:07,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=416094.0, ans=0.125 2023-06-19 07:50:10,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-19 07:50:46,334 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.962e+02 3.800e+02 4.795e+02 8.641e+02, threshold=7.601e+02, percent-clipped=3.0 2023-06-19 07:50:58,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=416214.0, ans=0.0 2023-06-19 07:51:23,694 INFO [train.py:996] (1/4) Epoch 3, batch 8400, loss[loss=0.1658, simple_loss=0.2216, pruned_loss=0.05506, over 17661.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3448, pruned_loss=0.1035, over 4261719.58 frames. 
], batch size: 66, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:51:34,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=416334.0, ans=0.0 2023-06-19 07:51:40,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=416334.0, ans=0.125 2023-06-19 07:52:47,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=416574.0, ans=0.2 2023-06-19 07:52:47,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-19 07:53:03,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-19 07:53:07,335 INFO [train.py:996] (1/4) Epoch 3, batch 8450, loss[loss=0.3124, simple_loss=0.4275, pruned_loss=0.09862, over 20794.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3428, pruned_loss=0.1031, over 4270413.10 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:53:37,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=416694.0, ans=0.5 2023-06-19 07:53:46,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=416754.0, ans=0.0 2023-06-19 07:54:13,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-19 07:54:15,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.539e+02 3.236e+02 3.912e+02 6.365e+02, threshold=6.471e+02, percent-clipped=0.0 2023-06-19 07:54:59,948 INFO [train.py:996] (1/4) Epoch 3, batch 8500, loss[loss=0.2636, simple_loss=0.3037, pruned_loss=0.1117, over 21262.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3395, pruned_loss=0.1048, over 4273375.77 frames. ], batch size: 144, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:55:14,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=416934.0, ans=0.5 2023-06-19 07:55:22,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=416994.0, ans=0.2 2023-06-19 07:55:31,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. 
limit=15.0 2023-06-19 07:55:40,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=417054.0, ans=0.125 2023-06-19 07:55:49,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=417054.0, ans=0.125 2023-06-19 07:56:07,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=417114.0, ans=0.0 2023-06-19 07:56:25,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=417174.0, ans=0.0 2023-06-19 07:56:34,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417174.0, ans=0.125 2023-06-19 07:56:36,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=417174.0, ans=0.125 2023-06-19 07:56:48,417 INFO [train.py:996] (1/4) Epoch 3, batch 8550, loss[loss=0.3245, simple_loss=0.3983, pruned_loss=0.1253, over 21760.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3459, pruned_loss=0.1092, over 4269042.41 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:57:01,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-19 07:57:39,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=417354.0, ans=0.0 2023-06-19 07:57:47,725 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.294e+02 4.189e+02 5.052e+02 1.014e+03, threshold=8.378e+02, percent-clipped=9.0 2023-06-19 07:58:12,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-19 07:58:26,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-19 07:58:32,097 INFO [train.py:996] (1/4) Epoch 3, batch 8600, loss[loss=0.4271, simple_loss=0.4546, pruned_loss=0.1998, over 21403.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3525, pruned_loss=0.1111, over 4273600.03 frames. ], batch size: 471, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:58:34,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=417534.0, ans=0.125 2023-06-19 07:58:43,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-19 07:59:10,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.04 vs. limit=10.0 2023-06-19 07:59:35,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=417714.0, ans=0.0 2023-06-19 08:00:13,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=417774.0, ans=0.125 2023-06-19 08:00:20,143 INFO [train.py:996] (1/4) Epoch 3, batch 8650, loss[loss=0.2841, simple_loss=0.3699, pruned_loss=0.09915, over 21682.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3561, pruned_loss=0.112, over 4273021.64 frames. 
], batch size: 389, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:00:30,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417834.0, ans=0.125 2023-06-19 08:00:50,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-19 08:00:58,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 08:01:28,270 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.885e+02 3.395e+02 4.112e+02 7.467e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-19 08:01:49,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418074.0, ans=0.1 2023-06-19 08:02:00,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=418074.0, ans=0.0 2023-06-19 08:02:05,236 INFO [train.py:996] (1/4) Epoch 3, batch 8700, loss[loss=0.256, simple_loss=0.3239, pruned_loss=0.09404, over 15566.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3458, pruned_loss=0.1064, over 4261944.07 frames. ], batch size: 61, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:02:05,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=418134.0, ans=0.1 2023-06-19 08:02:33,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=418194.0, ans=0.125 2023-06-19 08:02:37,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=418194.0, ans=12.0 2023-06-19 08:02:42,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-19 08:02:54,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=418254.0, ans=0.125 2023-06-19 08:02:56,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=418254.0, ans=0.125 2023-06-19 08:03:13,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=418314.0, ans=0.0 2023-06-19 08:03:41,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=418374.0, ans=0.2 2023-06-19 08:03:57,796 INFO [train.py:996] (1/4) Epoch 3, batch 8750, loss[loss=0.3552, simple_loss=0.3822, pruned_loss=0.1641, over 21714.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3417, pruned_loss=0.1066, over 4257758.44 frames. 
], batch size: 508, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:04:11,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=418434.0, ans=0.015 2023-06-19 08:04:23,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=418494.0, ans=0.1 2023-06-19 08:05:05,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=418614.0, ans=0.125 2023-06-19 08:05:06,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 3.018e+02 3.630e+02 4.545e+02 8.299e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 08:05:44,847 INFO [train.py:996] (1/4) Epoch 3, batch 8800, loss[loss=0.2247, simple_loss=0.3284, pruned_loss=0.06051, over 20790.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3503, pruned_loss=0.1101, over 4261425.25 frames. ], batch size: 608, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:05:53,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=418734.0, ans=0.125 2023-06-19 08:06:57,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-19 08:07:41,989 INFO [train.py:996] (1/4) Epoch 3, batch 8850, loss[loss=0.2849, simple_loss=0.3745, pruned_loss=0.09763, over 21765.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3585, pruned_loss=0.1123, over 4261218.43 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:07:44,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419034.0, ans=0.1 2023-06-19 08:07:51,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=419034.0, ans=0.0 2023-06-19 08:08:16,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=419094.0, ans=0.125 2023-06-19 08:08:46,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.554e+02 4.291e+02 5.667e+02 9.091e+02, threshold=8.581e+02, percent-clipped=5.0 2023-06-19 08:08:52,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-19 08:09:27,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=419274.0, ans=22.5 2023-06-19 08:09:29,171 INFO [train.py:996] (1/4) Epoch 3, batch 8900, loss[loss=0.2526, simple_loss=0.3144, pruned_loss=0.09545, over 21748.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3529, pruned_loss=0.1117, over 4267300.22 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:09:36,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=419334.0, ans=0.5 2023-06-19 08:10:19,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419454.0, ans=0.1 2023-06-19 08:10:50,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.00 vs. 
limit=12.0 2023-06-19 08:11:15,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-19 08:11:18,143 INFO [train.py:996] (1/4) Epoch 3, batch 8950, loss[loss=0.3358, simple_loss=0.4074, pruned_loss=0.132, over 21628.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3543, pruned_loss=0.111, over 4266479.01 frames. ], batch size: 441, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:11:18,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=419634.0, ans=0.95 2023-06-19 08:11:33,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419634.0, ans=0.1 2023-06-19 08:11:49,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=419694.0, ans=0.0 2023-06-19 08:12:01,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419694.0, ans=0.1 2023-06-19 08:12:27,662 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 3.236e+02 4.208e+02 5.168e+02 9.134e+02, threshold=8.417e+02, percent-clipped=1.0 2023-06-19 08:12:50,291 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.073e-02 2023-06-19 08:13:00,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419874.0, ans=0.1 2023-06-19 08:13:04,996 INFO [train.py:996] (1/4) Epoch 3, batch 9000, loss[loss=0.2428, simple_loss=0.2982, pruned_loss=0.09368, over 21858.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3473, pruned_loss=0.1108, over 4263931.52 frames. ], batch size: 107, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:13:04,997 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 08:13:24,319 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3793, pruned_loss=0.08906, over 1796401.00 frames. 2023-06-19 08:13:24,319 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 08:13:35,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=419934.0, ans=0.125 2023-06-19 08:14:40,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420114.0, ans=0.125 2023-06-19 08:15:01,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=420174.0, ans=0.125 2023-06-19 08:15:19,186 INFO [train.py:996] (1/4) Epoch 3, batch 9050, loss[loss=0.2287, simple_loss=0.2998, pruned_loss=0.07876, over 21218.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3417, pruned_loss=0.1069, over 4267246.44 frames. ], batch size: 176, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:15:21,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=420234.0, ans=0.0 2023-06-19 08:15:38,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=15.0 2023-06-19 08:16:23,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 3.182e+02 3.849e+02 4.740e+02 7.257e+02, threshold=7.697e+02, percent-clipped=0.0 2023-06-19 08:16:36,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=420414.0, ans=0.125 2023-06-19 08:16:58,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-19 08:17:03,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-19 08:17:05,676 INFO [train.py:996] (1/4) Epoch 3, batch 9100, loss[loss=0.269, simple_loss=0.356, pruned_loss=0.091, over 21700.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3492, pruned_loss=0.1103, over 4275310.92 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:17:32,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=420594.0, ans=0.1 2023-06-19 08:17:33,818 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:18:03,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=420654.0, ans=0.125 2023-06-19 08:18:52,698 INFO [train.py:996] (1/4) Epoch 3, batch 9150, loss[loss=0.2639, simple_loss=0.3431, pruned_loss=0.09236, over 21649.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3518, pruned_loss=0.1071, over 4273994.71 frames. ], batch size: 230, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:19:40,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=420954.0, ans=0.125 2023-06-19 08:20:00,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.911e+02 3.357e+02 4.018e+02 6.144e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-19 08:20:09,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421014.0, ans=0.1 2023-06-19 08:20:19,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=421014.0, ans=0.125 2023-06-19 08:20:45,080 INFO [train.py:996] (1/4) Epoch 3, batch 9200, loss[loss=0.2405, simple_loss=0.3197, pruned_loss=0.08066, over 21385.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3534, pruned_loss=0.1064, over 4265669.40 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:21:12,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=421194.0, ans=0.07 2023-06-19 08:21:16,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421194.0, ans=0.125 2023-06-19 08:22:05,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=421314.0, ans=0.125 2023-06-19 08:22:16,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=22.5 2023-06-19 08:22:31,093 INFO [train.py:996] (1/4) Epoch 3, batch 9250, loss[loss=0.2531, simple_loss=0.3063, pruned_loss=0.09998, over 21563.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3563, pruned_loss=0.1111, over 4271267.13 frames. ], batch size: 247, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:22:52,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.70 vs. limit=6.0 2023-06-19 08:23:03,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=421494.0, ans=0.0 2023-06-19 08:23:39,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.086e+02 3.800e+02 4.447e+02 7.339e+02, threshold=7.599e+02, percent-clipped=2.0 2023-06-19 08:24:00,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=421674.0, ans=0.0 2023-06-19 08:24:17,003 INFO [train.py:996] (1/4) Epoch 3, batch 9300, loss[loss=0.2945, simple_loss=0.3837, pruned_loss=0.1027, over 21875.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3492, pruned_loss=0.1101, over 4263455.18 frames. ], batch size: 372, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:24:21,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=421734.0, ans=0.07 2023-06-19 08:24:50,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=421794.0, ans=0.5 2023-06-19 08:25:40,194 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:25:48,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=421974.0, ans=0.09899494936611666 2023-06-19 08:26:11,683 INFO [train.py:996] (1/4) Epoch 3, batch 9350, loss[loss=0.303, simple_loss=0.4203, pruned_loss=0.09283, over 20842.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3577, pruned_loss=0.1119, over 4261545.33 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:26:20,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=422034.0, ans=0.125 2023-06-19 08:26:38,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=422094.0, ans=0.025 2023-06-19 08:27:00,718 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:27:20,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.071e+02 3.690e+02 4.644e+02 6.944e+02, threshold=7.381e+02, percent-clipped=0.0 2023-06-19 08:27:50,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=422274.0, ans=0.2 2023-06-19 08:27:59,133 INFO [train.py:996] (1/4) Epoch 3, batch 9400, loss[loss=0.2401, simple_loss=0.3036, pruned_loss=0.08827, over 21552.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3588, pruned_loss=0.1117, over 4265835.95 frames. 
], batch size: 414, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:29:32,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=422574.0, ans=0.0 2023-06-19 08:29:34,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=422574.0, ans=0.1 2023-06-19 08:29:43,988 INFO [train.py:996] (1/4) Epoch 3, batch 9450, loss[loss=0.2911, simple_loss=0.3307, pruned_loss=0.1257, over 21221.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3491, pruned_loss=0.1098, over 4258633.64 frames. ], batch size: 471, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:30:46,538 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:30:51,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 3.147e+02 3.732e+02 4.957e+02 8.626e+02, threshold=7.464e+02, percent-clipped=5.0 2023-06-19 08:30:53,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=422814.0, ans=0.1 2023-06-19 08:31:28,570 INFO [train.py:996] (1/4) Epoch 3, batch 9500, loss[loss=0.2462, simple_loss=0.3101, pruned_loss=0.09119, over 21163.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3437, pruned_loss=0.108, over 4244415.53 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:31:59,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=422994.0, ans=0.125 2023-06-19 08:32:22,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423054.0, ans=0.1 2023-06-19 08:33:02,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-19 08:33:03,130 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:33:14,242 INFO [train.py:996] (1/4) Epoch 3, batch 9550, loss[loss=0.2827, simple_loss=0.3615, pruned_loss=0.1019, over 21874.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3467, pruned_loss=0.111, over 4262119.42 frames. ], batch size: 316, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:33:32,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=423234.0, ans=0.1 2023-06-19 08:34:20,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.824e+02 3.274e+02 3.853e+02 7.090e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:34:32,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=423414.0, ans=0.0 2023-06-19 08:34:33,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.96 vs. 
limit=22.5 2023-06-19 08:34:51,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=423474.0, ans=0.125 2023-06-19 08:34:55,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=423474.0, ans=0.1 2023-06-19 08:34:58,269 INFO [train.py:996] (1/4) Epoch 3, batch 9600, loss[loss=0.2731, simple_loss=0.3387, pruned_loss=0.1037, over 21844.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3495, pruned_loss=0.1112, over 4270271.15 frames. ], batch size: 332, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:34:59,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=423534.0, ans=22.5 2023-06-19 08:36:29,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=423774.0, ans=0.0 2023-06-19 08:36:39,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=423774.0, ans=0.0 2023-06-19 08:36:45,580 INFO [train.py:996] (1/4) Epoch 3, batch 9650, loss[loss=0.2763, simple_loss=0.3391, pruned_loss=0.1067, over 21561.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3502, pruned_loss=0.1112, over 4266823.78 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:37:58,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.232e+02 3.864e+02 5.587e+02 9.927e+02, threshold=7.728e+02, percent-clipped=9.0 2023-06-19 08:38:32,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=424074.0, ans=0.125 2023-06-19 08:38:40,947 INFO [train.py:996] (1/4) Epoch 3, batch 9700, loss[loss=0.2614, simple_loss=0.3325, pruned_loss=0.09509, over 21483.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3528, pruned_loss=0.1116, over 4275513.05 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:39:05,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=424194.0, ans=0.125 2023-06-19 08:39:24,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=424254.0, ans=0.2 2023-06-19 08:40:18,520 INFO [train.py:996] (1/4) Epoch 3, batch 9750, loss[loss=0.2699, simple_loss=0.3313, pruned_loss=0.1043, over 22000.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3472, pruned_loss=0.1099, over 4265195.49 frames. 
], batch size: 103, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:40:28,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=424434.0, ans=0.0 2023-06-19 08:41:18,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.037e+02 3.853e+02 4.411e+02 7.266e+02, threshold=7.707e+02, percent-clipped=0.0 2023-06-19 08:41:36,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424614.0, ans=0.125 2023-06-19 08:41:48,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=424674.0, ans=0.2 2023-06-19 08:41:52,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=424674.0, ans=0.0 2023-06-19 08:41:55,573 INFO [train.py:996] (1/4) Epoch 3, batch 9800, loss[loss=0.2607, simple_loss=0.3199, pruned_loss=0.1007, over 21592.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3479, pruned_loss=0.1111, over 4271199.26 frames. ], batch size: 195, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:42:27,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-19 08:42:47,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=424854.0, ans=0.0 2023-06-19 08:42:52,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-19 08:43:23,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424974.0, ans=0.1 2023-06-19 08:43:29,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=424974.0, ans=0.0 2023-06-19 08:43:30,941 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:43:39,040 INFO [train.py:996] (1/4) Epoch 3, batch 9850, loss[loss=0.2504, simple_loss=0.3045, pruned_loss=0.09813, over 21724.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3441, pruned_loss=0.1109, over 4274659.32 frames. 
], batch size: 316, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:43:55,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=425034.0, ans=0.0 2023-06-19 08:43:55,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=425034.0, ans=0.0 2023-06-19 08:44:01,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=425094.0, ans=0.125 2023-06-19 08:44:25,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=425154.0, ans=0.125 2023-06-19 08:44:27,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=425154.0, ans=0.125 2023-06-19 08:44:31,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=425154.0, ans=0.125 2023-06-19 08:44:50,369 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.794e+02 3.273e+02 4.007e+02 7.022e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:44:55,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=425214.0, ans=0.125 2023-06-19 08:45:04,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=425274.0, ans=0.0 2023-06-19 08:45:21,538 INFO [train.py:996] (1/4) Epoch 3, batch 9900, loss[loss=0.3247, simple_loss=0.378, pruned_loss=0.1357, over 21387.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3388, pruned_loss=0.1092, over 4271177.98 frames. ], batch size: 131, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:45:26,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=425334.0, ans=0.2 2023-06-19 08:46:09,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=425454.0, ans=0.125 2023-06-19 08:47:10,938 INFO [train.py:996] (1/4) Epoch 3, batch 9950, loss[loss=0.3054, simple_loss=0.3601, pruned_loss=0.1253, over 21252.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3412, pruned_loss=0.112, over 4264361.55 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:48:18,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.027e+02 3.494e+02 4.278e+02 9.586e+02, threshold=6.989e+02, percent-clipped=3.0 2023-06-19 08:49:03,440 INFO [train.py:996] (1/4) Epoch 3, batch 10000, loss[loss=0.298, simple_loss=0.3578, pruned_loss=0.1191, over 20704.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3344, pruned_loss=0.1096, over 4267921.61 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:49:10,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=425934.0, ans=0.0 2023-06-19 08:49:33,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-06-19 08:49:40,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=425994.0, ans=0.0 2023-06-19 08:49:59,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=426054.0, ans=0.0 2023-06-19 08:50:17,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=426114.0, ans=0.1 2023-06-19 08:50:28,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426174.0, ans=0.1 2023-06-19 08:50:45,779 INFO [train.py:996] (1/4) Epoch 3, batch 10050, loss[loss=0.2301, simple_loss=0.3006, pruned_loss=0.07979, over 21550.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3382, pruned_loss=0.1115, over 4272648.47 frames. ], batch size: 230, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:51:09,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=426294.0, ans=0.125 2023-06-19 08:51:44,416 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:51:50,933 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.885e+02 3.512e+02 4.083e+02 6.660e+02, threshold=7.024e+02, percent-clipped=0.0 2023-06-19 08:52:37,731 INFO [train.py:996] (1/4) Epoch 3, batch 10100, loss[loss=0.2978, simple_loss=0.3726, pruned_loss=0.1115, over 21638.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3351, pruned_loss=0.1081, over 4271835.90 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:52:54,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-19 08:53:19,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=426654.0, ans=0.125 2023-06-19 08:54:22,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426834.0, ans=0.1 2023-06-19 08:54:23,734 INFO [train.py:996] (1/4) Epoch 3, batch 10150, loss[loss=0.2728, simple_loss=0.3354, pruned_loss=0.1051, over 21312.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3421, pruned_loss=0.111, over 4272597.22 frames. ], batch size: 176, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:55:35,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.064e+02 3.671e+02 4.442e+02 6.348e+02, threshold=7.343e+02, percent-clipped=0.0 2023-06-19 08:55:37,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=427014.0, ans=0.125 2023-06-19 08:56:07,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-19 08:56:10,007 INFO [train.py:996] (1/4) Epoch 3, batch 10200, loss[loss=0.2403, simple_loss=0.3046, pruned_loss=0.08801, over 21783.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3406, pruned_loss=0.1087, over 4262983.97 frames. 
], batch size: 112, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:56:46,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=427194.0, ans=0.0 2023-06-19 08:57:13,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.26 vs. limit=15.0 2023-06-19 08:57:36,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-19 08:57:57,543 INFO [train.py:996] (1/4) Epoch 3, batch 10250, loss[loss=0.283, simple_loss=0.3347, pruned_loss=0.1156, over 21207.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3357, pruned_loss=0.1023, over 4258406.48 frames. ], batch size: 608, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:57:58,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-19 08:58:15,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=427434.0, ans=0.1 2023-06-19 08:59:07,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.442e+02 2.730e+02 3.285e+02 6.537e+02, threshold=5.460e+02, percent-clipped=0.0 2023-06-19 08:59:26,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=427674.0, ans=0.125 2023-06-19 08:59:41,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=427674.0, ans=0.2 2023-06-19 08:59:43,913 INFO [train.py:996] (1/4) Epoch 3, batch 10300, loss[loss=0.3212, simple_loss=0.3676, pruned_loss=0.1375, over 21640.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.339, pruned_loss=0.1032, over 4266244.23 frames. ], batch size: 230, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:00:00,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=427734.0, ans=0.2 2023-06-19 09:00:41,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=427854.0, ans=0.125 2023-06-19 09:00:49,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=427854.0, ans=0.125 2023-06-19 09:01:05,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-19 09:01:38,602 INFO [train.py:996] (1/4) Epoch 3, batch 10350, loss[loss=0.3114, simple_loss=0.3893, pruned_loss=0.1167, over 21232.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3397, pruned_loss=0.1032, over 4261906.05 frames. ], batch size: 549, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:02:01,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=428094.0, ans=0.125 2023-06-19 09:02:07,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. 
limit=22.5 2023-06-19 09:02:25,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=428154.0, ans=0.2 2023-06-19 09:02:25,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=428154.0, ans=0.125 2023-06-19 09:02:50,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.191e+02 3.696e+02 4.662e+02 9.387e+02, threshold=7.392e+02, percent-clipped=8.0 2023-06-19 09:03:11,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=428274.0, ans=0.125 2023-06-19 09:03:25,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=428334.0, ans=0.1 2023-06-19 09:03:26,633 INFO [train.py:996] (1/4) Epoch 3, batch 10400, loss[loss=0.2808, simple_loss=0.3414, pruned_loss=0.1101, over 20770.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3306, pruned_loss=0.1003, over 4255923.08 frames. ], batch size: 607, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:04:41,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=428514.0, ans=0.0 2023-06-19 09:04:53,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=428514.0, ans=0.125 2023-06-19 09:05:19,047 INFO [train.py:996] (1/4) Epoch 3, batch 10450, loss[loss=0.2693, simple_loss=0.3375, pruned_loss=0.1005, over 21593.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3364, pruned_loss=0.1046, over 4256761.63 frames. ], batch size: 230, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:05:37,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=428634.0, ans=0.125 2023-06-19 09:06:02,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=428754.0, ans=0.2 2023-06-19 09:06:09,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-19 09:06:28,885 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.522e+02 4.182e+02 5.744e+02 1.036e+03, threshold=8.363e+02, percent-clipped=11.0 2023-06-19 09:06:40,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=22.5 2023-06-19 09:07:03,331 INFO [train.py:996] (1/4) Epoch 3, batch 10500, loss[loss=0.2537, simple_loss=0.3085, pruned_loss=0.09942, over 21676.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3364, pruned_loss=0.1036, over 4252831.38 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:07:07,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=428934.0, ans=0.125 2023-06-19 09:07:23,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=428934.0, ans=0.2 2023-06-19 09:08:06,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. 
limit=15.0 2023-06-19 09:08:28,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.98 vs. limit=15.0 2023-06-19 09:08:46,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=429174.0, ans=0.2 2023-06-19 09:08:49,583 INFO [train.py:996] (1/4) Epoch 3, batch 10550, loss[loss=0.2492, simple_loss=0.2969, pruned_loss=0.1008, over 15621.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3312, pruned_loss=0.103, over 4237553.13 frames. ], batch size: 62, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:08:49,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=429234.0, ans=0.1 2023-06-19 09:08:53,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=429234.0, ans=0.0 2023-06-19 09:09:04,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-19 09:09:19,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.97 vs. limit=10.0 2023-06-19 09:09:35,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=429354.0, ans=0.125 2023-06-19 09:10:00,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.936e+02 3.551e+02 4.371e+02 5.985e+02, threshold=7.102e+02, percent-clipped=0.0 2023-06-19 09:10:17,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-19 09:10:22,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429474.0, ans=0.1 2023-06-19 09:10:36,933 INFO [train.py:996] (1/4) Epoch 3, batch 10600, loss[loss=0.268, simple_loss=0.3696, pruned_loss=0.0832, over 21198.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3278, pruned_loss=0.102, over 4241421.36 frames. ], batch size: 548, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:10:52,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=429534.0, ans=0.0 2023-06-19 09:11:21,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=429654.0, ans=0.0 2023-06-19 09:12:02,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=429714.0, ans=0.0 2023-06-19 09:12:31,305 INFO [train.py:996] (1/4) Epoch 3, batch 10650, loss[loss=0.1859, simple_loss=0.2695, pruned_loss=0.05115, over 21724.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3289, pruned_loss=0.09916, over 4242225.36 frames. ], batch size: 298, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:12:33,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=22.5 2023-06-19 09:12:57,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=429894.0, ans=0.125 2023-06-19 09:13:08,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=429894.0, ans=0.0 2023-06-19 09:13:17,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=429954.0, ans=0.125 2023-06-19 09:13:41,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.491e+02 4.438e+02 6.074e+02 1.034e+03, threshold=8.876e+02, percent-clipped=13.0 2023-06-19 09:13:48,746 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:13:59,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=430074.0, ans=0.1 2023-06-19 09:14:17,989 INFO [train.py:996] (1/4) Epoch 3, batch 10700, loss[loss=0.3122, simple_loss=0.3826, pruned_loss=0.1209, over 21750.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3286, pruned_loss=0.09978, over 4246289.19 frames. ], batch size: 124, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:14:58,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.01 vs. limit=12.0 2023-06-19 09:15:22,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=430314.0, ans=0.2 2023-06-19 09:15:23,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430314.0, ans=0.125 2023-06-19 09:15:51,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=430374.0, ans=0.0 2023-06-19 09:16:03,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 09:16:03,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-19 09:16:07,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=430434.0, ans=0.125 2023-06-19 09:16:09,167 INFO [train.py:996] (1/4) Epoch 3, batch 10750, loss[loss=0.3091, simple_loss=0.3668, pruned_loss=0.1257, over 21426.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3399, pruned_loss=0.1053, over 4254402.35 frames. ], batch size: 159, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:16:38,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=430494.0, ans=0.0 2023-06-19 09:16:43,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.02 vs. 
limit=22.5 2023-06-19 09:16:48,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=430554.0, ans=0.125 2023-06-19 09:16:55,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=430554.0, ans=0.125 2023-06-19 09:16:57,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-19 09:17:15,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-19 09:17:19,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.011e+02 3.555e+02 4.505e+02 9.587e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-19 09:17:21,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=430614.0, ans=0.0 2023-06-19 09:17:26,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=430614.0, ans=0.0 2023-06-19 09:17:40,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=430674.0, ans=0.2 2023-06-19 09:17:47,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=430674.0, ans=0.0 2023-06-19 09:18:01,330 INFO [train.py:996] (1/4) Epoch 3, batch 10800, loss[loss=0.2993, simple_loss=0.3861, pruned_loss=0.1062, over 19835.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3474, pruned_loss=0.1063, over 4254559.77 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:18:13,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-19 09:19:08,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=430914.0, ans=0.1 2023-06-19 09:19:42,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=430974.0, ans=0.2 2023-06-19 09:19:45,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=430974.0, ans=0.0 2023-06-19 09:19:48,708 INFO [train.py:996] (1/4) Epoch 3, batch 10850, loss[loss=0.2182, simple_loss=0.2745, pruned_loss=0.08091, over 21362.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3481, pruned_loss=0.1075, over 4253720.98 frames. ], batch size: 194, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:20:12,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=431094.0, ans=0.0 2023-06-19 09:20:59,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.993e+02 3.612e+02 4.329e+02 8.050e+02, threshold=7.223e+02, percent-clipped=3.0 2023-06-19 09:21:14,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=431274.0, ans=0.0 2023-06-19 09:21:32,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. 
limit=22.5 2023-06-19 09:21:37,057 INFO [train.py:996] (1/4) Epoch 3, batch 10900, loss[loss=0.2592, simple_loss=0.3539, pruned_loss=0.08226, over 21809.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.34, pruned_loss=0.1051, over 4246932.31 frames. ], batch size: 371, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:22:02,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-19 09:22:07,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-19 09:22:21,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=431454.0, ans=15.0 2023-06-19 09:22:41,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=431454.0, ans=0.0 2023-06-19 09:22:57,644 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:23:22,695 INFO [train.py:996] (1/4) Epoch 3, batch 10950, loss[loss=0.2751, simple_loss=0.325, pruned_loss=0.1126, over 21447.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3348, pruned_loss=0.1029, over 4242192.53 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:23:33,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=431634.0, ans=0.2 2023-06-19 09:24:04,891 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:24:30,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.913e+02 3.689e+02 4.516e+02 9.090e+02, threshold=7.379e+02, percent-clipped=2.0 2023-06-19 09:25:07,351 INFO [train.py:996] (1/4) Epoch 3, batch 11000, loss[loss=0.2935, simple_loss=0.3379, pruned_loss=0.1246, over 21557.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3333, pruned_loss=0.1035, over 4246143.94 frames. ], batch size: 195, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:25:11,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431934.0, ans=0.1 2023-06-19 09:25:32,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431994.0, ans=0.1 2023-06-19 09:25:41,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=431994.0, ans=0.0 2023-06-19 09:25:58,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=432054.0, ans=0.0 2023-06-19 09:25:58,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=432054.0, ans=22.5 2023-06-19 09:26:24,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=432114.0, ans=0.125 2023-06-19 09:26:53,494 INFO [train.py:996] (1/4) Epoch 3, batch 11050, loss[loss=0.2435, simple_loss=0.295, pruned_loss=0.09601, over 21562.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3316, pruned_loss=0.1049, over 4247991.42 frames. 
], batch size: 263, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:26:58,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=432234.0, ans=0.125 2023-06-19 09:27:06,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=432234.0, ans=0.0 2023-06-19 09:27:37,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=432354.0, ans=0.125 2023-06-19 09:27:55,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.002e+02 3.534e+02 4.546e+02 1.059e+03, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 09:28:07,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=432414.0, ans=0.0 2023-06-19 09:28:27,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=432474.0, ans=0.1 2023-06-19 09:28:36,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=22.5 2023-06-19 09:28:36,738 INFO [train.py:996] (1/4) Epoch 3, batch 11100, loss[loss=0.2605, simple_loss=0.308, pruned_loss=0.1065, over 20150.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3299, pruned_loss=0.1057, over 4249032.46 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:28:40,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=432534.0, ans=0.125 2023-06-19 09:29:07,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=432594.0, ans=0.2 2023-06-19 09:29:52,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=432714.0, ans=0.125 2023-06-19 09:30:20,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=432774.0, ans=0.05 2023-06-19 09:30:23,543 INFO [train.py:996] (1/4) Epoch 3, batch 11150, loss[loss=0.3207, simple_loss=0.421, pruned_loss=0.1102, over 20911.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3292, pruned_loss=0.1054, over 4248781.59 frames. ], batch size: 608, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:30:44,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=432894.0, ans=0.125 2023-06-19 09:31:34,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.953e+02 5.206e+02 1.006e+03, threshold=7.907e+02, percent-clipped=9.0 2023-06-19 09:31:47,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-19 09:31:55,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=433074.0, ans=0.0 2023-06-19 09:31:58,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=433074.0, ans=0.0 2023-06-19 09:32:10,322 INFO [train.py:996] (1/4) Epoch 3, batch 11200, loss[loss=0.276, simple_loss=0.3318, pruned_loss=0.1101, over 21578.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3279, pruned_loss=0.1053, over 4248057.54 frames. 
], batch size: 442, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:32:24,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=433134.0, ans=0.125 2023-06-19 09:32:28,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-19 09:33:54,837 INFO [train.py:996] (1/4) Epoch 3, batch 11250, loss[loss=0.2643, simple_loss=0.3189, pruned_loss=0.1048, over 21573.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3273, pruned_loss=0.1054, over 4248669.87 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:33:55,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=433434.0, ans=0.05 2023-06-19 09:33:57,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-19 09:33:59,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=433434.0, ans=0.125 2023-06-19 09:34:15,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=433494.0, ans=0.0 2023-06-19 09:34:16,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=433494.0, ans=0.125 2023-06-19 09:34:25,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=433494.0, ans=0.0 2023-06-19 09:35:00,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.899e+02 3.491e+02 4.112e+02 6.627e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-19 09:35:11,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433614.0, ans=0.1 2023-06-19 09:35:30,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=433674.0, ans=0.125 2023-06-19 09:35:41,853 INFO [train.py:996] (1/4) Epoch 3, batch 11300, loss[loss=0.254, simple_loss=0.3089, pruned_loss=0.09955, over 21817.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3288, pruned_loss=0.1053, over 4259127.80 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:36:07,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=433794.0, ans=0.125 2023-06-19 09:36:33,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433854.0, ans=0.1 2023-06-19 09:37:19,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=433974.0, ans=0.05 2023-06-19 09:37:28,369 INFO [train.py:996] (1/4) Epoch 3, batch 11350, loss[loss=0.3436, simple_loss=0.3985, pruned_loss=0.1444, over 21754.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3289, pruned_loss=0.1038, over 4266367.91 frames. ], batch size: 124, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:37:31,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. 
limit=15.0 2023-06-19 09:38:02,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-19 09:38:03,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=434094.0, ans=0.125 2023-06-19 09:38:28,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=434154.0, ans=0.125 2023-06-19 09:38:45,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.245e+02 4.001e+02 4.934e+02 9.082e+02, threshold=8.002e+02, percent-clipped=8.0 2023-06-19 09:38:56,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434274.0, ans=0.1 2023-06-19 09:38:58,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=434274.0, ans=0.125 2023-06-19 09:39:05,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-19 09:39:13,779 INFO [train.py:996] (1/4) Epoch 3, batch 11400, loss[loss=0.3707, simple_loss=0.4289, pruned_loss=0.1563, over 21417.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3351, pruned_loss=0.1068, over 4265375.54 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:39:51,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=434394.0, ans=0.2 2023-06-19 09:40:11,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=434454.0, ans=0.04949747468305833 2023-06-19 09:40:27,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=434514.0, ans=0.04949747468305833 2023-06-19 09:40:34,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=434514.0, ans=0.125 2023-06-19 09:41:06,626 INFO [train.py:996] (1/4) Epoch 3, batch 11450, loss[loss=0.2899, simple_loss=0.3723, pruned_loss=0.1038, over 21649.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3351, pruned_loss=0.1053, over 4261484.32 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:41:08,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=434634.0, ans=0.1 2023-06-19 09:41:37,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-19 09:41:43,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=434694.0, ans=0.125 2023-06-19 09:42:19,362 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.142e+02 3.588e+02 4.835e+02 7.937e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-19 09:42:30,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=434874.0, ans=0.125 2023-06-19 09:42:36,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. 
limit=22.5 2023-06-19 09:42:53,683 INFO [train.py:996] (1/4) Epoch 3, batch 11500, loss[loss=0.2431, simple_loss=0.3274, pruned_loss=0.07936, over 21748.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3396, pruned_loss=0.1072, over 4267185.63 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:42:54,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=434934.0, ans=0.125 2023-06-19 09:42:59,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=434934.0, ans=0.125 2023-06-19 09:43:46,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-19 09:44:06,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=435114.0, ans=0.125 2023-06-19 09:44:30,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-19 09:44:46,134 INFO [train.py:996] (1/4) Epoch 3, batch 11550, loss[loss=0.2704, simple_loss=0.3519, pruned_loss=0.09443, over 21644.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3448, pruned_loss=0.1067, over 4270061.90 frames. ], batch size: 263, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:45:08,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=435294.0, ans=0.125 2023-06-19 09:45:54,466 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.996e+02 3.693e+02 5.072e+02 8.592e+02, threshold=7.387e+02, percent-clipped=2.0 2023-06-19 09:46:09,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=435474.0, ans=0.2 2023-06-19 09:46:09,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=435474.0, ans=0.0 2023-06-19 09:46:25,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-19 09:46:35,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=435474.0, ans=0.0 2023-06-19 09:46:38,537 INFO [train.py:996] (1/4) Epoch 3, batch 11600, loss[loss=0.2925, simple_loss=0.3705, pruned_loss=0.1072, over 21306.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3608, pruned_loss=0.1083, over 4271545.81 frames. 
], batch size: 159, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:46:42,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=435534.0, ans=0.125 2023-06-19 09:47:16,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=435654.0, ans=0.0 2023-06-19 09:48:12,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435774.0, ans=0.0 2023-06-19 09:48:14,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=435774.0, ans=0.0 2023-06-19 09:48:18,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=8.0 2023-06-19 09:48:24,144 INFO [train.py:996] (1/4) Epoch 3, batch 11650, loss[loss=0.3667, simple_loss=0.4183, pruned_loss=0.1576, over 21547.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3666, pruned_loss=0.1087, over 4272319.00 frames. ], batch size: 414, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:48:28,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-19 09:49:01,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-19 09:49:24,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=436014.0, ans=0.0 2023-06-19 09:49:32,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.087e+02 3.546e+02 4.429e+02 8.703e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-19 09:50:10,362 INFO [train.py:996] (1/4) Epoch 3, batch 11700, loss[loss=0.2418, simple_loss=0.2999, pruned_loss=0.09182, over 21457.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3584, pruned_loss=0.1084, over 4263814.17 frames. ], batch size: 212, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:51:19,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=436314.0, ans=0.125 2023-06-19 09:51:45,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=436374.0, ans=0.025 2023-06-19 09:51:56,339 INFO [train.py:996] (1/4) Epoch 3, batch 11750, loss[loss=0.2622, simple_loss=0.3203, pruned_loss=0.102, over 21709.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3492, pruned_loss=0.1085, over 4256233.72 frames. 
], batch size: 282, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:52:01,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=436434.0, ans=0.0 2023-06-19 09:52:26,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=436494.0, ans=0.05 2023-06-19 09:52:33,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=436554.0, ans=0.2 2023-06-19 09:53:03,204 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.275e+02 3.279e+02 3.653e+02 4.613e+02 8.659e+02, threshold=7.305e+02, percent-clipped=4.0 2023-06-19 09:53:35,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-19 09:53:41,259 INFO [train.py:996] (1/4) Epoch 3, batch 11800, loss[loss=0.327, simple_loss=0.3861, pruned_loss=0.1339, over 20688.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3505, pruned_loss=0.1114, over 4262924.32 frames. ], batch size: 607, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:55:06,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=436974.0, ans=0.5 2023-06-19 09:55:14,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=436974.0, ans=0.125 2023-06-19 09:55:32,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=436974.0, ans=0.125 2023-06-19 09:55:34,573 INFO [train.py:996] (1/4) Epoch 3, batch 11850, loss[loss=0.3092, simple_loss=0.3859, pruned_loss=0.1163, over 21686.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3527, pruned_loss=0.1105, over 4268642.65 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:56:26,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=437154.0, ans=0.125 2023-06-19 09:56:44,312 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.974e+02 3.438e+02 3.994e+02 6.906e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 09:57:01,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=437274.0, ans=0.125 2023-06-19 09:57:10,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=437274.0, ans=0.125 2023-06-19 09:57:10,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=437274.0, ans=0.2 2023-06-19 09:57:22,854 INFO [train.py:996] (1/4) Epoch 3, batch 11900, loss[loss=0.2426, simple_loss=0.3206, pruned_loss=0.08233, over 21707.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3534, pruned_loss=0.1082, over 4263176.50 frames. ], batch size: 247, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:57:51,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=437394.0, ans=0.2 2023-06-19 09:58:04,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.12 vs. 
limit=22.5 2023-06-19 09:59:11,120 INFO [train.py:996] (1/4) Epoch 3, batch 11950, loss[loss=0.2962, simple_loss=0.3942, pruned_loss=0.09904, over 21659.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3552, pruned_loss=0.1049, over 4268410.14 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 09:59:15,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-19 09:59:55,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-19 10:00:14,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-19 10:00:26,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=437814.0, ans=0.0 2023-06-19 10:00:29,697 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.768e+02 3.443e+02 4.738e+02 7.856e+02, threshold=6.886e+02, percent-clipped=6.0 2023-06-19 10:00:45,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=437874.0, ans=0.1 2023-06-19 10:00:54,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=437934.0, ans=0.125 2023-06-19 10:00:56,206 INFO [train.py:996] (1/4) Epoch 3, batch 12000, loss[loss=0.2317, simple_loss=0.2885, pruned_loss=0.08748, over 21715.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3495, pruned_loss=0.1028, over 4264829.20 frames. ], batch size: 124, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:00:56,207 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 10:01:15,354 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.279, simple_loss=0.3755, pruned_loss=0.09124, over 1796401.00 frames. 2023-06-19 10:01:15,355 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 10:01:43,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=437994.0, ans=0.125 2023-06-19 10:02:03,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=438054.0, ans=0.125 2023-06-19 10:02:17,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=438054.0, ans=0.125 2023-06-19 10:03:02,145 INFO [train.py:996] (1/4) Epoch 3, batch 12050, loss[loss=0.2385, simple_loss=0.3055, pruned_loss=0.08578, over 19957.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3462, pruned_loss=0.1054, over 4273357.96 frames. ], batch size: 702, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:03:11,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=438234.0, ans=0.0 2023-06-19 10:03:13,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-19 10:03:18,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=22.5 2023-06-19 10:03:19,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=438294.0, ans=0.125 2023-06-19 10:03:35,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=438294.0, ans=0.0 2023-06-19 10:04:06,426 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:04:18,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.247e+02 3.603e+02 4.114e+02 7.694e+02, threshold=7.207e+02, percent-clipped=1.0 2023-06-19 10:04:19,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-19 10:04:19,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.58 vs. limit=6.0 2023-06-19 10:04:24,023 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:04:49,645 INFO [train.py:996] (1/4) Epoch 3, batch 12100, loss[loss=0.3354, simple_loss=0.3884, pruned_loss=0.1412, over 21456.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3497, pruned_loss=0.11, over 4278339.26 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:04:53,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-19 10:04:57,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=438534.0, ans=0.1 2023-06-19 10:05:35,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-19 10:05:50,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=438654.0, ans=0.0 2023-06-19 10:06:28,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=438774.0, ans=0.0 2023-06-19 10:06:38,086 INFO [train.py:996] (1/4) Epoch 3, batch 12150, loss[loss=0.2785, simple_loss=0.3492, pruned_loss=0.1039, over 21401.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.355, pruned_loss=0.11, over 4273894.65 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:07:33,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=438954.0, ans=0.0 2023-06-19 10:07:55,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.586e+02 4.536e+02 5.596e+02 8.610e+02, threshold=9.073e+02, percent-clipped=5.0 2023-06-19 10:08:06,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-19 10:08:21,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=439074.0, ans=0.125 2023-06-19 10:08:33,677 INFO [train.py:996] (1/4) Epoch 3, batch 12200, loss[loss=0.2508, simple_loss=0.3061, pruned_loss=0.09772, over 21322.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3516, pruned_loss=0.1095, over 4272644.00 frames. 
], batch size: 131, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:08:34,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=439134.0, ans=0.04949747468305833 2023-06-19 10:09:05,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=439254.0, ans=0.2 2023-06-19 10:09:29,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=439314.0, ans=0.0 2023-06-19 10:09:38,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=22.5 2023-06-19 10:09:50,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=439374.0, ans=0.0 2023-06-19 10:10:12,827 INFO [train.py:996] (1/4) Epoch 3, batch 12250, loss[loss=0.1866, simple_loss=0.2585, pruned_loss=0.05733, over 21549.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3427, pruned_loss=0.1053, over 4273436.61 frames. ], batch size: 195, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:10:43,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=439494.0, ans=0.2 2023-06-19 10:11:03,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=439554.0, ans=0.2 2023-06-19 10:11:22,032 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 2.671e+02 3.353e+02 4.398e+02 1.093e+03, threshold=6.707e+02, percent-clipped=1.0 2023-06-19 10:11:54,799 INFO [train.py:996] (1/4) Epoch 3, batch 12300, loss[loss=0.3231, simple_loss=0.4008, pruned_loss=0.1227, over 21540.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3337, pruned_loss=0.09795, over 4259948.16 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:12:35,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439794.0, ans=0.1 2023-06-19 10:12:36,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-19 10:12:59,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=439914.0, ans=0.1 2023-06-19 10:13:19,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=439974.0, ans=0.0 2023-06-19 10:13:39,832 INFO [train.py:996] (1/4) Epoch 3, batch 12350, loss[loss=0.2292, simple_loss=0.3106, pruned_loss=0.07392, over 21450.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3374, pruned_loss=0.0981, over 4264727.66 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:14:04,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-19 10:14:26,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-19 10:14:53,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.837e+02 3.590e+02 4.995e+02 8.694e+02, threshold=7.180e+02, percent-clipped=5.0 2023-06-19 10:15:30,246 INFO [train.py:996] (1/4) Epoch 3, batch 12400, loss[loss=0.2929, simple_loss=0.3354, pruned_loss=0.1252, over 21223.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3387, pruned_loss=0.1022, over 4275799.10 frames. ], batch size: 143, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:16:39,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=440514.0, ans=0.125 2023-06-19 10:16:43,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=440514.0, ans=0.125 2023-06-19 10:17:17,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-06-19 10:17:23,796 INFO [train.py:996] (1/4) Epoch 3, batch 12450, loss[loss=0.2992, simple_loss=0.3744, pruned_loss=0.1119, over 21405.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3421, pruned_loss=0.1059, over 4274818.83 frames. ], batch size: 131, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:17:47,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-19 10:17:57,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=440694.0, ans=0.0 2023-06-19 10:18:00,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440694.0, ans=0.0 2023-06-19 10:18:08,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-19 10:18:17,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=440754.0, ans=0.125 2023-06-19 10:18:35,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.264e+02 3.825e+02 4.731e+02 7.932e+02, threshold=7.651e+02, percent-clipped=1.0 2023-06-19 10:19:07,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=440874.0, ans=0.125 2023-06-19 10:19:12,375 INFO [train.py:996] (1/4) Epoch 3, batch 12500, loss[loss=0.3637, simple_loss=0.4338, pruned_loss=0.1468, over 21708.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3537, pruned_loss=0.1098, over 4273045.56 frames. 
], batch size: 441, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:19:14,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=440934.0, ans=0.0 2023-06-19 10:19:39,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=440994.0, ans=0.1 2023-06-19 10:19:48,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=440994.0, ans=0.0 2023-06-19 10:21:04,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=441234.0, ans=0.07 2023-06-19 10:21:05,737 INFO [train.py:996] (1/4) Epoch 3, batch 12550, loss[loss=0.2696, simple_loss=0.3417, pruned_loss=0.09871, over 21645.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3589, pruned_loss=0.1129, over 4277341.19 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:21:08,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.97 vs. limit=10.0 2023-06-19 10:22:22,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.217e+02 3.804e+02 4.730e+02 9.875e+02, threshold=7.608e+02, percent-clipped=0.0 2023-06-19 10:22:35,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=441474.0, ans=0.125 2023-06-19 10:22:52,396 INFO [train.py:996] (1/4) Epoch 3, batch 12600, loss[loss=0.2226, simple_loss=0.2942, pruned_loss=0.07555, over 21142.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3566, pruned_loss=0.1091, over 4266856.09 frames. ], batch size: 143, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:23:01,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=441534.0, ans=0.0 2023-06-19 10:23:21,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=441594.0, ans=0.0 2023-06-19 10:24:26,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=441774.0, ans=0.125 2023-06-19 10:24:29,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-19 10:24:32,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=441774.0, ans=0.0 2023-06-19 10:24:35,190 INFO [train.py:996] (1/4) Epoch 3, batch 12650, loss[loss=0.2652, simple_loss=0.3158, pruned_loss=0.1073, over 21489.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3479, pruned_loss=0.1044, over 4274738.04 frames. ], batch size: 194, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:24:48,590 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:25:30,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. 
limit=10.0 2023-06-19 10:25:34,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=441954.0, ans=0.0 2023-06-19 10:25:47,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.848e+02 3.430e+02 4.434e+02 6.952e+02, threshold=6.860e+02, percent-clipped=1.0 2023-06-19 10:25:47,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=442014.0, ans=0.0 2023-06-19 10:26:17,189 INFO [train.py:996] (1/4) Epoch 3, batch 12700, loss[loss=0.3586, simple_loss=0.4124, pruned_loss=0.1524, over 21791.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3503, pruned_loss=0.1087, over 4282836.60 frames. ], batch size: 124, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:26:40,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442194.0, ans=0.1 2023-06-19 10:27:08,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=442254.0, ans=0.125 2023-06-19 10:27:25,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442314.0, ans=0.0 2023-06-19 10:27:34,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-19 10:28:03,205 INFO [train.py:996] (1/4) Epoch 3, batch 12750, loss[loss=0.2463, simple_loss=0.3214, pruned_loss=0.08566, over 21201.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3515, pruned_loss=0.1095, over 4281270.40 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:28:17,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=442434.0, ans=0.0 2023-06-19 10:28:52,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=442554.0, ans=0.125 2023-06-19 10:28:59,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=442554.0, ans=0.07 2023-06-19 10:29:13,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=442614.0, ans=0.1 2023-06-19 10:29:16,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 3.021e+02 3.668e+02 4.359e+02 6.708e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 10:29:46,091 INFO [train.py:996] (1/4) Epoch 3, batch 12800, loss[loss=0.3332, simple_loss=0.3818, pruned_loss=0.1423, over 21522.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3505, pruned_loss=0.1107, over 4290214.37 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:30:20,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=442794.0, ans=0.0 2023-06-19 10:30:22,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442794.0, ans=0.1 2023-06-19 10:30:38,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.20 vs. 
limit=6.0 2023-06-19 10:31:11,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-19 10:31:18,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=442974.0, ans=0.125 2023-06-19 10:31:21,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=442974.0, ans=0.125 2023-06-19 10:31:38,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=443034.0, ans=0.2 2023-06-19 10:31:40,098 INFO [train.py:996] (1/4) Epoch 3, batch 12850, loss[loss=0.2627, simple_loss=0.3477, pruned_loss=0.08879, over 21847.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3545, pruned_loss=0.1138, over 4293456.47 frames. ], batch size: 371, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:32:00,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-19 10:32:14,182 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:32:48,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.077e+02 3.595e+02 4.462e+02 6.932e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-19 10:33:23,078 INFO [train.py:996] (1/4) Epoch 3, batch 12900, loss[loss=0.2748, simple_loss=0.3571, pruned_loss=0.09628, over 21718.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.35, pruned_loss=0.1083, over 4285553.01 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:33:45,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=443394.0, ans=0.125 2023-06-19 10:34:42,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=443514.0, ans=0.1 2023-06-19 10:35:09,438 INFO [train.py:996] (1/4) Epoch 3, batch 12950, loss[loss=0.3436, simple_loss=0.3931, pruned_loss=0.1471, over 21433.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3489, pruned_loss=0.1059, over 4282037.88 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:35:11,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=443634.0, ans=0.2 2023-06-19 10:35:14,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=443634.0, ans=0.2 2023-06-19 10:35:40,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-19 10:36:11,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=443814.0, ans=0.125 2023-06-19 10:36:23,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.936e+02 3.406e+02 4.166e+02 8.200e+02, threshold=6.811e+02, percent-clipped=3.0 2023-06-19 10:36:46,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=443874.0, ans=0.125 2023-06-19 10:36:52,651 INFO [train.py:996] (1/4) Epoch 3, batch 13000, loss[loss=0.2213, simple_loss=0.3019, pruned_loss=0.07031, over 21834.00 frames. 
], tot_loss[loss=0.2816, simple_loss=0.3509, pruned_loss=0.1062, over 4271108.47 frames. ], batch size: 317, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:37:01,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-19 10:37:23,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-19 10:38:29,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=444174.0, ans=0.04949747468305833 2023-06-19 10:38:35,606 INFO [train.py:996] (1/4) Epoch 3, batch 13050, loss[loss=0.3104, simple_loss=0.3606, pruned_loss=0.1301, over 21861.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3462, pruned_loss=0.1032, over 4271408.94 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:39:04,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=444294.0, ans=0.05 2023-06-19 10:39:13,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=444354.0, ans=0.125 2023-06-19 10:39:34,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=444414.0, ans=10.0 2023-06-19 10:39:49,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.662e+02 3.344e+02 3.864e+02 6.973e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-19 10:40:12,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=444474.0, ans=0.0 2023-06-19 10:40:20,621 INFO [train.py:996] (1/4) Epoch 3, batch 13100, loss[loss=0.3001, simple_loss=0.363, pruned_loss=0.1186, over 21818.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3489, pruned_loss=0.1045, over 4277120.40 frames. ], batch size: 118, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:40:37,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=444534.0, ans=10.0 2023-06-19 10:40:41,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=444594.0, ans=0.125 2023-06-19 10:40:49,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=444594.0, ans=0.1 2023-06-19 10:40:54,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=444594.0, ans=0.125 2023-06-19 10:41:45,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-19 10:42:09,719 INFO [train.py:996] (1/4) Epoch 3, batch 13150, loss[loss=0.2406, simple_loss=0.3013, pruned_loss=0.08996, over 21147.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3515, pruned_loss=0.109, over 4280266.52 frames. 
], batch size: 159, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:43:13,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=444954.0, ans=0.125 2023-06-19 10:43:20,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=445014.0, ans=0.2 2023-06-19 10:43:24,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 3.199e+02 3.918e+02 5.152e+02 8.520e+02, threshold=7.837e+02, percent-clipped=11.0 2023-06-19 10:43:37,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=445074.0, ans=0.125 2023-06-19 10:43:55,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-19 10:43:55,530 INFO [train.py:996] (1/4) Epoch 3, batch 13200, loss[loss=0.3607, simple_loss=0.4019, pruned_loss=0.1598, over 21426.00 frames. ], tot_loss[loss=0.284, simple_loss=0.349, pruned_loss=0.1095, over 4265190.23 frames. ], batch size: 471, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:44:17,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=445194.0, ans=0.0 2023-06-19 10:45:17,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=445314.0, ans=0.0 2023-06-19 10:45:38,761 INFO [train.py:996] (1/4) Epoch 3, batch 13250, loss[loss=0.2989, simple_loss=0.3703, pruned_loss=0.1137, over 20070.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3488, pruned_loss=0.1101, over 4264281.18 frames. ], batch size: 702, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:46:59,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.966e+02 3.440e+02 4.161e+02 7.151e+02, threshold=6.880e+02, percent-clipped=0.0 2023-06-19 10:47:12,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=445674.0, ans=0.125 2023-06-19 10:47:16,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=445674.0, ans=0.125 2023-06-19 10:47:29,753 INFO [train.py:996] (1/4) Epoch 3, batch 13300, loss[loss=0.2947, simple_loss=0.3556, pruned_loss=0.1169, over 21443.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3507, pruned_loss=0.1087, over 4268474.74 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:47:56,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=445794.0, ans=0.125 2023-06-19 10:48:02,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=445794.0, ans=0.0 2023-06-19 10:48:18,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445854.0, ans=0.1 2023-06-19 10:48:44,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=445914.0, ans=0.2 2023-06-19 10:49:13,522 INFO [train.py:996] (1/4) Epoch 3, batch 13350, loss[loss=0.2406, simple_loss=0.3031, pruned_loss=0.08908, over 21257.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3558, pruned_loss=0.1117, over 4269956.75 frames. 
], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:49:14,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=446034.0, ans=0.07 2023-06-19 10:49:31,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=446034.0, ans=0.125 2023-06-19 10:49:33,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=446094.0, ans=0.125 2023-06-19 10:50:20,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 3.225e+02 3.868e+02 4.526e+02 7.710e+02, threshold=7.735e+02, percent-clipped=4.0 2023-06-19 10:50:55,967 INFO [train.py:996] (1/4) Epoch 3, batch 13400, loss[loss=0.3203, simple_loss=0.3723, pruned_loss=0.1342, over 21598.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3576, pruned_loss=0.1133, over 4267893.01 frames. ], batch size: 471, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:52:30,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-19 10:52:46,472 INFO [train.py:996] (1/4) Epoch 3, batch 13450, loss[loss=0.2929, simple_loss=0.3449, pruned_loss=0.1205, over 21820.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3579, pruned_loss=0.116, over 4271286.66 frames. ], batch size: 118, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:52:51,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=446634.0, ans=0.125 2023-06-19 10:52:55,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=446634.0, ans=0.0 2023-06-19 10:53:07,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=446694.0, ans=0.2 2023-06-19 10:53:43,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=446814.0, ans=0.1 2023-06-19 10:53:56,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.091e+02 3.617e+02 4.599e+02 7.916e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-19 10:54:10,589 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:54:11,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-19 10:54:31,171 INFO [train.py:996] (1/4) Epoch 3, batch 13500, loss[loss=0.2762, simple_loss=0.344, pruned_loss=0.1042, over 21735.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.344, pruned_loss=0.1104, over 4267877.93 frames. ], batch size: 391, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:55:26,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=447054.0, ans=0.0 2023-06-19 10:55:42,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=447114.0, ans=0.0 2023-06-19 10:56:14,911 INFO [train.py:996] (1/4) Epoch 3, batch 13550, loss[loss=0.3245, simple_loss=0.4231, pruned_loss=0.1129, over 21276.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3488, pruned_loss=0.11, over 4268716.18 frames. 
], batch size: 548, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:56:25,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-19 10:56:53,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-19 10:57:36,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.584e+02 4.667e+02 1.065e+03, threshold=7.167e+02, percent-clipped=1.0 2023-06-19 10:57:53,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=447474.0, ans=0.0 2023-06-19 10:58:03,818 INFO [train.py:996] (1/4) Epoch 3, batch 13600, loss[loss=0.3511, simple_loss=0.4151, pruned_loss=0.1436, over 20754.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3508, pruned_loss=0.1111, over 4260802.28 frames. ], batch size: 607, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:58:13,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=447534.0, ans=0.125 2023-06-19 10:59:16,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=15.0 2023-06-19 10:59:45,712 INFO [train.py:996] (1/4) Epoch 3, batch 13650, loss[loss=0.2446, simple_loss=0.3063, pruned_loss=0.09147, over 21753.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3462, pruned_loss=0.1073, over 4262739.19 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:00:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=447894.0, ans=0.125 2023-06-19 11:00:03,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=447894.0, ans=0.2 2023-06-19 11:00:59,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=448014.0, ans=0.05 2023-06-19 11:01:01,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-19 11:01:01,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 3.087e+02 4.228e+02 5.339e+02 1.090e+03, threshold=8.456e+02, percent-clipped=10.0 2023-06-19 11:01:28,409 INFO [train.py:996] (1/4) Epoch 3, batch 13700, loss[loss=0.2737, simple_loss=0.3261, pruned_loss=0.1106, over 21627.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.34, pruned_loss=0.1071, over 4263509.00 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:01:52,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=448194.0, ans=0.0 2023-06-19 11:03:01,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=448374.0, ans=0.015 2023-06-19 11:03:11,599 INFO [train.py:996] (1/4) Epoch 3, batch 13750, loss[loss=0.1924, simple_loss=0.2468, pruned_loss=0.06902, over 21795.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3364, pruned_loss=0.1046, over 4263110.83 frames. 
], batch size: 118, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:03:15,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=448434.0, ans=0.125 2023-06-19 11:03:49,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=448494.0, ans=0.0 2023-06-19 11:04:34,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.276e+02 4.043e+02 5.146e+02 9.090e+02, threshold=8.085e+02, percent-clipped=5.0 2023-06-19 11:04:51,628 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:04:56,201 INFO [train.py:996] (1/4) Epoch 3, batch 13800, loss[loss=0.2865, simple_loss=0.373, pruned_loss=0.1, over 21609.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3443, pruned_loss=0.104, over 4268538.24 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:05:07,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=448734.0, ans=0.0 2023-06-19 11:05:35,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=448794.0, ans=0.2 2023-06-19 11:06:03,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-19 11:06:18,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=448914.0, ans=0.0 2023-06-19 11:06:39,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-19 11:06:50,091 INFO [train.py:996] (1/4) Epoch 3, batch 13850, loss[loss=0.2782, simple_loss=0.3524, pruned_loss=0.102, over 21802.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3504, pruned_loss=0.1052, over 4275019.89 frames. ], batch size: 282, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:06:58,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449034.0, ans=0.1 2023-06-19 11:07:20,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449094.0, ans=0.125 2023-06-19 11:07:50,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=449214.0, ans=0.09899494936611666 2023-06-19 11:08:00,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=449214.0, ans=0.125 2023-06-19 11:08:01,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.934e+02 3.600e+02 4.326e+02 8.652e+02, threshold=7.199e+02, percent-clipped=1.0 2023-06-19 11:08:32,510 INFO [train.py:996] (1/4) Epoch 3, batch 13900, loss[loss=0.298, simple_loss=0.3597, pruned_loss=0.1181, over 21509.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3549, pruned_loss=0.1096, over 4281168.36 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:08:59,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.97 vs. 
limit=10.0 2023-06-19 11:09:15,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=449454.0, ans=0.2 2023-06-19 11:09:32,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=449514.0, ans=0.125 2023-06-19 11:09:37,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449514.0, ans=0.1 2023-06-19 11:10:15,203 INFO [train.py:996] (1/4) Epoch 3, batch 13950, loss[loss=0.2665, simple_loss=0.3398, pruned_loss=0.0966, over 21888.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3556, pruned_loss=0.1117, over 4283336.48 frames. ], batch size: 118, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:10:57,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=449754.0, ans=0.1 2023-06-19 11:11:28,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=449814.0, ans=0.05 2023-06-19 11:11:30,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 3.017e+02 3.528e+02 4.292e+02 7.550e+02, threshold=7.057e+02, percent-clipped=1.0 2023-06-19 11:11:55,746 INFO [train.py:996] (1/4) Epoch 3, batch 14000, loss[loss=0.1664, simple_loss=0.2279, pruned_loss=0.05245, over 16094.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3494, pruned_loss=0.108, over 4273269.00 frames. ], batch size: 61, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:12:11,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=449934.0, ans=0.125 2023-06-19 11:12:38,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450054.0, ans=0.125 2023-06-19 11:12:39,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-19 11:12:45,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=450054.0, ans=0.2 2023-06-19 11:13:05,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-19 11:13:16,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=450174.0, ans=0.2 2023-06-19 11:13:30,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-06-19 11:13:36,911 INFO [train.py:996] (1/4) Epoch 3, batch 14050, loss[loss=0.2726, simple_loss=0.3206, pruned_loss=0.1123, over 21281.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3441, pruned_loss=0.1034, over 4265486.05 frames. ], batch size: 471, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:14:10,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=450294.0, ans=0.0 2023-06-19 11:14:17,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=15.0 2023-06-19 11:14:46,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=450414.0, ans=0.125 2023-06-19 11:14:54,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.755e+02 3.797e+02 5.310e+02 9.461e+02, threshold=7.595e+02, percent-clipped=8.0 2023-06-19 11:15:06,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-19 11:15:17,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 11:15:24,577 INFO [train.py:996] (1/4) Epoch 3, batch 14100, loss[loss=0.2858, simple_loss=0.3401, pruned_loss=0.1157, over 21864.00 frames. ], tot_loss[loss=0.274, simple_loss=0.341, pruned_loss=0.1035, over 4256980.54 frames. ], batch size: 372, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:15:53,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=450594.0, ans=0.0 2023-06-19 11:16:00,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-19 11:16:04,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=450654.0, ans=0.04949747468305833 2023-06-19 11:16:07,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=450654.0, ans=0.0 2023-06-19 11:16:30,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=450714.0, ans=0.2 2023-06-19 11:16:32,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=450714.0, ans=0.07 2023-06-19 11:16:48,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450774.0, ans=0.1 2023-06-19 11:16:53,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-06-19 11:16:58,760 INFO [train.py:996] (1/4) Epoch 3, batch 14150, loss[loss=0.2809, simple_loss=0.3607, pruned_loss=0.1006, over 21817.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3438, pruned_loss=0.1054, over 4256126.16 frames. ], batch size: 316, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:18:14,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.701e+02 3.295e+02 4.437e+02 7.217e+02, threshold=6.589e+02, percent-clipped=0.0 2023-06-19 11:18:14,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451014.0, ans=0.1 2023-06-19 11:18:38,223 INFO [train.py:996] (1/4) Epoch 3, batch 14200, loss[loss=0.2578, simple_loss=0.312, pruned_loss=0.1018, over 21637.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3407, pruned_loss=0.1035, over 4258406.60 frames. 
], batch size: 298, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:18:56,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=451194.0, ans=0.2 2023-06-19 11:19:06,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=451194.0, ans=0.0 2023-06-19 11:19:20,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=451254.0, ans=0.1 2023-06-19 11:19:23,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-06-19 11:19:26,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-19 11:19:27,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=451254.0, ans=0.0 2023-06-19 11:19:42,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=451314.0, ans=0.125 2023-06-19 11:19:52,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-19 11:20:20,537 INFO [train.py:996] (1/4) Epoch 3, batch 14250, loss[loss=0.2719, simple_loss=0.3226, pruned_loss=0.1106, over 21847.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3356, pruned_loss=0.1032, over 4263275.37 frames. ], batch size: 98, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:20:21,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=451434.0, ans=0.0 2023-06-19 11:20:27,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=451434.0, ans=0.125 2023-06-19 11:21:17,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=451554.0, ans=0.2 2023-06-19 11:21:39,199 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.808e+02 3.716e+02 4.608e+02 1.130e+03, threshold=7.432e+02, percent-clipped=7.0 2023-06-19 11:21:41,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=451674.0, ans=0.125 2023-06-19 11:22:04,555 INFO [train.py:996] (1/4) Epoch 3, batch 14300, loss[loss=0.3161, simple_loss=0.3549, pruned_loss=0.1386, over 20273.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3399, pruned_loss=0.104, over 4266528.29 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:22:23,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=451794.0, ans=0.1 2023-06-19 11:22:59,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=451854.0, ans=0.1 2023-06-19 11:23:28,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=451974.0, ans=0.0 2023-06-19 11:23:46,798 INFO [train.py:996] (1/4) Epoch 3, batch 14350, loss[loss=0.2954, simple_loss=0.3582, pruned_loss=0.1163, over 21835.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3433, pruned_loss=0.1039, over 4260242.91 frames. 
], batch size: 124, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:25:04,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 3.074e+02 3.738e+02 4.858e+02 1.364e+03, threshold=7.476e+02, percent-clipped=9.0 2023-06-19 11:25:12,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=452274.0, ans=0.0 2023-06-19 11:25:28,326 INFO [train.py:996] (1/4) Epoch 3, batch 14400, loss[loss=0.3453, simple_loss=0.3747, pruned_loss=0.1579, over 21836.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3411, pruned_loss=0.1043, over 4260125.42 frames. ], batch size: 441, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:25:38,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=452334.0, ans=0.0 2023-06-19 11:26:26,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452514.0, ans=0.0 2023-06-19 11:27:09,079 INFO [train.py:996] (1/4) Epoch 3, batch 14450, loss[loss=0.3013, simple_loss=0.3245, pruned_loss=0.139, over 21414.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.335, pruned_loss=0.1052, over 4270023.38 frames. ], batch size: 509, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:27:56,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=452754.0, ans=0.125 2023-06-19 11:28:26,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.178e+02 3.835e+02 4.854e+02 8.477e+02, threshold=7.671e+02, percent-clipped=4.0 2023-06-19 11:28:51,117 INFO [train.py:996] (1/4) Epoch 3, batch 14500, loss[loss=0.2775, simple_loss=0.3485, pruned_loss=0.1032, over 21211.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3314, pruned_loss=0.1048, over 4272444.31 frames. ], batch size: 159, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:29:27,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=452994.0, ans=0.0 2023-06-19 11:29:59,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=453114.0, ans=0.025 2023-06-19 11:30:17,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453174.0, ans=0.125 2023-06-19 11:30:34,329 INFO [train.py:996] (1/4) Epoch 3, batch 14550, loss[loss=0.2388, simple_loss=0.3129, pruned_loss=0.08228, over 21397.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3383, pruned_loss=0.1068, over 4270892.66 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:31:16,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=15.0 2023-06-19 11:31:57,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.259e+02 4.086e+02 5.797e+02 9.548e+02, threshold=8.171e+02, percent-clipped=6.0 2023-06-19 11:31:57,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=453414.0, ans=0.0 2023-06-19 11:32:09,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=453474.0, ans=0.04949747468305833 2023-06-19 11:32:16,940 INFO [train.py:996] (1/4) Epoch 3, batch 14600, loss[loss=0.2508, simple_loss=0.3035, pruned_loss=0.09902, over 21231.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3462, pruned_loss=0.1113, over 4272821.95 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:32:53,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=453594.0, ans=0.125 2023-06-19 11:33:03,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=453654.0, ans=0.125 2023-06-19 11:33:58,765 INFO [train.py:996] (1/4) Epoch 3, batch 14650, loss[loss=0.2092, simple_loss=0.2827, pruned_loss=0.06786, over 21775.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3462, pruned_loss=0.1092, over 4276995.94 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:34:10,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453834.0, ans=0.1 2023-06-19 11:34:45,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=453954.0, ans=10.0 2023-06-19 11:34:59,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=15.0 2023-06-19 11:35:24,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 2.524e+02 3.223e+02 4.029e+02 6.900e+02, threshold=6.446e+02, percent-clipped=0.0 2023-06-19 11:35:25,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=454014.0, ans=0.0 2023-06-19 11:35:50,384 INFO [train.py:996] (1/4) Epoch 3, batch 14700, loss[loss=0.2227, simple_loss=0.2848, pruned_loss=0.08036, over 16221.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.339, pruned_loss=0.1022, over 4269942.49 frames. ], batch size: 60, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:36:10,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-19 11:37:38,428 INFO [train.py:996] (1/4) Epoch 3, batch 14750, loss[loss=0.3586, simple_loss=0.4228, pruned_loss=0.1472, over 21760.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3429, pruned_loss=0.1047, over 4264498.18 frames. 
], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:38:19,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=454494.0, ans=0.125 2023-06-19 11:38:32,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=454554.0, ans=0.0 2023-06-19 11:38:51,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 3.078e+02 3.855e+02 4.833e+02 8.936e+02, threshold=7.710e+02, percent-clipped=7.0 2023-06-19 11:38:52,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.94 vs. limit=22.5 2023-06-19 11:39:10,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=454674.0, ans=0.5 2023-06-19 11:39:10,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-19 11:39:20,979 INFO [train.py:996] (1/4) Epoch 3, batch 14800, loss[loss=0.3026, simple_loss=0.356, pruned_loss=0.1246, over 21653.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3587, pruned_loss=0.1136, over 4271357.19 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:40:01,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=454794.0, ans=0.125 2023-06-19 11:40:02,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=454854.0, ans=0.125 2023-06-19 11:40:07,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=454854.0, ans=0.2 2023-06-19 11:40:55,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=454974.0, ans=0.125 2023-06-19 11:41:04,762 INFO [train.py:996] (1/4) Epoch 3, batch 14850, loss[loss=0.3133, simple_loss=0.3717, pruned_loss=0.1275, over 21539.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3533, pruned_loss=0.1138, over 4265098.11 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:41:31,902 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:41:40,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=455094.0, ans=0.1 2023-06-19 11:42:00,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=455154.0, ans=0.1 2023-06-19 11:42:23,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.130e+02 3.900e+02 4.785e+02 9.691e+02, threshold=7.799e+02, percent-clipped=2.0 2023-06-19 11:42:28,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=455274.0, ans=0.04949747468305833 2023-06-19 11:42:53,917 INFO [train.py:996] (1/4) Epoch 3, batch 14900, loss[loss=0.3089, simple_loss=0.3645, pruned_loss=0.1266, over 21180.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3565, pruned_loss=0.1148, over 4265991.47 frames. 
], batch size: 143, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:43:04,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=455334.0, ans=0.125 2023-06-19 11:43:09,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=455394.0, ans=0.07 2023-06-19 11:44:36,794 INFO [train.py:996] (1/4) Epoch 3, batch 14950, loss[loss=0.2791, simple_loss=0.3615, pruned_loss=0.09835, over 21234.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.3582, pruned_loss=0.1155, over 4262708.37 frames. ], batch size: 549, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:44:38,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=455634.0, ans=0.015 2023-06-19 11:45:27,317 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:45:55,109 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.136e+02 3.778e+02 4.659e+02 7.505e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-19 11:46:20,050 INFO [train.py:996] (1/4) Epoch 3, batch 15000, loss[loss=0.2572, simple_loss=0.3155, pruned_loss=0.09942, over 21765.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3605, pruned_loss=0.1176, over 4266485.05 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:46:20,051 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 11:46:36,888 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2722, simple_loss=0.3734, pruned_loss=0.08553, over 1796401.00 frames. 2023-06-19 11:46:36,889 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 11:47:04,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=455994.0, ans=0.125 2023-06-19 11:47:16,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=455994.0, ans=0.125 2023-06-19 11:47:38,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=456054.0, ans=0.125 2023-06-19 11:47:38,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=456054.0, ans=0.0 2023-06-19 11:47:39,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=456054.0, ans=0.125 2023-06-19 11:48:26,272 INFO [train.py:996] (1/4) Epoch 3, batch 15050, loss[loss=0.2423, simple_loss=0.3234, pruned_loss=0.08065, over 21372.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3602, pruned_loss=0.1174, over 4265579.02 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:49:29,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=456414.0, ans=0.125 2023-06-19 11:49:43,737 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.179e+02 3.857e+02 4.850e+02 8.474e+02, threshold=7.714e+02, percent-clipped=3.0 2023-06-19 11:50:08,209 INFO [train.py:996] (1/4) Epoch 3, batch 15100, loss[loss=0.3339, simple_loss=0.4078, pruned_loss=0.13, over 19850.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3625, pruned_loss=0.1163, over 4267428.15 frames. 
], batch size: 702, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:50:15,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=456534.0, ans=0.125 2023-06-19 11:50:47,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=456594.0, ans=0.1 2023-06-19 11:50:52,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=456654.0, ans=10.0 2023-06-19 11:51:56,386 INFO [train.py:996] (1/4) Epoch 3, batch 15150, loss[loss=0.2556, simple_loss=0.3114, pruned_loss=0.09989, over 21829.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.359, pruned_loss=0.1162, over 4260784.97 frames. ], batch size: 372, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:52:43,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=456954.0, ans=0.0 2023-06-19 11:52:58,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=457014.0, ans=0.0 2023-06-19 11:53:13,799 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.143e+02 3.638e+02 4.416e+02 6.792e+02, threshold=7.275e+02, percent-clipped=0.0 2023-06-19 11:53:18,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=457074.0, ans=0.125 2023-06-19 11:53:38,823 INFO [train.py:996] (1/4) Epoch 3, batch 15200, loss[loss=0.2498, simple_loss=0.3171, pruned_loss=0.09123, over 21357.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3482, pruned_loss=0.1114, over 4264762.17 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:54:09,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=457194.0, ans=0.125 2023-06-19 11:54:32,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-19 11:55:09,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-19 11:55:16,102 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:55:20,845 INFO [train.py:996] (1/4) Epoch 3, batch 15250, loss[loss=0.226, simple_loss=0.2951, pruned_loss=0.07849, over 21327.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3426, pruned_loss=0.1098, over 4274887.23 frames. 
], batch size: 211, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:55:47,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=457494.0, ans=0.125 2023-06-19 11:55:47,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457494.0, ans=0.125 2023-06-19 11:56:03,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=457554.0, ans=0.1 2023-06-19 11:56:09,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=457554.0, ans=0.0 2023-06-19 11:56:17,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457554.0, ans=0.1 2023-06-19 11:56:37,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-19 11:56:43,900 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.938e+02 3.532e+02 4.223e+02 6.837e+02, threshold=7.064e+02, percent-clipped=0.0 2023-06-19 11:57:08,060 INFO [train.py:996] (1/4) Epoch 3, batch 15300, loss[loss=0.2851, simple_loss=0.3406, pruned_loss=0.1148, over 21704.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3449, pruned_loss=0.1124, over 4270570.83 frames. ], batch size: 298, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:57:25,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-19 11:58:05,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=457914.0, ans=0.1 2023-06-19 11:58:27,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=457974.0, ans=0.2 2023-06-19 11:58:38,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=457974.0, ans=0.125 2023-06-19 11:58:44,340 INFO [train.py:996] (1/4) Epoch 3, batch 15350, loss[loss=0.2707, simple_loss=0.3636, pruned_loss=0.08891, over 21659.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3504, pruned_loss=0.1144, over 4275288.49 frames. ], batch size: 263, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:59:23,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=458094.0, ans=0.1 2023-06-19 11:59:54,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.004e+02 3.619e+02 4.702e+02 1.047e+03, threshold=7.238e+02, percent-clipped=6.0 2023-06-19 12:00:24,061 INFO [train.py:996] (1/4) Epoch 3, batch 15400, loss[loss=0.2551, simple_loss=0.3136, pruned_loss=0.09833, over 21225.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3496, pruned_loss=0.1118, over 4271812.27 frames. 
], batch size: 143, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:00:56,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=458394.0, ans=0.2 2023-06-19 12:01:22,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=458514.0, ans=0.2 2023-06-19 12:01:35,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=458514.0, ans=0.0 2023-06-19 12:02:06,909 INFO [train.py:996] (1/4) Epoch 3, batch 15450, loss[loss=0.2844, simple_loss=0.3405, pruned_loss=0.1142, over 21896.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3459, pruned_loss=0.1103, over 4249390.39 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:02:08,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=458634.0, ans=0.125 2023-06-19 12:02:25,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458634.0, ans=0.125 2023-06-19 12:03:23,948 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.889e+02 3.450e+02 3.978e+02 6.262e+02, threshold=6.899e+02, percent-clipped=0.0 2023-06-19 12:03:54,353 INFO [train.py:996] (1/4) Epoch 3, batch 15500, loss[loss=0.2956, simple_loss=0.3557, pruned_loss=0.1177, over 21758.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3481, pruned_loss=0.1099, over 4241023.26 frames. ], batch size: 247, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:04:03,342 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-19 12:05:04,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=459114.0, ans=0.0 2023-06-19 12:05:11,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-19 12:05:17,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=459174.0, ans=0.125 2023-06-19 12:05:26,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=459174.0, ans=0.0 2023-06-19 12:05:30,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=459174.0, ans=0.125 2023-06-19 12:05:36,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=459234.0, ans=0.0 2023-06-19 12:05:37,218 INFO [train.py:996] (1/4) Epoch 3, batch 15550, loss[loss=0.2404, simple_loss=0.2925, pruned_loss=0.09413, over 21845.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.347, pruned_loss=0.1082, over 4247437.98 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:05:59,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=459294.0, ans=0.125 2023-06-19 12:06:17,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. 
limit=15.0 2023-06-19 12:06:39,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=22.5 2023-06-19 12:06:40,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459414.0, ans=0.125 2023-06-19 12:06:54,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.997e+02 3.470e+02 4.241e+02 8.422e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-19 12:07:02,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=459474.0, ans=0.125 2023-06-19 12:07:18,535 INFO [train.py:996] (1/4) Epoch 3, batch 15600, loss[loss=0.2782, simple_loss=0.3577, pruned_loss=0.09932, over 21668.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3434, pruned_loss=0.1073, over 4248755.65 frames. ], batch size: 332, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:07:37,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-19 12:08:10,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-19 12:08:43,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=459774.0, ans=0.125 2023-06-19 12:09:06,277 INFO [train.py:996] (1/4) Epoch 3, batch 15650, loss[loss=0.2695, simple_loss=0.3376, pruned_loss=0.1007, over 21256.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.343, pruned_loss=0.1067, over 4245369.26 frames. ], batch size: 176, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:09:10,004 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:09:27,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=459894.0, ans=0.2 2023-06-19 12:10:22,753 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.856e+02 3.445e+02 4.569e+02 7.529e+02, threshold=6.891e+02, percent-clipped=2.0 2023-06-19 12:10:28,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=460074.0, ans=0.125 2023-06-19 12:10:47,550 INFO [train.py:996] (1/4) Epoch 3, batch 15700, loss[loss=0.2756, simple_loss=0.351, pruned_loss=0.1001, over 21535.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3375, pruned_loss=0.1059, over 4249181.00 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:10:52,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=460134.0, ans=0.125 2023-06-19 12:12:28,102 INFO [train.py:996] (1/4) Epoch 3, batch 15750, loss[loss=0.2897, simple_loss=0.3318, pruned_loss=0.1238, over 22002.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3337, pruned_loss=0.1057, over 4243702.46 frames. ], batch size: 103, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:12:30,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. 
limit=15.0 2023-06-19 12:12:39,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=460434.0, ans=0.04949747468305833 2023-06-19 12:13:46,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.862e+02 3.455e+02 4.105e+02 6.683e+02, threshold=6.910e+02, percent-clipped=1.0 2023-06-19 12:14:09,465 INFO [train.py:996] (1/4) Epoch 3, batch 15800, loss[loss=0.2898, simple_loss=0.3402, pruned_loss=0.1196, over 21454.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3284, pruned_loss=0.1048, over 4253535.68 frames. ], batch size: 131, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:14:38,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-06-19 12:15:03,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=460854.0, ans=0.1 2023-06-19 12:15:31,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-19 12:15:35,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=460974.0, ans=0.125 2023-06-19 12:15:49,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460974.0, ans=0.0 2023-06-19 12:15:52,337 INFO [train.py:996] (1/4) Epoch 3, batch 15850, loss[loss=0.2302, simple_loss=0.2966, pruned_loss=0.08183, over 21352.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3314, pruned_loss=0.1068, over 4262028.70 frames. ], batch size: 131, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:15:52,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=461034.0, ans=0.125 2023-06-19 12:17:03,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.41 vs. limit=10.0 2023-06-19 12:17:10,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.908e+02 3.639e+02 4.203e+02 7.869e+02, threshold=7.277e+02, percent-clipped=1.0 2023-06-19 12:17:26,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461274.0, ans=0.125 2023-06-19 12:17:35,035 INFO [train.py:996] (1/4) Epoch 3, batch 15900, loss[loss=0.2781, simple_loss=0.3429, pruned_loss=0.1066, over 21820.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3278, pruned_loss=0.106, over 4262534.43 frames. 
], batch size: 118, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:17:53,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=461394.0, ans=0.125 2023-06-19 12:17:56,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=461394.0, ans=0.1 2023-06-19 12:18:06,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=461454.0, ans=0.0 2023-06-19 12:18:06,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=461454.0, ans=0.2 2023-06-19 12:18:53,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461514.0, ans=0.1 2023-06-19 12:19:17,492 INFO [train.py:996] (1/4) Epoch 3, batch 15950, loss[loss=0.2295, simple_loss=0.311, pruned_loss=0.07395, over 21785.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3277, pruned_loss=0.1026, over 4257034.26 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:19:34,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=461694.0, ans=0.0 2023-06-19 12:20:36,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.793e+02 3.209e+02 4.219e+02 1.070e+03, threshold=6.418e+02, percent-clipped=5.0 2023-06-19 12:20:50,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-19 12:20:53,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=461874.0, ans=0.1 2023-06-19 12:20:57,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=461874.0, ans=0.0 2023-06-19 12:20:59,879 INFO [train.py:996] (1/4) Epoch 3, batch 16000, loss[loss=0.2051, simple_loss=0.2843, pruned_loss=0.06296, over 21889.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3295, pruned_loss=0.1011, over 4245016.41 frames. ], batch size: 124, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:21:11,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461934.0, ans=0.125 2023-06-19 12:21:28,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-19 12:21:31,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=462054.0, ans=0.125 2023-06-19 12:21:37,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=462054.0, ans=0.125 2023-06-19 12:22:36,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=462174.0, ans=0.0 2023-06-19 12:22:36,094 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:22:36,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-19 12:22:42,079 INFO [train.py:996] (1/4) Epoch 3, batch 16050, loss[loss=0.2959, simple_loss=0.3751, pruned_loss=0.1083, over 21448.00 frames. 
], tot_loss[loss=0.2661, simple_loss=0.3332, pruned_loss=0.09952, over 4245434.62 frames. ], batch size: 194, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:22:45,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=462234.0, ans=0.2 2023-06-19 12:23:11,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-19 12:23:32,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-19 12:23:39,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=462414.0, ans=0.125 2023-06-19 12:23:42,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-19 12:24:00,102 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 3.084e+02 3.532e+02 4.519e+02 7.240e+02, threshold=7.063e+02, percent-clipped=4.0 2023-06-19 12:24:23,177 INFO [train.py:996] (1/4) Epoch 3, batch 16100, loss[loss=0.3188, simple_loss=0.3635, pruned_loss=0.1371, over 21775.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3383, pruned_loss=0.1017, over 4258596.96 frames. ], batch size: 441, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:24:51,676 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:25:56,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=462834.0, ans=0.04949747468305833 2023-06-19 12:25:57,543 INFO [train.py:996] (1/4) Epoch 3, batch 16150, loss[loss=0.2549, simple_loss=0.3381, pruned_loss=0.08587, over 21627.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3387, pruned_loss=0.1041, over 4270955.87 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:26:12,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=462894.0, ans=0.125 2023-06-19 12:26:20,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462894.0, ans=0.1 2023-06-19 12:26:25,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=462894.0, ans=0.125 2023-06-19 12:26:41,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=462954.0, ans=0.04949747468305833 2023-06-19 12:27:00,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=463014.0, ans=0.0 2023-06-19 12:27:16,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.948e+02 3.427e+02 4.312e+02 9.423e+02, threshold=6.854e+02, percent-clipped=2.0 2023-06-19 12:27:18,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=463074.0, ans=0.09899494936611666 2023-06-19 12:27:39,977 INFO [train.py:996] (1/4) Epoch 3, batch 16200, loss[loss=0.373, simple_loss=0.4203, pruned_loss=0.1629, over 21503.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3434, pruned_loss=0.1064, over 4273480.44 frames. 
], batch size: 471, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:27:41,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=463134.0, ans=0.2 2023-06-19 12:28:20,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-19 12:29:00,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=463374.0, ans=0.125 2023-06-19 12:29:19,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=463374.0, ans=0.0 2023-06-19 12:29:21,907 INFO [train.py:996] (1/4) Epoch 3, batch 16250, loss[loss=0.2299, simple_loss=0.2904, pruned_loss=0.08474, over 21812.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3434, pruned_loss=0.107, over 4277542.68 frames. ], batch size: 118, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:30:45,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.795e+02 3.241e+02 4.405e+02 7.562e+02, threshold=6.482e+02, percent-clipped=2.0 2023-06-19 12:30:49,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=463674.0, ans=0.0 2023-06-19 12:31:03,296 INFO [train.py:996] (1/4) Epoch 3, batch 16300, loss[loss=0.3064, simple_loss=0.3797, pruned_loss=0.1166, over 21421.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3369, pruned_loss=0.1025, over 4277048.67 frames. ], batch size: 507, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:31:25,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-19 12:32:27,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-19 12:32:37,018 INFO [train.py:996] (1/4) Epoch 3, batch 16350, loss[loss=0.2816, simple_loss=0.3469, pruned_loss=0.1081, over 20723.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.338, pruned_loss=0.1041, over 4277091.84 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:32:49,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-19 12:33:10,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=464094.0, ans=0.1 2023-06-19 12:34:02,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.064e+02 3.648e+02 5.135e+02 1.076e+03, threshold=7.296e+02, percent-clipped=9.0 2023-06-19 12:34:17,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=464334.0, ans=0.125 2023-06-19 12:34:18,650 INFO [train.py:996] (1/4) Epoch 3, batch 16400, loss[loss=0.2597, simple_loss=0.32, pruned_loss=0.09976, over 21945.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3416, pruned_loss=0.1054, over 4279157.75 frames. ], batch size: 316, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:34:31,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.93 vs. 
limit=15.0 2023-06-19 12:34:47,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=464394.0, ans=0.125 2023-06-19 12:35:55,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-19 12:36:00,605 INFO [train.py:996] (1/4) Epoch 3, batch 16450, loss[loss=0.3054, simple_loss=0.3557, pruned_loss=0.1276, over 21865.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3409, pruned_loss=0.1062, over 4282684.89 frames. ], batch size: 298, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:36:01,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=464634.0, ans=0.125 2023-06-19 12:36:07,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=464634.0, ans=0.125 2023-06-19 12:36:17,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=464694.0, ans=0.04949747468305833 2023-06-19 12:36:54,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-19 12:37:21,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.011e+02 3.468e+02 3.986e+02 7.351e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-19 12:37:22,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=464874.0, ans=0.125 2023-06-19 12:37:38,556 INFO [train.py:996] (1/4) Epoch 3, batch 16500, loss[loss=0.1836, simple_loss=0.233, pruned_loss=0.06711, over 21247.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3391, pruned_loss=0.106, over 4279771.70 frames. ], batch size: 143, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:38:02,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=464994.0, ans=0.0 2023-06-19 12:38:32,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=465054.0, ans=0.035 2023-06-19 12:38:46,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=465114.0, ans=0.1 2023-06-19 12:39:00,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=465174.0, ans=0.125 2023-06-19 12:39:15,960 INFO [train.py:996] (1/4) Epoch 3, batch 16550, loss[loss=0.3003, simple_loss=0.3751, pruned_loss=0.1128, over 21254.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3358, pruned_loss=0.1025, over 4283549.59 frames. ], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:39:41,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-19 12:40:22,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=465354.0, ans=0.125 2023-06-19 12:40:25,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.63 vs. 
limit=22.5 2023-06-19 12:40:42,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.124e+02 3.787e+02 4.304e+02 9.133e+02, threshold=7.574e+02, percent-clipped=3.0 2023-06-19 12:41:09,186 INFO [train.py:996] (1/4) Epoch 3, batch 16600, loss[loss=0.4752, simple_loss=0.5504, pruned_loss=0.2, over 19705.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3479, pruned_loss=0.1074, over 4281207.06 frames. ], batch size: 702, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:41:09,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=465534.0, ans=0.125 2023-06-19 12:41:13,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=465534.0, ans=0.125 2023-06-19 12:41:57,877 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:42:58,986 INFO [train.py:996] (1/4) Epoch 3, batch 16650, loss[loss=0.2808, simple_loss=0.35, pruned_loss=0.1058, over 21800.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3587, pruned_loss=0.1113, over 4275999.08 frames. ], batch size: 282, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:43:11,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=465834.0, ans=0.0 2023-06-19 12:43:21,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=465894.0, ans=0.125 2023-06-19 12:43:33,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=465894.0, ans=0.125 2023-06-19 12:43:58,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=466014.0, ans=0.0 2023-06-19 12:44:12,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=466014.0, ans=0.2 2023-06-19 12:44:12,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=466014.0, ans=0.125 2023-06-19 12:44:27,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.247e+02 3.781e+02 4.657e+02 6.369e+02, threshold=7.563e+02, percent-clipped=0.0 2023-06-19 12:44:49,105 INFO [train.py:996] (1/4) Epoch 3, batch 16700, loss[loss=0.2876, simple_loss=0.3768, pruned_loss=0.09918, over 20728.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3601, pruned_loss=0.1119, over 4269228.47 frames. 
], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:45:21,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=466194.0, ans=0.0 2023-06-19 12:45:59,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=466314.0, ans=0.5 2023-06-19 12:46:07,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=466314.0, ans=0.125 2023-06-19 12:46:33,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=466434.0, ans=0.1 2023-06-19 12:46:33,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=466434.0, ans=0.2 2023-06-19 12:46:35,198 INFO [train.py:996] (1/4) Epoch 3, batch 16750, loss[loss=0.273, simple_loss=0.3579, pruned_loss=0.09411, over 19855.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3616, pruned_loss=0.1136, over 4260495.09 frames. ], batch size: 702, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:46:40,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=466434.0, ans=0.0 2023-06-19 12:47:10,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=466494.0, ans=0.0 2023-06-19 12:47:16,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=466494.0, ans=0.0 2023-06-19 12:48:00,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=466674.0, ans=0.0 2023-06-19 12:48:01,815 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.898e+02 3.370e+02 4.211e+02 9.702e+02, threshold=6.740e+02, percent-clipped=1.0 2023-06-19 12:48:22,820 INFO [train.py:996] (1/4) Epoch 3, batch 16800, loss[loss=0.3743, simple_loss=0.4158, pruned_loss=0.1664, over 21636.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3651, pruned_loss=0.1141, over 4259433.47 frames. ], batch size: 471, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:49:14,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466854.0, ans=0.0 2023-06-19 12:50:04,938 INFO [train.py:996] (1/4) Epoch 3, batch 16850, loss[loss=0.236, simple_loss=0.2995, pruned_loss=0.08629, over 21154.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3613, pruned_loss=0.1141, over 4264701.04 frames. 
], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:50:26,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=467094.0, ans=0.1 2023-06-19 12:50:29,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=467094.0, ans=0.125 2023-06-19 12:50:33,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=467094.0, ans=0.07 2023-06-19 12:50:34,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=467094.0, ans=0.0 2023-06-19 12:50:39,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=467094.0, ans=0.125 2023-06-19 12:50:53,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=467154.0, ans=0.125 2023-06-19 12:50:59,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=467154.0, ans=0.0 2023-06-19 12:51:25,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.898e+02 3.370e+02 4.330e+02 9.168e+02, threshold=6.739e+02, percent-clipped=5.0 2023-06-19 12:51:45,951 INFO [train.py:996] (1/4) Epoch 3, batch 16900, loss[loss=0.2182, simple_loss=0.2913, pruned_loss=0.07258, over 21677.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3545, pruned_loss=0.1114, over 4271046.34 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:51:46,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=467334.0, ans=0.2 2023-06-19 12:51:58,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=467334.0, ans=0.0 2023-06-19 12:52:35,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=467454.0, ans=0.125 2023-06-19 12:53:26,662 INFO [train.py:996] (1/4) Epoch 3, batch 16950, loss[loss=0.2746, simple_loss=0.3309, pruned_loss=0.1092, over 21920.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3464, pruned_loss=0.109, over 4266913.50 frames. ], batch size: 351, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:53:47,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=467694.0, ans=0.1 2023-06-19 12:54:30,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=467814.0, ans=0.0 2023-06-19 12:54:46,688 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.834e+02 3.306e+02 3.951e+02 5.809e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 12:54:48,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=467874.0, ans=0.0 2023-06-19 12:55:08,152 INFO [train.py:996] (1/4) Epoch 3, batch 17000, loss[loss=0.2633, simple_loss=0.3152, pruned_loss=0.1057, over 21790.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3436, pruned_loss=0.1095, over 4266912.37 frames. 
], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:55:20,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=467934.0, ans=0.125 2023-06-19 12:55:38,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-19 12:55:55,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=468054.0, ans=0.125 2023-06-19 12:56:16,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=468114.0, ans=0.125 2023-06-19 12:56:25,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=468114.0, ans=0.125 2023-06-19 12:56:29,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-06-19 12:56:49,418 INFO [train.py:996] (1/4) Epoch 3, batch 17050, loss[loss=0.2621, simple_loss=0.3351, pruned_loss=0.09456, over 21685.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3522, pruned_loss=0.1132, over 4267266.55 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:57:29,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=468294.0, ans=0.125 2023-06-19 12:57:34,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=468294.0, ans=0.125 2023-06-19 12:57:44,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=468354.0, ans=0.2 2023-06-19 12:58:14,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.024e+02 3.438e+02 4.032e+02 7.555e+02, threshold=6.877e+02, percent-clipped=1.0 2023-06-19 12:58:30,301 INFO [train.py:996] (1/4) Epoch 3, batch 17100, loss[loss=0.2631, simple_loss=0.3181, pruned_loss=0.104, over 21365.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3517, pruned_loss=0.1147, over 4271878.62 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:58:52,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=22.5 2023-06-19 12:59:01,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=468594.0, ans=0.0 2023-06-19 12:59:21,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=468654.0, ans=0.125 2023-06-19 12:59:31,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=12.0 2023-06-19 13:00:11,699 INFO [train.py:996] (1/4) Epoch 3, batch 17150, loss[loss=0.2476, simple_loss=0.315, pruned_loss=0.09011, over 21844.00 frames. ], tot_loss[loss=0.2861, simple_loss=0.3466, pruned_loss=0.1128, over 4272882.60 frames. 
], batch size: 118, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:00:25,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=468834.0, ans=0.0 2023-06-19 13:01:21,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=469014.0, ans=0.1 2023-06-19 13:01:33,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.33 vs. limit=10.0 2023-06-19 13:01:38,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 2.874e+02 3.285e+02 3.849e+02 6.375e+02, threshold=6.570e+02, percent-clipped=0.0 2023-06-19 13:02:09,758 INFO [train.py:996] (1/4) Epoch 3, batch 17200, loss[loss=0.3307, simple_loss=0.3839, pruned_loss=0.1388, over 21375.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3459, pruned_loss=0.1114, over 4276145.90 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:02:46,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-19 13:02:48,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-19 13:02:51,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469254.0, ans=0.125 2023-06-19 13:03:53,527 INFO [train.py:996] (1/4) Epoch 3, batch 17250, loss[loss=0.2991, simple_loss=0.3666, pruned_loss=0.1159, over 21483.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3504, pruned_loss=0.1144, over 4280327.66 frames. ], batch size: 211, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:04:11,798 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:04:18,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=469494.0, ans=0.0 2023-06-19 13:04:26,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=469554.0, ans=0.2 2023-06-19 13:04:46,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=469614.0, ans=0.125 2023-06-19 13:05:13,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=469614.0, ans=0.2 2023-06-19 13:05:17,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=469614.0, ans=0.0 2023-06-19 13:05:20,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.292e+02 3.993e+02 5.117e+02 9.442e+02, threshold=7.987e+02, percent-clipped=7.0 2023-06-19 13:05:27,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=469674.0, ans=0.125 2023-06-19 13:05:37,144 INFO [train.py:996] (1/4) Epoch 3, batch 17300, loss[loss=0.2908, simple_loss=0.3442, pruned_loss=0.1187, over 21808.00 frames. ], tot_loss[loss=0.2976, simple_loss=0.3593, pruned_loss=0.118, over 4275651.12 frames. 
], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:06:11,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=469794.0, ans=0.0 2023-06-19 13:06:46,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=469914.0, ans=0.125 2023-06-19 13:07:15,532 INFO [train.py:996] (1/4) Epoch 3, batch 17350, loss[loss=0.2225, simple_loss=0.3041, pruned_loss=0.07045, over 21606.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3584, pruned_loss=0.1165, over 4269657.60 frames. ], batch size: 230, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:07:41,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=470094.0, ans=0.125 2023-06-19 13:07:42,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=470094.0, ans=0.0 2023-06-19 13:07:48,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-19 13:08:37,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=470214.0, ans=0.0 2023-06-19 13:08:42,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.889e+02 3.414e+02 4.320e+02 8.908e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 13:08:58,904 INFO [train.py:996] (1/4) Epoch 3, batch 17400, loss[loss=0.2987, simple_loss=0.365, pruned_loss=0.1162, over 21751.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3558, pruned_loss=0.1125, over 4271915.07 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:09:54,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=470454.0, ans=0.125 2023-06-19 13:10:05,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=470454.0, ans=0.125 2023-06-19 13:10:38,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=470574.0, ans=0.0 2023-06-19 13:10:47,896 INFO [train.py:996] (1/4) Epoch 3, batch 17450, loss[loss=0.2328, simple_loss=0.2973, pruned_loss=0.08411, over 21198.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.352, pruned_loss=0.1097, over 4272050.85 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:11:35,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=470754.0, ans=0.2 2023-06-19 13:12:06,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.874e+02 3.534e+02 4.725e+02 8.315e+02, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 13:12:27,737 INFO [train.py:996] (1/4) Epoch 3, batch 17500, loss[loss=0.2817, simple_loss=0.3434, pruned_loss=0.11, over 21696.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3445, pruned_loss=0.1051, over 4272815.51 frames. 
], batch size: 389, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:13:05,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=470994.0, ans=0.125 2023-06-19 13:13:36,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471114.0, ans=0.125 2023-06-19 13:14:07,330 INFO [train.py:996] (1/4) Epoch 3, batch 17550, loss[loss=0.2484, simple_loss=0.3348, pruned_loss=0.08094, over 21740.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3442, pruned_loss=0.1037, over 4271416.48 frames. ], batch size: 112, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:14:12,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=471234.0, ans=0.2 2023-06-19 13:14:39,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=471294.0, ans=0.125 2023-06-19 13:14:52,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=471354.0, ans=0.125 2023-06-19 13:15:26,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.757e+02 3.626e+02 4.370e+02 8.420e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-19 13:15:46,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.85 vs. limit=22.5 2023-06-19 13:15:48,149 INFO [train.py:996] (1/4) Epoch 3, batch 17600, loss[loss=0.2521, simple_loss=0.3351, pruned_loss=0.08454, over 16248.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3456, pruned_loss=0.103, over 4261938.16 frames. ], batch size: 61, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:16:11,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=471534.0, ans=0.125 2023-06-19 13:16:12,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=471534.0, ans=0.125 2023-06-19 13:16:24,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=471594.0, ans=10.0 2023-06-19 13:17:04,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=471714.0, ans=0.1 2023-06-19 13:17:35,466 INFO [train.py:996] (1/4) Epoch 3, batch 17650, loss[loss=0.2905, simple_loss=0.36, pruned_loss=0.1106, over 21186.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3442, pruned_loss=0.1036, over 4258201.98 frames. 
], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:18:07,422 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:18:10,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=471894.0, ans=0.0 2023-06-19 13:18:56,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.888e+02 3.326e+02 4.505e+02 7.697e+02, threshold=6.651e+02, percent-clipped=2.0 2023-06-19 13:19:09,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=472074.0, ans=0.2 2023-06-19 13:19:17,600 INFO [train.py:996] (1/4) Epoch 3, batch 17700, loss[loss=0.3207, simple_loss=0.3866, pruned_loss=0.1274, over 21764.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3371, pruned_loss=0.1, over 4262734.84 frames. ], batch size: 124, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:19:51,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472194.0, ans=0.1 2023-06-19 13:19:52,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-19 13:20:49,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=472374.0, ans=0.0 2023-06-19 13:20:59,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472374.0, ans=0.1 2023-06-19 13:21:10,361 INFO [train.py:996] (1/4) Epoch 3, batch 17750, loss[loss=0.3601, simple_loss=0.4067, pruned_loss=0.1567, over 21447.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.347, pruned_loss=0.1062, over 4262126.02 frames. ], batch size: 471, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:21:25,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=472494.0, ans=0.2 2023-06-19 13:21:54,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=472554.0, ans=0.125 2023-06-19 13:21:59,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=472554.0, ans=0.0 2023-06-19 13:22:32,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.975e+02 3.463e+02 4.383e+02 8.374e+02, threshold=6.927e+02, percent-clipped=5.0 2023-06-19 13:22:35,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-19 13:22:54,290 INFO [train.py:996] (1/4) Epoch 3, batch 17800, loss[loss=0.2227, simple_loss=0.303, pruned_loss=0.07123, over 21590.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3463, pruned_loss=0.1051, over 4266495.40 frames. ], batch size: 263, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:23:05,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. 
limit=6.0 2023-06-19 13:23:25,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=472794.0, ans=0.125 2023-06-19 13:23:29,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 13:24:00,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=472914.0, ans=0.125 2023-06-19 13:24:20,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0 2023-06-19 13:24:37,162 INFO [train.py:996] (1/4) Epoch 3, batch 17850, loss[loss=0.416, simple_loss=0.4446, pruned_loss=0.1937, over 21348.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3471, pruned_loss=0.1059, over 4263494.01 frames. ], batch size: 507, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:24:44,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-19 13:24:45,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=473034.0, ans=0.0 2023-06-19 13:25:22,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=473154.0, ans=0.125 2023-06-19 13:25:24,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=473154.0, ans=0.0 2023-06-19 13:26:02,586 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.042e+02 3.981e+02 5.013e+02 8.666e+02, threshold=7.962e+02, percent-clipped=5.0 2023-06-19 13:26:05,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=473274.0, ans=0.2 2023-06-19 13:26:05,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=473274.0, ans=0.125 2023-06-19 13:26:10,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=473274.0, ans=0.125 2023-06-19 13:26:12,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=473274.0, ans=0.125 2023-06-19 13:26:18,583 INFO [train.py:996] (1/4) Epoch 3, batch 17900, loss[loss=0.2635, simple_loss=0.3525, pruned_loss=0.08727, over 21557.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3538, pruned_loss=0.1088, over 4259185.18 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:26:18,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=473334.0, ans=0.125 2023-06-19 13:26:57,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=473394.0, ans=0.05 2023-06-19 13:28:02,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=473574.0, ans=0.0 2023-06-19 13:28:06,415 INFO [train.py:996] (1/4) Epoch 3, batch 17950, loss[loss=0.2167, simple_loss=0.2956, pruned_loss=0.06892, over 21304.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3526, pruned_loss=0.1056, over 4259293.37 frames. 
], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:28:10,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. limit=10.0 2023-06-19 13:28:22,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=473634.0, ans=0.2 2023-06-19 13:28:35,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-19 13:28:59,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=473754.0, ans=0.0 2023-06-19 13:29:18,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=473814.0, ans=0.125 2023-06-19 13:29:26,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.695e+02 3.486e+02 4.539e+02 1.017e+03, threshold=6.972e+02, percent-clipped=4.0 2023-06-19 13:29:43,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=473874.0, ans=0.07 2023-06-19 13:29:43,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=473874.0, ans=0.0 2023-06-19 13:29:47,379 INFO [train.py:996] (1/4) Epoch 3, batch 18000, loss[loss=0.28, simple_loss=0.3173, pruned_loss=0.1213, over 17336.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3454, pruned_loss=0.1033, over 4247877.76 frames. ], batch size: 74, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:29:47,379 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 13:30:08,397 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2748, simple_loss=0.3795, pruned_loss=0.08502, over 1796401.00 frames. 2023-06-19 13:30:08,398 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 13:30:10,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473934.0, ans=0.1 2023-06-19 13:30:31,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-19 13:31:49,929 INFO [train.py:996] (1/4) Epoch 3, batch 18050, loss[loss=0.241, simple_loss=0.3015, pruned_loss=0.0903, over 21484.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3392, pruned_loss=0.1027, over 4252425.16 frames. ], batch size: 195, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:32:07,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=474234.0, ans=0.125 2023-06-19 13:32:28,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=474294.0, ans=0.0 2023-06-19 13:33:10,551 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.274e+02 3.707e+02 4.391e+02 9.006e+02, threshold=7.414e+02, percent-clipped=2.0 2023-06-19 13:33:32,296 INFO [train.py:996] (1/4) Epoch 3, batch 18100, loss[loss=0.2931, simple_loss=0.3492, pruned_loss=0.1185, over 21149.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3442, pruned_loss=0.1058, over 4262349.20 frames. 
], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:34:07,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=474594.0, ans=0.125 2023-06-19 13:34:13,840 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:34:50,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-19 13:35:08,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-19 13:35:09,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474774.0, ans=0.1 2023-06-19 13:35:18,752 INFO [train.py:996] (1/4) Epoch 3, batch 18150, loss[loss=0.2601, simple_loss=0.3175, pruned_loss=0.1013, over 21261.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3458, pruned_loss=0.105, over 4271638.05 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:35:20,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=474834.0, ans=0.125 2023-06-19 13:35:55,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474954.0, ans=0.1 2023-06-19 13:35:57,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.68 vs. limit=10.0 2023-06-19 13:36:05,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=474954.0, ans=0.125 2023-06-19 13:36:31,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.099e+02 3.760e+02 4.824e+02 9.400e+02, threshold=7.520e+02, percent-clipped=8.0 2023-06-19 13:36:52,781 INFO [train.py:996] (1/4) Epoch 3, batch 18200, loss[loss=0.2739, simple_loss=0.3214, pruned_loss=0.1132, over 21859.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3403, pruned_loss=0.1062, over 4279826.81 frames. ], batch size: 107, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:37:38,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=475254.0, ans=0.125 2023-06-19 13:38:31,996 INFO [train.py:996] (1/4) Epoch 3, batch 18250, loss[loss=0.3045, simple_loss=0.3514, pruned_loss=0.1288, over 21778.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.33, pruned_loss=0.1011, over 4269262.05 frames. ], batch size: 441, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:38:35,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-19 13:39:43,181 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:39:45,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.479e+02 3.020e+02 3.989e+02 8.042e+02, threshold=6.040e+02, percent-clipped=2.0 2023-06-19 13:39:47,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. 
limit=15.0 2023-06-19 13:40:02,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=475674.0, ans=0.125 2023-06-19 13:40:06,820 INFO [train.py:996] (1/4) Epoch 3, batch 18300, loss[loss=0.3228, simple_loss=0.4126, pruned_loss=0.1165, over 21837.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3304, pruned_loss=0.1014, over 4276493.68 frames. ], batch size: 371, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:40:07,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=475734.0, ans=0.125 2023-06-19 13:40:08,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=475734.0, ans=0.125 2023-06-19 13:41:46,932 INFO [train.py:996] (1/4) Epoch 3, batch 18350, loss[loss=0.2377, simple_loss=0.3089, pruned_loss=0.0833, over 21574.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3388, pruned_loss=0.1026, over 4279436.74 frames. ], batch size: 263, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:42:06,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=476094.0, ans=0.125 2023-06-19 13:42:59,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-19 13:43:00,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=476214.0, ans=0.125 2023-06-19 13:43:08,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.933e+02 3.430e+02 4.228e+02 7.523e+02, threshold=6.860e+02, percent-clipped=6.0 2023-06-19 13:43:22,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=476274.0, ans=0.125 2023-06-19 13:43:28,030 INFO [train.py:996] (1/4) Epoch 3, batch 18400, loss[loss=0.2073, simple_loss=0.2953, pruned_loss=0.05965, over 21668.00 frames. ], tot_loss[loss=0.266, simple_loss=0.332, pruned_loss=0.1, over 4267481.02 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:43:46,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=476334.0, ans=0.0 2023-06-19 13:43:49,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=476394.0, ans=0.035 2023-06-19 13:44:30,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=476514.0, ans=0.2 2023-06-19 13:44:49,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=476574.0, ans=0.125 2023-06-19 13:45:05,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=476574.0, ans=0.125 2023-06-19 13:45:08,765 INFO [train.py:996] (1/4) Epoch 3, batch 18450, loss[loss=0.2076, simple_loss=0.2815, pruned_loss=0.06686, over 21209.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3281, pruned_loss=0.09599, over 4258094.73 frames. 
], batch size: 176, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:46:02,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=476754.0, ans=0.0 2023-06-19 13:46:16,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-19 13:46:31,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.812e+02 3.346e+02 4.382e+02 1.092e+03, threshold=6.692e+02, percent-clipped=3.0 2023-06-19 13:46:49,830 INFO [train.py:996] (1/4) Epoch 3, batch 18500, loss[loss=0.3156, simple_loss=0.4443, pruned_loss=0.09348, over 19764.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.326, pruned_loss=0.09544, over 4252366.65 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:47:08,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=476934.0, ans=15.0 2023-06-19 13:47:18,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-19 13:47:28,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=476994.0, ans=0.2 2023-06-19 13:47:46,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=477054.0, ans=0.0 2023-06-19 13:48:15,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=477174.0, ans=0.0 2023-06-19 13:48:20,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=477174.0, ans=10.0 2023-06-19 13:48:30,908 INFO [train.py:996] (1/4) Epoch 3, batch 18550, loss[loss=0.2523, simple_loss=0.3116, pruned_loss=0.09648, over 21401.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3259, pruned_loss=0.09585, over 4242724.31 frames. ], batch size: 132, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:48:33,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=477234.0, ans=15.0 2023-06-19 13:48:54,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=477294.0, ans=0.125 2023-06-19 13:48:56,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-19 13:49:09,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=477294.0, ans=0.125 2023-06-19 13:49:31,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-19 13:49:57,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.41 vs. 
limit=15.0 2023-06-19 13:49:59,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.099e+02 3.527e+02 4.215e+02 7.049e+02, threshold=7.053e+02, percent-clipped=1.0 2023-06-19 13:50:06,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=477474.0, ans=0.0 2023-06-19 13:50:11,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=477534.0, ans=0.125 2023-06-19 13:50:13,142 INFO [train.py:996] (1/4) Epoch 3, batch 18600, loss[loss=0.2648, simple_loss=0.3359, pruned_loss=0.09687, over 21663.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3228, pruned_loss=0.09662, over 4224968.93 frames. ], batch size: 391, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:50:13,710 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:51:19,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=477714.0, ans=0.07 2023-06-19 13:51:59,731 INFO [train.py:996] (1/4) Epoch 3, batch 18650, loss[loss=0.2547, simple_loss=0.314, pruned_loss=0.09771, over 21632.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3213, pruned_loss=0.09626, over 4240948.38 frames. ], batch size: 415, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:52:41,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=477954.0, ans=0.0 2023-06-19 13:53:20,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=478074.0, ans=0.125 2023-06-19 13:53:21,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.016e+02 3.610e+02 4.241e+02 7.263e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-19 13:53:33,746 INFO [train.py:996] (1/4) Epoch 3, batch 18700, loss[loss=0.2674, simple_loss=0.3181, pruned_loss=0.1084, over 21367.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3193, pruned_loss=0.09773, over 4247192.85 frames. ], batch size: 144, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:54:37,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-19 13:54:51,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=478314.0, ans=0.125 2023-06-19 13:55:15,239 INFO [train.py:996] (1/4) Epoch 3, batch 18750, loss[loss=0.3967, simple_loss=0.4488, pruned_loss=0.1723, over 21498.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3228, pruned_loss=0.1013, over 4259686.04 frames. ], batch size: 471, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:55:32,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=478434.0, ans=0.0 2023-06-19 13:56:10,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-19 13:56:43,450 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.016e+02 3.473e+02 4.351e+02 6.634e+02, threshold=6.946e+02, percent-clipped=0.0 2023-06-19 13:56:56,572 INFO [train.py:996] (1/4) Epoch 3, batch 18800, loss[loss=0.1677, simple_loss=0.2348, pruned_loss=0.05025, over 16778.00 frames. 
], tot_loss[loss=0.2656, simple_loss=0.3273, pruned_loss=0.102, over 4261357.68 frames. ], batch size: 62, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:57:17,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-19 13:57:38,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478794.0, ans=0.1 2023-06-19 13:57:59,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=478914.0, ans=0.1 2023-06-19 13:58:00,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=478914.0, ans=0.1 2023-06-19 13:58:39,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=478974.0, ans=0.125 2023-06-19 13:58:44,186 INFO [train.py:996] (1/4) Epoch 3, batch 18850, loss[loss=0.2323, simple_loss=0.3059, pruned_loss=0.07938, over 21354.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3214, pruned_loss=0.09571, over 4251446.97 frames. ], batch size: 211, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:58:57,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=479034.0, ans=0.07 2023-06-19 13:59:10,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=479094.0, ans=0.0 2023-06-19 13:59:17,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-19 13:59:56,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=479214.0, ans=0.125 2023-06-19 14:00:08,732 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.738e+02 3.222e+02 4.135e+02 8.390e+02, threshold=6.445e+02, percent-clipped=2.0 2023-06-19 14:00:24,830 INFO [train.py:996] (1/4) Epoch 3, batch 18900, loss[loss=0.2817, simple_loss=0.3501, pruned_loss=0.1067, over 20023.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3195, pruned_loss=0.09601, over 4245250.51 frames. ], batch size: 702, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:00:32,162 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-19 14:00:33,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=479334.0, ans=0.0 2023-06-19 14:02:03,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=479574.0, ans=0.025 2023-06-19 14:02:07,433 INFO [train.py:996] (1/4) Epoch 3, batch 18950, loss[loss=0.2887, simple_loss=0.3312, pruned_loss=0.1231, over 21342.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3221, pruned_loss=0.09954, over 4259759.69 frames. 
], batch size: 144, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:02:12,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=479634.0, ans=0.0 2023-06-19 14:02:38,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=10.0 2023-06-19 14:02:44,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=479694.0, ans=0.125 2023-06-19 14:03:15,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=479814.0, ans=0.0 2023-06-19 14:03:40,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.897e+02 3.488e+02 4.402e+02 6.601e+02, threshold=6.976e+02, percent-clipped=2.0 2023-06-19 14:03:42,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=479874.0, ans=0.125 2023-06-19 14:03:56,916 INFO [train.py:996] (1/4) Epoch 3, batch 19000, loss[loss=0.2944, simple_loss=0.3458, pruned_loss=0.1214, over 21181.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3314, pruned_loss=0.1016, over 4266583.59 frames. ], batch size: 143, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:04:34,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-19 14:05:08,414 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:05:39,659 INFO [train.py:996] (1/4) Epoch 3, batch 19050, loss[loss=0.2777, simple_loss=0.3233, pruned_loss=0.116, over 21180.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3374, pruned_loss=0.1065, over 4274317.57 frames. ], batch size: 608, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:06:12,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=480294.0, ans=0.125 2023-06-19 14:07:04,465 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.242e+02 3.668e+02 4.263e+02 6.635e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 14:07:07,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=480474.0, ans=0.125 2023-06-19 14:07:21,775 INFO [train.py:996] (1/4) Epoch 3, batch 19100, loss[loss=0.2828, simple_loss=0.3251, pruned_loss=0.1202, over 21265.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3362, pruned_loss=0.108, over 4277796.47 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:07:47,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=480594.0, ans=0.0 2023-06-19 14:07:54,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.47 vs. limit=10.0 2023-06-19 14:09:11,234 INFO [train.py:996] (1/4) Epoch 3, batch 19150, loss[loss=0.2859, simple_loss=0.3748, pruned_loss=0.09851, over 21686.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3398, pruned_loss=0.1094, over 4261806.97 frames. 
], batch size: 298, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:09:25,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=480834.0, ans=0.035 2023-06-19 14:09:25,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=480834.0, ans=0.125 2023-06-19 14:09:40,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=480894.0, ans=0.0 2023-06-19 14:09:44,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=480894.0, ans=0.125 2023-06-19 14:10:20,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=481014.0, ans=0.125 2023-06-19 14:10:43,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.009e+02 3.597e+02 4.510e+02 7.028e+02, threshold=7.194e+02, percent-clipped=0.0 2023-06-19 14:10:47,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481074.0, ans=0.1 2023-06-19 14:10:55,125 INFO [train.py:996] (1/4) Epoch 3, batch 19200, loss[loss=0.2812, simple_loss=0.3714, pruned_loss=0.09549, over 21734.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.35, pruned_loss=0.1095, over 4267972.06 frames. ], batch size: 332, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:10:59,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-19 14:11:38,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=481254.0, ans=0.2 2023-06-19 14:12:16,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=481374.0, ans=0.0 2023-06-19 14:12:35,834 INFO [train.py:996] (1/4) Epoch 3, batch 19250, loss[loss=0.2282, simple_loss=0.3178, pruned_loss=0.06933, over 21731.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3489, pruned_loss=0.1035, over 4271067.19 frames. ], batch size: 414, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:12:55,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=481494.0, ans=0.2 2023-06-19 14:13:24,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=481554.0, ans=0.125 2023-06-19 14:13:54,152 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.436e+02 3.016e+02 3.592e+02 9.679e+02, threshold=6.032e+02, percent-clipped=2.0 2023-06-19 14:14:10,909 INFO [train.py:996] (1/4) Epoch 3, batch 19300, loss[loss=0.2471, simple_loss=0.3147, pruned_loss=0.08977, over 21779.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3446, pruned_loss=0.1028, over 4282786.62 frames. 
], batch size: 298, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:14:14,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=481734.0, ans=0.125 2023-06-19 14:14:26,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481734.0, ans=0.1 2023-06-19 14:14:31,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=481794.0, ans=0.0 2023-06-19 14:15:21,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=481914.0, ans=0.125 2023-06-19 14:15:54,288 INFO [train.py:996] (1/4) Epoch 3, batch 19350, loss[loss=0.2017, simple_loss=0.2772, pruned_loss=0.06314, over 21477.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3393, pruned_loss=0.09847, over 4274850.18 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:16:59,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=482214.0, ans=0.125 2023-06-19 14:16:59,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=482214.0, ans=0.2 2023-06-19 14:17:13,361 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.752e+02 3.460e+02 4.444e+02 7.574e+02, threshold=6.920e+02, percent-clipped=6.0 2023-06-19 14:17:13,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482274.0, ans=0.1 2023-06-19 14:17:17,196 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:17:24,654 INFO [train.py:996] (1/4) Epoch 3, batch 19400, loss[loss=0.2467, simple_loss=0.308, pruned_loss=0.09266, over 21797.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3356, pruned_loss=0.09666, over 4276048.41 frames. ], batch size: 247, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:17:36,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=482334.0, ans=0.0 2023-06-19 14:17:54,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=482394.0, ans=0.2 2023-06-19 14:19:05,665 INFO [train.py:996] (1/4) Epoch 3, batch 19450, loss[loss=0.2579, simple_loss=0.3058, pruned_loss=0.1049, over 21505.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3347, pruned_loss=0.1002, over 4287559.05 frames. 
], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:19:23,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=482634.0, ans=0.125 2023-06-19 14:19:27,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=482694.0, ans=0.125 2023-06-19 14:20:15,919 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:20:23,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=482814.0, ans=0.125 2023-06-19 14:20:37,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.021e+02 3.528e+02 4.324e+02 6.786e+02, threshold=7.055e+02, percent-clipped=0.0 2023-06-19 14:20:52,465 INFO [train.py:996] (1/4) Epoch 3, batch 19500, loss[loss=0.2736, simple_loss=0.3305, pruned_loss=0.1084, over 21631.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3304, pruned_loss=0.1018, over 4284784.77 frames. ], batch size: 298, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:21:02,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482934.0, ans=0.1 2023-06-19 14:21:11,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=482994.0, ans=0.125 2023-06-19 14:21:12,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=482994.0, ans=0.0 2023-06-19 14:21:22,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=482994.0, ans=0.125 2023-06-19 14:21:36,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-19 14:21:59,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=483114.0, ans=0.0 2023-06-19 14:22:33,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=483234.0, ans=0.125 2023-06-19 14:22:34,931 INFO [train.py:996] (1/4) Epoch 3, batch 19550, loss[loss=0.2546, simple_loss=0.3458, pruned_loss=0.08172, over 21655.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3251, pruned_loss=0.0992, over 4280143.42 frames. ], batch size: 389, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:23:01,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=483294.0, ans=0.07 2023-06-19 14:23:04,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=483294.0, ans=0.0 2023-06-19 14:23:15,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.34 vs. 
limit=22.5 2023-06-19 14:23:28,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=483354.0, ans=0.5 2023-06-19 14:23:37,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=483414.0, ans=0.125 2023-06-19 14:24:06,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.887e+02 3.715e+02 4.750e+02 9.269e+02, threshold=7.430e+02, percent-clipped=4.0 2023-06-19 14:24:16,395 INFO [train.py:996] (1/4) Epoch 3, batch 19600, loss[loss=0.2623, simple_loss=0.314, pruned_loss=0.1053, over 21509.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3267, pruned_loss=0.0995, over 4281638.36 frames. ], batch size: 212, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:24:28,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=483534.0, ans=0.125 2023-06-19 14:25:00,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=483654.0, ans=0.125 2023-06-19 14:25:02,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-19 14:25:31,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=483714.0, ans=0.0 2023-06-19 14:25:58,826 INFO [train.py:996] (1/4) Epoch 3, batch 19650, loss[loss=0.3699, simple_loss=0.4053, pruned_loss=0.1673, over 21606.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3343, pruned_loss=0.1056, over 4273291.51 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:26:01,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=483834.0, ans=0.0 2023-06-19 14:26:18,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-19 14:26:29,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=483894.0, ans=0.125 2023-06-19 14:27:34,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.992e+02 3.430e+02 3.953e+02 7.302e+02, threshold=6.859e+02, percent-clipped=0.0 2023-06-19 14:27:44,511 INFO [train.py:996] (1/4) Epoch 3, batch 19700, loss[loss=0.3128, simple_loss=0.3916, pruned_loss=0.1171, over 21576.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3367, pruned_loss=0.1054, over 4267921.91 frames. ], batch size: 473, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:27:48,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-19 14:28:16,060 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:28:29,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-19 14:29:33,043 INFO [train.py:996] (1/4) Epoch 3, batch 19750, loss[loss=0.3038, simple_loss=0.3761, pruned_loss=0.1158, over 21502.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3439, pruned_loss=0.1058, over 4268657.63 frames. 
], batch size: 548, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:30:04,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-19 14:30:07,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-19 14:30:08,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=484494.0, ans=0.1 2023-06-19 14:30:38,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=484614.0, ans=0.125 2023-06-19 14:31:05,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.206e+02 3.832e+02 4.660e+02 9.927e+02, threshold=7.664e+02, percent-clipped=2.0 2023-06-19 14:31:15,136 INFO [train.py:996] (1/4) Epoch 3, batch 19800, loss[loss=0.2693, simple_loss=0.3454, pruned_loss=0.09665, over 21786.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3463, pruned_loss=0.1073, over 4269396.92 frames. ], batch size: 415, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:31:30,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=484734.0, ans=0.125 2023-06-19 14:32:11,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=484854.0, ans=0.1 2023-06-19 14:32:33,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=484914.0, ans=0.125 2023-06-19 14:33:03,124 INFO [train.py:996] (1/4) Epoch 3, batch 19850, loss[loss=0.1401, simple_loss=0.1817, pruned_loss=0.04928, over 17428.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3365, pruned_loss=0.1011, over 4262148.62 frames. ], batch size: 61, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:33:59,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=485154.0, ans=0.2 2023-06-19 14:34:06,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=485214.0, ans=0.0 2023-06-19 14:34:10,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=485214.0, ans=0.125 2023-06-19 14:34:28,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=485274.0, ans=0.0 2023-06-19 14:34:29,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.663e+02 3.192e+02 3.932e+02 5.931e+02, threshold=6.384e+02, percent-clipped=0.0 2023-06-19 14:34:45,233 INFO [train.py:996] (1/4) Epoch 3, batch 19900, loss[loss=0.2289, simple_loss=0.2932, pruned_loss=0.08231, over 21985.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3384, pruned_loss=0.09892, over 4265937.96 frames. 
], batch size: 103, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:35:06,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=485394.0, ans=0.125 2023-06-19 14:35:17,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=485394.0, ans=0.125 2023-06-19 14:35:41,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=485454.0, ans=0.2 2023-06-19 14:35:42,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=485454.0, ans=6.0 2023-06-19 14:35:46,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=485514.0, ans=0.125 2023-06-19 14:36:19,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=12.0 2023-06-19 14:36:25,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=485574.0, ans=0.125 2023-06-19 14:36:33,272 INFO [train.py:996] (1/4) Epoch 3, batch 19950, loss[loss=0.2522, simple_loss=0.2994, pruned_loss=0.1025, over 14932.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3319, pruned_loss=0.09882, over 4265273.49 frames. ], batch size: 61, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:37:18,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=485754.0, ans=0.07 2023-06-19 14:37:20,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.51 vs. limit=10.0 2023-06-19 14:37:39,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485814.0, ans=0.1 2023-06-19 14:37:44,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=485814.0, ans=0.0 2023-06-19 14:37:45,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-19 14:37:59,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.893e+02 3.575e+02 4.384e+02 6.859e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-19 14:38:00,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=485874.0, ans=0.5 2023-06-19 14:38:14,238 INFO [train.py:996] (1/4) Epoch 3, batch 20000, loss[loss=0.3113, simple_loss=0.3706, pruned_loss=0.126, over 21839.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3332, pruned_loss=0.09951, over 4273315.66 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:38:55,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=486054.0, ans=0.125 2023-06-19 14:39:17,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=486114.0, ans=0.0 2023-06-19 14:39:54,942 INFO [train.py:996] (1/4) Epoch 3, batch 20050, loss[loss=0.2834, simple_loss=0.3433, pruned_loss=0.1118, over 21866.00 frames. 
], tot_loss[loss=0.2701, simple_loss=0.3356, pruned_loss=0.1023, over 4277806.97 frames. ], batch size: 371, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:40:43,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=486354.0, ans=0.1 2023-06-19 14:41:28,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.834e+02 3.316e+02 3.890e+02 7.458e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-19 14:41:38,305 INFO [train.py:996] (1/4) Epoch 3, batch 20100, loss[loss=0.2353, simple_loss=0.2717, pruned_loss=0.09946, over 17057.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3391, pruned_loss=0.1064, over 4278358.76 frames. ], batch size: 61, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:41:38,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=486534.0, ans=0.0 2023-06-19 14:42:01,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=486594.0, ans=0.2 2023-06-19 14:42:03,311 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:42:30,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-19 14:42:35,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=486654.0, ans=0.125 2023-06-19 14:43:03,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=486714.0, ans=0.2 2023-06-19 14:43:12,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-19 14:43:27,778 INFO [train.py:996] (1/4) Epoch 3, batch 20150, loss[loss=0.3406, simple_loss=0.4057, pruned_loss=0.1378, over 21856.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3509, pruned_loss=0.1105, over 4277587.15 frames. ], batch size: 124, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:43:48,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=486894.0, ans=0.125 2023-06-19 14:44:06,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-19 14:45:02,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=487074.0, ans=0.125 2023-06-19 14:45:05,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.281e+02 3.907e+02 5.074e+02 8.084e+02, threshold=7.814e+02, percent-clipped=7.0 2023-06-19 14:45:11,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-19 14:45:13,852 INFO [train.py:996] (1/4) Epoch 3, batch 20200, loss[loss=0.1735, simple_loss=0.2027, pruned_loss=0.07216, over 16185.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3581, pruned_loss=0.1153, over 4276978.25 frames. 
], batch size: 60, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:45:46,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-19 14:46:12,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=487254.0, ans=0.125 2023-06-19 14:46:31,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=487314.0, ans=0.125 2023-06-19 14:46:36,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=487314.0, ans=0.125 2023-06-19 14:46:37,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-19 14:46:41,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=487374.0, ans=0.2 2023-06-19 14:47:01,478 INFO [train.py:996] (1/4) Epoch 3, batch 20250, loss[loss=0.2497, simple_loss=0.3279, pruned_loss=0.08571, over 21739.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3581, pruned_loss=0.1126, over 4266881.46 frames. ], batch size: 247, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:47:22,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.21 vs. limit=6.0 2023-06-19 14:48:03,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=487554.0, ans=0.2 2023-06-19 14:48:09,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=487614.0, ans=0.125 2023-06-19 14:48:29,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.865e+02 3.489e+02 4.461e+02 6.612e+02, threshold=6.978e+02, percent-clipped=0.0 2023-06-19 14:48:37,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=487674.0, ans=0.125 2023-06-19 14:48:43,465 INFO [train.py:996] (1/4) Epoch 3, batch 20300, loss[loss=0.2378, simple_loss=0.3095, pruned_loss=0.08302, over 21224.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3538, pruned_loss=0.1086, over 4264779.24 frames. ], batch size: 159, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:48:50,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487734.0, ans=0.1 2023-06-19 14:49:03,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=487794.0, ans=0.2 2023-06-19 14:49:29,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=487854.0, ans=0.1 2023-06-19 14:50:08,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=487974.0, ans=0.04949747468305833 2023-06-19 14:50:23,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=488034.0, ans=0.2 2023-06-19 14:50:24,295 INFO [train.py:996] (1/4) Epoch 3, batch 20350, loss[loss=0.2811, simple_loss=0.3435, pruned_loss=0.1093, over 21249.00 frames. 
], tot_loss[loss=0.2851, simple_loss=0.3528, pruned_loss=0.1086, over 4265169.95 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:50:24,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=488034.0, ans=0.125 2023-06-19 14:50:29,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-19 14:51:13,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=488154.0, ans=0.125 2023-06-19 14:51:26,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=488214.0, ans=0.2 2023-06-19 14:51:45,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=488274.0, ans=0.125 2023-06-19 14:51:51,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.942e+02 3.646e+02 4.954e+02 9.108e+02, threshold=7.293e+02, percent-clipped=8.0 2023-06-19 14:52:05,465 INFO [train.py:996] (1/4) Epoch 3, batch 20400, loss[loss=0.2289, simple_loss=0.298, pruned_loss=0.07992, over 17285.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3557, pruned_loss=0.1118, over 4264302.37 frames. ], batch size: 64, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:52:14,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.34 vs. limit=10.0 2023-06-19 14:52:24,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=488394.0, ans=0.0 2023-06-19 14:52:48,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-19 14:53:28,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=488574.0, ans=0.125 2023-06-19 14:53:41,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488634.0, ans=0.0 2023-06-19 14:53:42,576 INFO [train.py:996] (1/4) Epoch 3, batch 20450, loss[loss=0.2217, simple_loss=0.2614, pruned_loss=0.09094, over 20204.00 frames. ], tot_loss[loss=0.293, simple_loss=0.357, pruned_loss=0.1145, over 4256525.79 frames. ], batch size: 703, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:54:28,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=488754.0, ans=0.125 2023-06-19 14:54:37,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
limit=22.5 2023-06-19 14:54:54,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=488814.0, ans=0.125 2023-06-19 14:54:59,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=488874.0, ans=0.035 2023-06-19 14:55:15,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.996e+02 3.435e+02 4.162e+02 7.102e+02, threshold=6.869e+02, percent-clipped=0.0 2023-06-19 14:55:22,276 INFO [train.py:996] (1/4) Epoch 3, batch 20500, loss[loss=0.2843, simple_loss=0.3423, pruned_loss=0.1131, over 21842.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3523, pruned_loss=0.1141, over 4266657.53 frames. ], batch size: 351, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:57:05,156 INFO [train.py:996] (1/4) Epoch 3, batch 20550, loss[loss=0.2798, simple_loss=0.3414, pruned_loss=0.1091, over 21274.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3451, pruned_loss=0.1122, over 4257355.27 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:57:58,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-19 14:58:39,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 2.727e+02 3.175e+02 3.881e+02 7.747e+02, threshold=6.350e+02, percent-clipped=1.0 2023-06-19 14:58:45,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-19 14:58:45,940 INFO [train.py:996] (1/4) Epoch 3, batch 20600, loss[loss=0.2723, simple_loss=0.3207, pruned_loss=0.112, over 21346.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3464, pruned_loss=0.1098, over 4267912.79 frames. ], batch size: 176, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:58:47,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=489534.0, ans=0.125 2023-06-19 14:59:30,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=489654.0, ans=0.0 2023-06-19 14:59:50,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=489714.0, ans=0.125 2023-06-19 15:00:21,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489774.0, ans=0.125 2023-06-19 15:00:27,222 INFO [train.py:996] (1/4) Epoch 3, batch 20650, loss[loss=0.2763, simple_loss=0.3349, pruned_loss=0.1088, over 17168.00 frames. ], tot_loss[loss=0.2816, simple_loss=0.3426, pruned_loss=0.1104, over 4272004.11 frames. ], batch size: 60, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:00:38,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=489834.0, ans=0.0 2023-06-19 15:00:40,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.11 vs. 
limit=15.0 2023-06-19 15:00:41,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=489834.0, ans=0.2 2023-06-19 15:01:14,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=489954.0, ans=0.1 2023-06-19 15:01:17,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489954.0, ans=0.125 2023-06-19 15:01:20,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=489954.0, ans=0.125 2023-06-19 15:01:36,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=490014.0, ans=0.125 2023-06-19 15:01:50,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=490014.0, ans=0.2 2023-06-19 15:01:55,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=490074.0, ans=0.0 2023-06-19 15:02:03,113 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.806e+02 3.239e+02 3.703e+02 6.671e+02, threshold=6.478e+02, percent-clipped=1.0 2023-06-19 15:02:10,370 INFO [train.py:996] (1/4) Epoch 3, batch 20700, loss[loss=0.248, simple_loss=0.3146, pruned_loss=0.09072, over 21766.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3329, pruned_loss=0.1055, over 4269017.89 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:02:20,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=490134.0, ans=0.0 2023-06-19 15:02:55,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=490254.0, ans=0.0 2023-06-19 15:03:08,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-19 15:03:34,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=490374.0, ans=0.0 2023-06-19 15:03:36,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=490374.0, ans=0.125 2023-06-19 15:03:50,840 INFO [train.py:996] (1/4) Epoch 3, batch 20750, loss[loss=0.2095, simple_loss=0.2714, pruned_loss=0.0738, over 21789.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3341, pruned_loss=0.1044, over 4266787.67 frames. ], batch size: 118, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:03:53,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=490434.0, ans=0.125 2023-06-19 15:04:14,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=490434.0, ans=0.125 2023-06-19 15:04:18,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.78 vs. 
limit=15.0 2023-06-19 15:05:27,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.202e+02 3.808e+02 5.093e+02 1.097e+03, threshold=7.616e+02, percent-clipped=4.0 2023-06-19 15:05:33,778 INFO [train.py:996] (1/4) Epoch 3, batch 20800, loss[loss=0.3184, simple_loss=0.3482, pruned_loss=0.1443, over 21357.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3391, pruned_loss=0.106, over 4264172.87 frames. ], batch size: 507, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:06:05,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=490794.0, ans=0.125 2023-06-19 15:06:29,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-19 15:06:41,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=490914.0, ans=0.0 2023-06-19 15:06:46,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=490914.0, ans=0.125 2023-06-19 15:07:02,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490974.0, ans=0.1 2023-06-19 15:07:04,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-19 15:07:10,050 INFO [train.py:996] (1/4) Epoch 3, batch 20850, loss[loss=0.3326, simple_loss=0.3718, pruned_loss=0.1467, over 21933.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3298, pruned_loss=0.103, over 4265110.91 frames. ], batch size: 113, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:07:37,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=491094.0, ans=0.0 2023-06-19 15:08:45,558 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.010e+02 3.845e+02 4.738e+02 1.149e+03, threshold=7.690e+02, percent-clipped=6.0 2023-06-19 15:08:56,648 INFO [train.py:996] (1/4) Epoch 3, batch 20900, loss[loss=0.2638, simple_loss=0.3277, pruned_loss=0.09999, over 21539.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3333, pruned_loss=0.1051, over 4271573.09 frames. ], batch size: 195, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:09:49,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=491454.0, ans=0.125 2023-06-19 15:10:05,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=491514.0, ans=0.2 2023-06-19 15:10:30,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=491634.0, ans=0.125 2023-06-19 15:10:31,158 INFO [train.py:996] (1/4) Epoch 3, batch 20950, loss[loss=0.2255, simple_loss=0.2896, pruned_loss=0.08075, over 21436.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3285, pruned_loss=0.09992, over 4278146.56 frames. ], batch size: 194, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:10:36,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. 
limit=12.0 2023-06-19 15:11:11,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=491754.0, ans=0.125 2023-06-19 15:11:26,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=491754.0, ans=0.125 2023-06-19 15:12:00,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=491874.0, ans=0.0 2023-06-19 15:12:04,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.584e+02 3.033e+02 4.070e+02 6.900e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:12:11,005 INFO [train.py:996] (1/4) Epoch 3, batch 21000, loss[loss=0.3036, simple_loss=0.3701, pruned_loss=0.1185, over 21885.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3273, pruned_loss=0.1001, over 4284800.03 frames. ], batch size: 107, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:12:11,005 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 15:12:29,321 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3805, pruned_loss=0.08847, over 1796401.00 frames. 2023-06-19 15:12:29,322 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 15:12:31,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491934.0, ans=0.1 2023-06-19 15:12:45,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=491934.0, ans=0.125 2023-06-19 15:13:57,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=492174.0, ans=0.0 2023-06-19 15:14:04,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=492234.0, ans=0.2 2023-06-19 15:14:05,449 INFO [train.py:996] (1/4) Epoch 3, batch 21050, loss[loss=0.2543, simple_loss=0.2987, pruned_loss=0.1049, over 20230.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3245, pruned_loss=0.1004, over 4277029.04 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:14:45,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492294.0, ans=0.1 2023-06-19 15:14:52,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=492354.0, ans=0.125 2023-06-19 15:15:01,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=492354.0, ans=0.125 2023-06-19 15:15:39,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.805e+02 3.251e+02 3.947e+02 6.448e+02, threshold=6.502e+02, percent-clipped=2.0 2023-06-19 15:15:41,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=492474.0, ans=0.0 2023-06-19 15:15:44,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=492534.0, ans=0.0 2023-06-19 15:15:45,678 INFO [train.py:996] (1/4) Epoch 3, batch 21100, loss[loss=0.2988, simple_loss=0.34, pruned_loss=0.1288, over 21804.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3205, pruned_loss=0.09997, over 4268248.39 frames. 
], batch size: 102, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:15:57,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=492534.0, ans=0.0 2023-06-19 15:16:55,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-19 15:17:03,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=492714.0, ans=0.125 2023-06-19 15:17:27,191 INFO [train.py:996] (1/4) Epoch 3, batch 21150, loss[loss=0.3008, simple_loss=0.3529, pruned_loss=0.1243, over 21807.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3168, pruned_loss=0.09985, over 4262400.24 frames. ], batch size: 98, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:17:44,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=492834.0, ans=0.09899494936611666 2023-06-19 15:18:04,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=492894.0, ans=0.125 2023-06-19 15:18:32,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-19 15:18:34,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-19 15:18:38,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=493014.0, ans=0.125 2023-06-19 15:18:44,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=493014.0, ans=0.2 2023-06-19 15:19:03,369 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.674e+02 3.023e+02 3.632e+02 5.729e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 15:19:13,121 INFO [train.py:996] (1/4) Epoch 3, batch 21200, loss[loss=0.2361, simple_loss=0.2941, pruned_loss=0.08906, over 21303.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3134, pruned_loss=0.0996, over 4248861.16 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:20:29,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=493314.0, ans=0.125 2023-06-19 15:20:44,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=493374.0, ans=0.2 2023-06-19 15:20:49,038 INFO [train.py:996] (1/4) Epoch 3, batch 21250, loss[loss=0.2687, simple_loss=0.3259, pruned_loss=0.1057, over 21290.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3098, pruned_loss=0.09886, over 4257575.74 frames. 
], batch size: 131, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:20:49,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493434.0, ans=0.1 2023-06-19 15:21:23,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493494.0, ans=0.1 2023-06-19 15:21:23,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=493494.0, ans=0.5 2023-06-19 15:21:37,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=493554.0, ans=0.125 2023-06-19 15:21:50,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=493554.0, ans=0.125 2023-06-19 15:21:58,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-19 15:21:59,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=493614.0, ans=0.0 2023-06-19 15:22:09,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=493614.0, ans=0.125 2023-06-19 15:22:24,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.377e+02 3.991e+02 5.475e+02 9.358e+02, threshold=7.981e+02, percent-clipped=20.0 2023-06-19 15:22:25,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-19 15:22:28,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=493734.0, ans=0.125 2023-06-19 15:22:28,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493734.0, ans=0.1 2023-06-19 15:22:29,360 INFO [train.py:996] (1/4) Epoch 3, batch 21300, loss[loss=0.2491, simple_loss=0.3116, pruned_loss=0.09332, over 21883.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3196, pruned_loss=0.1026, over 4269049.93 frames. ], batch size: 124, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:22:49,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=493794.0, ans=0.125 2023-06-19 15:23:12,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.36 vs. limit=22.5 2023-06-19 15:23:18,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=493854.0, ans=0.125 2023-06-19 15:23:29,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=493854.0, ans=0.2 2023-06-19 15:23:44,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. 
limit=15.0 2023-06-19 15:24:04,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493974.0, ans=0.1 2023-06-19 15:24:17,160 INFO [train.py:996] (1/4) Epoch 3, batch 21350, loss[loss=0.2759, simple_loss=0.3657, pruned_loss=0.09307, over 19720.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3232, pruned_loss=0.1033, over 4254968.41 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:24:59,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=15.0 2023-06-19 15:25:06,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-19 15:25:19,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=494154.0, ans=0.125 2023-06-19 15:25:54,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.815e+02 3.178e+02 3.883e+02 6.278e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-19 15:26:10,342 INFO [train.py:996] (1/4) Epoch 3, batch 21400, loss[loss=0.365, simple_loss=0.4118, pruned_loss=0.1591, over 21421.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3269, pruned_loss=0.103, over 4260158.22 frames. ], batch size: 509, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:26:54,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494454.0, ans=0.1 2023-06-19 15:26:59,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=494454.0, ans=0.5 2023-06-19 15:27:00,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494454.0, ans=0.1 2023-06-19 15:27:06,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-19 15:27:10,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=494514.0, ans=0.0 2023-06-19 15:27:21,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=494574.0, ans=0.125 2023-06-19 15:27:45,362 INFO [train.py:996] (1/4) Epoch 3, batch 21450, loss[loss=0.2921, simple_loss=0.3449, pruned_loss=0.1196, over 21471.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3307, pruned_loss=0.1044, over 4266860.69 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:28:09,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494694.0, ans=0.1 2023-06-19 15:28:32,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2023-06-19 15:28:54,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=494814.0, ans=0.04949747468305833 2023-06-19 15:29:21,100 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.895e+02 3.297e+02 3.892e+02 6.030e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-19 15:29:31,461 INFO [train.py:996] (1/4) Epoch 3, batch 21500, loss[loss=0.2674, simple_loss=0.319, pruned_loss=0.1078, over 21795.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3304, pruned_loss=0.1068, over 4257911.32 frames. ], batch size: 351, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:29:36,490 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:30:02,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=494994.0, ans=0.2 2023-06-19 15:30:39,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2023-06-19 15:30:53,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=495174.0, ans=0.0 2023-06-19 15:31:06,789 INFO [train.py:996] (1/4) Epoch 3, batch 21550, loss[loss=0.2725, simple_loss=0.3221, pruned_loss=0.1115, over 21831.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3233, pruned_loss=0.1038, over 4254294.03 frames. ], batch size: 98, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:31:08,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=495234.0, ans=0.125 2023-06-19 15:31:59,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=495354.0, ans=0.125 2023-06-19 15:32:04,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=495354.0, ans=0.0 2023-06-19 15:32:48,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.022e+02 3.623e+02 4.651e+02 8.178e+02, threshold=7.247e+02, percent-clipped=4.0 2023-06-19 15:32:57,580 INFO [train.py:996] (1/4) Epoch 3, batch 21600, loss[loss=0.2339, simple_loss=0.2877, pruned_loss=0.09007, over 21597.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3168, pruned_loss=0.1008, over 4255747.17 frames. ], batch size: 231, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:33:16,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=495534.0, ans=0.0 2023-06-19 15:33:33,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=495594.0, ans=0.0 2023-06-19 15:33:49,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=495654.0, ans=0.0 2023-06-19 15:34:39,688 INFO [train.py:996] (1/4) Epoch 3, batch 21650, loss[loss=0.3674, simple_loss=0.4317, pruned_loss=0.1515, over 21448.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3243, pruned_loss=0.09954, over 4254811.95 frames. 
], batch size: 507, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:35:05,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=495894.0, ans=0.035 2023-06-19 15:35:19,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-19 15:35:34,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=496014.0, ans=0.0 2023-06-19 15:36:17,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=496074.0, ans=0.125 2023-06-19 15:36:18,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.153e+02 4.282e+02 5.472e+02 1.270e+03, threshold=8.565e+02, percent-clipped=5.0 2023-06-19 15:36:20,145 INFO [train.py:996] (1/4) Epoch 3, batch 21700, loss[loss=0.2615, simple_loss=0.3099, pruned_loss=0.1066, over 21379.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3241, pruned_loss=0.09666, over 4257291.08 frames. ], batch size: 194, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:36:39,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=496194.0, ans=0.125 2023-06-19 15:36:56,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496254.0, ans=0.1 2023-06-19 15:37:14,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-19 15:37:20,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-19 15:37:51,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-19 15:37:53,979 INFO [train.py:996] (1/4) Epoch 3, batch 21750, loss[loss=0.248, simple_loss=0.298, pruned_loss=0.09901, over 21847.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3198, pruned_loss=0.09699, over 4244196.65 frames. ], batch size: 352, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:38:29,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=496494.0, ans=0.0 2023-06-19 15:38:39,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=496554.0, ans=0.125 2023-06-19 15:38:42,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=496554.0, ans=0.2 2023-06-19 15:39:34,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.772e+02 3.220e+02 4.013e+02 6.187e+02, threshold=6.439e+02, percent-clipped=0.0 2023-06-19 15:39:41,178 INFO [train.py:996] (1/4) Epoch 3, batch 21800, loss[loss=0.2402, simple_loss=0.2955, pruned_loss=0.09247, over 14925.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3197, pruned_loss=0.09888, over 4226877.33 frames. 
], batch size: 60, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:40:09,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496794.0, ans=0.1 2023-06-19 15:40:40,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=496914.0, ans=0.1 2023-06-19 15:41:08,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=496974.0, ans=0.125 2023-06-19 15:41:23,658 INFO [train.py:996] (1/4) Epoch 3, batch 21850, loss[loss=0.2851, simple_loss=0.3453, pruned_loss=0.1124, over 21845.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3245, pruned_loss=0.09988, over 4231182.53 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:42:10,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=497154.0, ans=0.0 2023-06-19 15:42:10,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-19 15:42:19,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-19 15:42:54,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497274.0, ans=0.1 2023-06-19 15:43:07,044 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.253e+02 3.918e+02 5.054e+02 8.247e+02, threshold=7.836e+02, percent-clipped=6.0 2023-06-19 15:43:08,870 INFO [train.py:996] (1/4) Epoch 3, batch 21900, loss[loss=0.2641, simple_loss=0.3141, pruned_loss=0.107, over 21309.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3274, pruned_loss=0.1023, over 4248254.64 frames. ], batch size: 144, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:43:12,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=497334.0, ans=0.0 2023-06-19 15:43:34,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=497394.0, ans=0.0 2023-06-19 15:43:42,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.41 vs. limit=6.0 2023-06-19 15:44:31,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-19 15:44:49,951 INFO [train.py:996] (1/4) Epoch 3, batch 21950, loss[loss=0.2598, simple_loss=0.3109, pruned_loss=0.1044, over 21794.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3219, pruned_loss=0.1007, over 4249521.56 frames. ], batch size: 107, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:45:16,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-19 15:45:56,914 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:46:30,999 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.810e+02 3.300e+02 3.802e+02 6.470e+02, threshold=6.601e+02, percent-clipped=0.0 2023-06-19 15:46:32,690 INFO [train.py:996] (1/4) Epoch 3, batch 22000, loss[loss=0.2331, simple_loss=0.2947, pruned_loss=0.08573, over 21704.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3156, pruned_loss=0.09726, over 4251776.19 frames. ], batch size: 333, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:46:37,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=497934.0, ans=0.0 2023-06-19 15:46:47,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=12.0 2023-06-19 15:46:52,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=497994.0, ans=0.125 2023-06-19 15:47:33,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=498114.0, ans=0.125 2023-06-19 15:47:47,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=498114.0, ans=0.2 2023-06-19 15:48:13,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=498174.0, ans=0.0 2023-06-19 15:48:16,210 INFO [train.py:996] (1/4) Epoch 3, batch 22050, loss[loss=0.4046, simple_loss=0.4632, pruned_loss=0.173, over 21488.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3206, pruned_loss=0.09874, over 4240558.72 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:48:24,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=498234.0, ans=0.125 2023-06-19 15:48:38,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=498294.0, ans=0.0 2023-06-19 15:49:32,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=498414.0, ans=0.125 2023-06-19 15:49:48,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=498474.0, ans=0.2 2023-06-19 15:49:51,098 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-19 15:49:58,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.520e+02 4.276e+02 5.874e+02 8.679e+02, threshold=8.552e+02, percent-clipped=13.0 2023-06-19 15:49:58,800 INFO [train.py:996] (1/4) Epoch 3, batch 22100, loss[loss=0.2746, simple_loss=0.3268, pruned_loss=0.1112, over 21815.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3326, pruned_loss=0.1044, over 4236316.16 frames. 
], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:50:05,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:50:08,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=498534.0, ans=0.2 2023-06-19 15:51:00,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498714.0, ans=0.1 2023-06-19 15:51:00,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498714.0, ans=0.1 2023-06-19 15:51:38,265 INFO [train.py:996] (1/4) Epoch 3, batch 22150, loss[loss=0.2653, simple_loss=0.321, pruned_loss=0.1048, over 21671.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3324, pruned_loss=0.105, over 4250213.46 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:51:49,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=498834.0, ans=0.125 2023-06-19 15:52:20,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=498954.0, ans=0.125 2023-06-19 15:53:07,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-19 15:53:18,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.814e+02 3.451e+02 4.432e+02 8.221e+02, threshold=6.902e+02, percent-clipped=0.0 2023-06-19 15:53:18,973 INFO [train.py:996] (1/4) Epoch 3, batch 22200, loss[loss=0.2869, simple_loss=0.3363, pruned_loss=0.1187, over 21587.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3357, pruned_loss=0.1068, over 4265875.08 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:53:53,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=499194.0, ans=0.2 2023-06-19 15:53:55,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=499254.0, ans=0.125 2023-06-19 15:54:30,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=499314.0, ans=0.125 2023-06-19 15:54:43,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-19 15:54:50,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=499374.0, ans=0.125 2023-06-19 15:55:01,261 INFO [train.py:996] (1/4) Epoch 3, batch 22250, loss[loss=0.3207, simple_loss=0.3905, pruned_loss=0.1255, over 21761.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3435, pruned_loss=0.1086, over 4275431.44 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:55:21,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. 
limit=22.5 2023-06-19 15:55:42,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=499554.0, ans=0.125 2023-06-19 15:55:57,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=499614.0, ans=0.025 2023-06-19 15:55:59,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=499614.0, ans=0.125 2023-06-19 15:56:41,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.175e+02 3.826e+02 4.848e+02 6.426e+02, threshold=7.653e+02, percent-clipped=0.0 2023-06-19 15:56:41,226 INFO [train.py:996] (1/4) Epoch 3, batch 22300, loss[loss=0.2671, simple_loss=0.3207, pruned_loss=0.1068, over 21936.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3454, pruned_loss=0.1109, over 4280811.64 frames. ], batch size: 316, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:58:21,535 INFO [train.py:996] (1/4) Epoch 3, batch 22350, loss[loss=0.2721, simple_loss=0.3449, pruned_loss=0.09964, over 21710.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.345, pruned_loss=0.112, over 4288105.06 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:58:36,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500094.0, ans=0.1 2023-06-19 15:59:29,143 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:59:51,403 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:00:02,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.767e+02 3.274e+02 4.023e+02 7.731e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-19 16:00:02,215 INFO [train.py:996] (1/4) Epoch 3, batch 22400, loss[loss=0.2693, simple_loss=0.3348, pruned_loss=0.1019, over 21589.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3415, pruned_loss=0.1078, over 4292621.60 frames. ], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:00:02,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=500334.0, ans=0.0 2023-06-19 16:00:03,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-19 16:00:15,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=500334.0, ans=0.2 2023-06-19 16:00:34,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=500394.0, ans=0.2 2023-06-19 16:01:19,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=500514.0, ans=0.125 2023-06-19 16:01:27,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. 
limit=15.0 2023-06-19 16:01:30,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=500574.0, ans=0.125 2023-06-19 16:01:42,902 INFO [train.py:996] (1/4) Epoch 3, batch 22450, loss[loss=0.2441, simple_loss=0.2915, pruned_loss=0.09837, over 21632.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3363, pruned_loss=0.1073, over 4274853.00 frames. ], batch size: 445, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:02:04,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=500694.0, ans=0.95 2023-06-19 16:02:14,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500694.0, ans=0.125 2023-06-19 16:02:55,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=500814.0, ans=0.125 2023-06-19 16:03:24,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=500874.0, ans=0.0 2023-06-19 16:03:27,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 3.119e+02 3.860e+02 5.027e+02 1.347e+03, threshold=7.719e+02, percent-clipped=7.0 2023-06-19 16:03:27,138 INFO [train.py:996] (1/4) Epoch 3, batch 22500, loss[loss=0.2924, simple_loss=0.3835, pruned_loss=0.1006, over 21761.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3299, pruned_loss=0.1068, over 4276891.03 frames. ], batch size: 282, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:03:28,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.38 vs. limit=6.0 2023-06-19 16:03:30,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-19 16:04:12,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=500994.0, ans=0.2 2023-06-19 16:04:15,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=501054.0, ans=0.0 2023-06-19 16:05:00,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=501174.0, ans=0.05 2023-06-19 16:05:00,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=501174.0, ans=0.125 2023-06-19 16:05:10,035 INFO [train.py:996] (1/4) Epoch 3, batch 22550, loss[loss=0.2362, simple_loss=0.3005, pruned_loss=0.08593, over 21479.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3354, pruned_loss=0.1068, over 4274595.24 frames. ], batch size: 211, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:05:14,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-19 16:05:14,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.07 vs. 
limit=15.0 2023-06-19 16:05:55,097 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:06:10,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501354.0, ans=0.125 2023-06-19 16:06:24,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=501414.0, ans=0.2 2023-06-19 16:06:42,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=501474.0, ans=0.5 2023-06-19 16:07:06,024 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.038e+02 3.713e+02 4.829e+02 9.473e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-19 16:07:06,044 INFO [train.py:996] (1/4) Epoch 3, batch 22600, loss[loss=0.2429, simple_loss=0.3065, pruned_loss=0.08965, over 21712.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3372, pruned_loss=0.107, over 4280843.93 frames. ], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:07:34,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=501594.0, ans=0.1 2023-06-19 16:07:38,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=501594.0, ans=0.125 2023-06-19 16:08:01,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-19 16:08:21,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.93 vs. limit=22.5 2023-06-19 16:08:40,605 INFO [train.py:996] (1/4) Epoch 3, batch 22650, loss[loss=0.237, simple_loss=0.2944, pruned_loss=0.08982, over 22029.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.334, pruned_loss=0.1056, over 4280263.77 frames. ], batch size: 103, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:09:03,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=501834.0, ans=0.125 2023-06-19 16:09:46,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=502014.0, ans=0.125 2023-06-19 16:10:23,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.918e+02 3.422e+02 4.341e+02 8.662e+02, threshold=6.843e+02, percent-clipped=1.0 2023-06-19 16:10:23,389 INFO [train.py:996] (1/4) Epoch 3, batch 22700, loss[loss=0.2429, simple_loss=0.296, pruned_loss=0.09487, over 21815.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.326, pruned_loss=0.1047, over 4279098.83 frames. ], batch size: 352, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:10:40,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=502134.0, ans=0.0 2023-06-19 16:10:48,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.82 vs. 
limit=22.5 2023-06-19 16:10:59,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502194.0, ans=0.1 2023-06-19 16:11:20,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502254.0, ans=0.125 2023-06-19 16:11:28,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=502314.0, ans=0.125 2023-06-19 16:11:41,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=502314.0, ans=0.125 2023-06-19 16:11:41,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-19 16:12:10,389 INFO [train.py:996] (1/4) Epoch 3, batch 22750, loss[loss=0.3151, simple_loss=0.3658, pruned_loss=0.1322, over 21541.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.327, pruned_loss=0.1069, over 4275578.20 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:12:37,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=502494.0, ans=0.1 2023-06-19 16:12:56,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502554.0, ans=0.125 2023-06-19 16:13:24,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=502674.0, ans=0.0 2023-06-19 16:13:51,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.384e+02 3.991e+02 5.040e+02 7.219e+02, threshold=7.983e+02, percent-clipped=3.0 2023-06-19 16:13:51,368 INFO [train.py:996] (1/4) Epoch 3, batch 22800, loss[loss=0.2943, simple_loss=0.3413, pruned_loss=0.1236, over 21673.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3324, pruned_loss=0.1102, over 4284974.50 frames. 
], batch size: 441, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:13:51,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=502734.0, ans=0.0 2023-06-19 16:13:56,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502734.0, ans=0.1 2023-06-19 16:14:07,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=502734.0, ans=0.1 2023-06-19 16:14:13,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=502734.0, ans=0.07 2023-06-19 16:14:18,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=502794.0, ans=0.0 2023-06-19 16:14:45,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=502854.0, ans=0.0 2023-06-19 16:14:45,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502854.0, ans=0.125 2023-06-19 16:14:48,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=502914.0, ans=0.0 2023-06-19 16:15:04,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=502974.0, ans=0.0 2023-06-19 16:15:32,969 INFO [train.py:996] (1/4) Epoch 3, batch 22850, loss[loss=0.251, simple_loss=0.3098, pruned_loss=0.09614, over 21497.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3292, pruned_loss=0.1094, over 4275791.51 frames. ], batch size: 131, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:16:33,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=503214.0, ans=0.125 2023-06-19 16:16:39,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=503214.0, ans=0.125 2023-06-19 16:17:16,729 INFO [train.py:996] (1/4) Epoch 3, batch 22900, loss[loss=0.2782, simple_loss=0.3372, pruned_loss=0.1096, over 21770.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3311, pruned_loss=0.1083, over 4267646.06 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:17:18,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.187e+02 3.862e+02 4.458e+02 8.142e+02, threshold=7.724e+02, percent-clipped=1.0 2023-06-19 16:19:04,705 INFO [train.py:996] (1/4) Epoch 3, batch 22950, loss[loss=0.327, simple_loss=0.433, pruned_loss=0.1105, over 21500.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3442, pruned_loss=0.1059, over 4264590.07 frames. 
], batch size: 471, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:19:13,030 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:19:20,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=503634.0, ans=0.125 2023-06-19 16:20:21,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=503814.0, ans=0.95 2023-06-19 16:20:45,475 INFO [train.py:996] (1/4) Epoch 3, batch 23000, loss[loss=0.2727, simple_loss=0.3315, pruned_loss=0.1069, over 21879.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3426, pruned_loss=0.1031, over 4265184.00 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:20:51,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.906e+02 3.294e+02 4.043e+02 6.729e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-19 16:20:55,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=503934.0, ans=0.125 2023-06-19 16:21:13,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=503994.0, ans=0.2 2023-06-19 16:21:31,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=504054.0, ans=0.0 2023-06-19 16:22:05,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504174.0, ans=0.1 2023-06-19 16:22:30,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=504174.0, ans=0.125 2023-06-19 16:22:30,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=504174.0, ans=0.0 2023-06-19 16:22:33,405 INFO [train.py:996] (1/4) Epoch 3, batch 23050, loss[loss=0.2844, simple_loss=0.3614, pruned_loss=0.1037, over 17317.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.346, pruned_loss=0.1065, over 4262984.49 frames. ], batch size: 60, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:23:04,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=504354.0, ans=0.125 2023-06-19 16:23:13,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=504354.0, ans=0.125 2023-06-19 16:23:39,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=504414.0, ans=0.125 2023-06-19 16:24:08,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=504474.0, ans=0.125 2023-06-19 16:24:14,570 INFO [train.py:996] (1/4) Epoch 3, batch 23100, loss[loss=0.2098, simple_loss=0.2684, pruned_loss=0.07564, over 21614.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3405, pruned_loss=0.1066, over 4264357.59 frames. 
], batch size: 282, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:24:16,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.949e+02 3.465e+02 4.322e+02 6.088e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-19 16:24:19,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-19 16:24:34,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.72 vs. limit=22.5 2023-06-19 16:24:53,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=504654.0, ans=0.0 2023-06-19 16:24:54,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=12.0 2023-06-19 16:25:13,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=504714.0, ans=0.125 2023-06-19 16:25:16,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=504714.0, ans=0.125 2023-06-19 16:25:20,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=504714.0, ans=0.2 2023-06-19 16:25:39,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=504774.0, ans=0.125 2023-06-19 16:25:39,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=504774.0, ans=0.05 2023-06-19 16:25:48,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=504834.0, ans=0.2 2023-06-19 16:25:49,774 INFO [train.py:996] (1/4) Epoch 3, batch 23150, loss[loss=0.244, simple_loss=0.3053, pruned_loss=0.0914, over 15671.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.334, pruned_loss=0.1057, over 4264719.18 frames. ], batch size: 60, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:25:54,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504834.0, ans=0.125 2023-06-19 16:26:03,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. limit=10.0 2023-06-19 16:26:57,454 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:27:29,887 INFO [train.py:996] (1/4) Epoch 3, batch 23200, loss[loss=0.2724, simple_loss=0.3434, pruned_loss=0.1007, over 21880.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3327, pruned_loss=0.106, over 4265152.74 frames. 
], batch size: 118, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:27:31,408 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.188e+02 3.765e+02 4.583e+02 7.279e+02, threshold=7.530e+02, percent-clipped=1.0 2023-06-19 16:27:33,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505134.0, ans=0.1 2023-06-19 16:27:45,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=505194.0, ans=0.125 2023-06-19 16:27:47,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=505194.0, ans=0.0 2023-06-19 16:29:11,832 INFO [train.py:996] (1/4) Epoch 3, batch 23250, loss[loss=0.2878, simple_loss=0.3433, pruned_loss=0.1162, over 21888.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3331, pruned_loss=0.108, over 4273950.04 frames. ], batch size: 371, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:29:14,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=505434.0, ans=0.125 2023-06-19 16:29:18,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-19 16:29:20,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=505434.0, ans=0.0 2023-06-19 16:29:37,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=505494.0, ans=0.125 2023-06-19 16:29:46,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=22.5 2023-06-19 16:30:06,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-19 16:30:17,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=505614.0, ans=0.0 2023-06-19 16:30:37,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=505674.0, ans=0.2 2023-06-19 16:30:55,039 INFO [train.py:996] (1/4) Epoch 3, batch 23300, loss[loss=0.2933, simple_loss=0.3851, pruned_loss=0.1007, over 21728.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3416, pruned_loss=0.1108, over 4279039.78 frames. 
], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:30:56,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.151e+02 3.585e+02 4.227e+02 7.319e+02, threshold=7.169e+02, percent-clipped=0.0 2023-06-19 16:31:26,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=505794.0, ans=0.0 2023-06-19 16:31:30,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=505794.0, ans=0.2 2023-06-19 16:31:58,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505914.0, ans=0.0 2023-06-19 16:32:25,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=505974.0, ans=0.125 2023-06-19 16:32:33,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505974.0, ans=0.1 2023-06-19 16:32:38,659 INFO [train.py:996] (1/4) Epoch 3, batch 23350, loss[loss=0.2722, simple_loss=0.3591, pruned_loss=0.09263, over 20777.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.347, pruned_loss=0.1093, over 4273010.32 frames. ], batch size: 607, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:32:58,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=506094.0, ans=0.125 2023-06-19 16:33:17,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=506094.0, ans=0.2 2023-06-19 16:33:36,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=506154.0, ans=0.125 2023-06-19 16:33:53,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-06-19 16:34:21,300 INFO [train.py:996] (1/4) Epoch 3, batch 23400, loss[loss=0.3103, simple_loss=0.3616, pruned_loss=0.1295, over 21930.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3392, pruned_loss=0.1048, over 4276433.34 frames. ], batch size: 333, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:34:22,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.587e+02 3.042e+02 3.768e+02 6.854e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-19 16:34:51,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.95 vs. limit=12.0 2023-06-19 16:35:48,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506574.0, ans=0.1 2023-06-19 16:36:07,906 INFO [train.py:996] (1/4) Epoch 3, batch 23450, loss[loss=0.2682, simple_loss=0.3276, pruned_loss=0.1044, over 21581.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3412, pruned_loss=0.1077, over 4283222.28 frames. 
], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:36:18,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=506634.0, ans=0.04949747468305833 2023-06-19 16:36:25,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=506634.0, ans=0.125 2023-06-19 16:36:35,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-19 16:36:52,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=506754.0, ans=0.125 2023-06-19 16:37:25,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-19 16:37:28,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=506874.0, ans=0.125 2023-06-19 16:37:49,342 INFO [train.py:996] (1/4) Epoch 3, batch 23500, loss[loss=0.2532, simple_loss=0.3102, pruned_loss=0.09808, over 21885.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3401, pruned_loss=0.1088, over 4279914.36 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:37:50,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.271e+02 4.126e+02 5.318e+02 8.868e+02, threshold=8.252e+02, percent-clipped=14.0 2023-06-19 16:37:57,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506934.0, ans=0.1 2023-06-19 16:39:30,984 INFO [train.py:996] (1/4) Epoch 3, batch 23550, loss[loss=0.2338, simple_loss=0.297, pruned_loss=0.08532, over 21780.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3349, pruned_loss=0.1088, over 4274852.05 frames. ], batch size: 112, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:39:31,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507234.0, ans=0.1 2023-06-19 16:39:33,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-19 16:39:48,644 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:39:54,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.30 vs. limit=10.0 2023-06-19 16:40:26,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=507354.0, ans=0.0 2023-06-19 16:40:47,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=507414.0, ans=0.125 2023-06-19 16:41:17,607 INFO [train.py:996] (1/4) Epoch 3, batch 23600, loss[loss=0.3029, simple_loss=0.3687, pruned_loss=0.1186, over 21153.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3372, pruned_loss=0.1094, over 4269054.01 frames. 
], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:41:19,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.135e+02 3.693e+02 4.651e+02 9.053e+02, threshold=7.385e+02, percent-clipped=1.0 2023-06-19 16:42:09,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=507654.0, ans=0.1 2023-06-19 16:42:21,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=507714.0, ans=10.0 2023-06-19 16:42:22,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=507714.0, ans=0.0 2023-06-19 16:42:44,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=507774.0, ans=0.05 2023-06-19 16:43:00,371 INFO [train.py:996] (1/4) Epoch 3, batch 23650, loss[loss=0.2886, simple_loss=0.3591, pruned_loss=0.1091, over 21279.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3374, pruned_loss=0.1082, over 4268795.25 frames. ], batch size: 143, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:43:24,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=507834.0, ans=0.2 2023-06-19 16:43:42,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=507954.0, ans=0.1 2023-06-19 16:44:48,397 INFO [train.py:996] (1/4) Epoch 3, batch 23700, loss[loss=0.2216, simple_loss=0.2969, pruned_loss=0.07317, over 21722.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.34, pruned_loss=0.1069, over 4268199.19 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:44:49,942 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.801e+02 3.226e+02 4.051e+02 6.982e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 16:46:12,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508374.0, ans=0.0 2023-06-19 16:46:29,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=22.5 2023-06-19 16:46:36,720 INFO [train.py:996] (1/4) Epoch 3, batch 23750, loss[loss=0.2892, simple_loss=0.3664, pruned_loss=0.106, over 21575.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3431, pruned_loss=0.1075, over 4269573.16 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:47:05,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=508494.0, ans=0.0 2023-06-19 16:47:30,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-19 16:47:46,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508614.0, ans=0.125 2023-06-19 16:48:21,129 INFO [train.py:996] (1/4) Epoch 3, batch 23800, loss[loss=0.253, simple_loss=0.3036, pruned_loss=0.1012, over 21391.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3398, pruned_loss=0.1044, over 4268272.50 frames. 
], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:48:22,756 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.690e+02 3.256e+02 4.075e+02 6.648e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 16:48:54,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-19 16:48:55,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=508794.0, ans=0.125 2023-06-19 16:50:01,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=508974.0, ans=0.125 2023-06-19 16:50:05,855 INFO [train.py:996] (1/4) Epoch 3, batch 23850, loss[loss=0.2828, simple_loss=0.3495, pruned_loss=0.1081, over 21331.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3518, pruned_loss=0.1086, over 4264247.49 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:50:08,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=509034.0, ans=0.125 2023-06-19 16:50:21,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509034.0, ans=0.125 2023-06-19 16:50:27,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0 2023-06-19 16:51:48,146 INFO [train.py:996] (1/4) Epoch 3, batch 23900, loss[loss=0.2842, simple_loss=0.3506, pruned_loss=0.1089, over 21757.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3591, pruned_loss=0.1113, over 4271696.51 frames. ], batch size: 124, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:51:51,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.185e+02 4.046e+02 5.288e+02 1.128e+03, threshold=8.092e+02, percent-clipped=13.0 2023-06-19 16:51:53,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=509334.0, ans=0.0 2023-06-19 16:52:03,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=509334.0, ans=0.2 2023-06-19 16:52:15,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=509394.0, ans=0.125 2023-06-19 16:52:20,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=509394.0, ans=0.2 2023-06-19 16:52:30,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509454.0, ans=0.125 2023-06-19 16:53:16,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=509574.0, ans=0.125 2023-06-19 16:53:28,828 INFO [train.py:996] (1/4) Epoch 3, batch 23950, loss[loss=0.2748, simple_loss=0.3263, pruned_loss=0.1117, over 21662.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3515, pruned_loss=0.1106, over 4277148.41 frames. 
], batch size: 247, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:53:45,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=509634.0, ans=0.125 2023-06-19 16:53:58,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=509694.0, ans=0.0 2023-06-19 16:54:33,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509754.0, ans=0.1 2023-06-19 16:55:15,642 INFO [train.py:996] (1/4) Epoch 3, batch 24000, loss[loss=0.3359, simple_loss=0.399, pruned_loss=0.1364, over 21466.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3537, pruned_loss=0.1142, over 4279185.19 frames. ], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:55:15,643 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 16:55:25,237 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9697, 3.5437, 3.4506, 1.9996], device='cuda:1') 2023-06-19 16:55:27,369 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.1195, 1.9818, 3.1495, 3.2994], device='cuda:1') 2023-06-19 16:55:31,884 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2855, simple_loss=0.3833, pruned_loss=0.09389, over 1796401.00 frames. 2023-06-19 16:55:31,885 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 16:55:35,239 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.049e+02 3.553e+02 4.728e+02 8.625e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-19 16:55:37,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509934.0, ans=0.1 2023-06-19 16:56:16,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=510054.0, ans=0.125 2023-06-19 16:56:46,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.03 vs. limit=10.0 2023-06-19 16:57:10,303 INFO [train.py:996] (1/4) Epoch 3, batch 24050, loss[loss=0.2274, simple_loss=0.3096, pruned_loss=0.07255, over 21721.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3527, pruned_loss=0.1135, over 4271381.10 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:57:44,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=510294.0, ans=0.0 2023-06-19 16:58:08,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=510414.0, ans=0.125 2023-06-19 16:58:53,195 INFO [train.py:996] (1/4) Epoch 3, batch 24100, loss[loss=0.3036, simple_loss=0.3791, pruned_loss=0.114, over 21280.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3502, pruned_loss=0.1096, over 4271271.46 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:58:56,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.988e+02 3.709e+02 5.089e+02 1.009e+03, threshold=7.417e+02, percent-clipped=9.0 2023-06-19 16:59:38,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=15.0 2023-06-19 17:00:30,536 INFO [train.py:996] (1/4) Epoch 3, batch 24150, loss[loss=0.2935, simple_loss=0.3439, pruned_loss=0.1216, over 21417.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3513, pruned_loss=0.1126, over 4279732.18 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:00:55,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=510894.0, ans=0.5 2023-06-19 17:00:55,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=510894.0, ans=0.05 2023-06-19 17:00:59,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=510894.0, ans=0.125 2023-06-19 17:01:47,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=511074.0, ans=0.0 2023-06-19 17:01:47,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=511074.0, ans=0.04949747468305833 2023-06-19 17:01:47,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-19 17:02:10,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-19 17:02:11,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=511074.0, ans=0.125 2023-06-19 17:02:14,034 INFO [train.py:996] (1/4) Epoch 3, batch 24200, loss[loss=0.2794, simple_loss=0.3345, pruned_loss=0.1121, over 21344.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3566, pruned_loss=0.1153, over 4287749.54 frames. ], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:02:17,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.206e+02 3.739e+02 4.662e+02 8.285e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-19 17:02:47,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=511194.0, ans=0.125 2023-06-19 17:03:05,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=511254.0, ans=0.0 2023-06-19 17:03:52,810 INFO [train.py:996] (1/4) Epoch 3, batch 24250, loss[loss=0.1997, simple_loss=0.2956, pruned_loss=0.05188, over 21696.00 frames. ], tot_loss[loss=0.2826, simple_loss=0.3517, pruned_loss=0.1067, over 4282985.03 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:04:09,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=511494.0, ans=0.125 2023-06-19 17:04:33,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=511554.0, ans=15.0 2023-06-19 17:05:33,891 INFO [train.py:996] (1/4) Epoch 3, batch 24300, loss[loss=0.185, simple_loss=0.27, pruned_loss=0.05006, over 21647.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3443, pruned_loss=0.09989, over 4280509.98 frames. 
], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:05:37,130 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.288e+02 2.786e+02 3.535e+02 7.213e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-19 17:07:15,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=512034.0, ans=0.125 2023-06-19 17:07:16,698 INFO [train.py:996] (1/4) Epoch 3, batch 24350, loss[loss=0.2874, simple_loss=0.3355, pruned_loss=0.1197, over 21439.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3404, pruned_loss=0.1001, over 4289395.65 frames. ], batch size: 194, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:07:39,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-19 17:08:26,347 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:08:40,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=512214.0, ans=0.125 2023-06-19 17:09:05,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=512334.0, ans=0.125 2023-06-19 17:09:06,332 INFO [train.py:996] (1/4) Epoch 3, batch 24400, loss[loss=0.2817, simple_loss=0.3532, pruned_loss=0.1051, over 21705.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.344, pruned_loss=0.1042, over 4282671.65 frames. ], batch size: 351, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:09:08,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=512334.0, ans=0.125 2023-06-19 17:09:09,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 3.270e+02 4.131e+02 5.260e+02 7.879e+02, threshold=8.262e+02, percent-clipped=18.0 2023-06-19 17:09:13,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=512334.0, ans=0.125 2023-06-19 17:09:22,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.10 vs. limit=6.0 2023-06-19 17:09:34,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=512394.0, ans=0.0 2023-06-19 17:09:59,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-19 17:10:09,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-19 17:10:48,968 INFO [train.py:996] (1/4) Epoch 3, batch 24450, loss[loss=0.2235, simple_loss=0.2962, pruned_loss=0.07537, over 21320.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3471, pruned_loss=0.1052, over 4276853.24 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:11:33,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=512754.0, ans=0.125 2023-06-19 17:11:42,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.77 vs. 
limit=22.5 2023-06-19 17:11:47,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=512754.0, ans=0.125 2023-06-19 17:11:54,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-19 17:12:09,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=512874.0, ans=0.125 2023-06-19 17:12:20,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=512874.0, ans=0.125 2023-06-19 17:12:30,295 INFO [train.py:996] (1/4) Epoch 3, batch 24500, loss[loss=0.2901, simple_loss=0.3411, pruned_loss=0.1195, over 21922.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3448, pruned_loss=0.1037, over 4275092.99 frames. ], batch size: 351, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:12:30,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=512934.0, ans=0.125 2023-06-19 17:12:33,619 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.872e+02 3.383e+02 4.151e+02 6.413e+02, threshold=6.766e+02, percent-clipped=0.0 2023-06-19 17:13:49,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=513174.0, ans=0.2 2023-06-19 17:14:07,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513174.0, ans=0.1 2023-06-19 17:14:12,252 INFO [train.py:996] (1/4) Epoch 3, batch 24550, loss[loss=0.3511, simple_loss=0.405, pruned_loss=0.1486, over 21236.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3487, pruned_loss=0.1077, over 4278135.49 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:14:12,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=513234.0, ans=0.0 2023-06-19 17:15:54,302 INFO [train.py:996] (1/4) Epoch 3, batch 24600, loss[loss=0.2526, simple_loss=0.3032, pruned_loss=0.101, over 21244.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3444, pruned_loss=0.1083, over 4274348.41 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:15:57,341 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.888e+02 3.572e+02 4.375e+02 7.058e+02, threshold=7.144e+02, percent-clipped=1.0 2023-06-19 17:17:22,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.29 vs. limit=22.5 2023-06-19 17:17:35,821 INFO [train.py:996] (1/4) Epoch 3, batch 24650, loss[loss=0.2376, simple_loss=0.282, pruned_loss=0.09662, over 21246.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.337, pruned_loss=0.1074, over 4262662.10 frames. 
], batch size: 549, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:18:34,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=514014.0, ans=0.125 2023-06-19 17:18:39,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=514014.0, ans=0.2 2023-06-19 17:18:41,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=514014.0, ans=0.07 2023-06-19 17:18:59,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.10 vs. limit=12.0 2023-06-19 17:19:13,012 INFO [train.py:996] (1/4) Epoch 3, batch 24700, loss[loss=0.2208, simple_loss=0.2979, pruned_loss=0.07187, over 20708.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3338, pruned_loss=0.1042, over 4250971.29 frames. ], batch size: 607, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:19:16,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.157e+02 3.618e+02 4.336e+02 6.867e+02, threshold=7.236e+02, percent-clipped=0.0 2023-06-19 17:19:44,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. limit=10.0 2023-06-19 17:19:45,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=514194.0, ans=0.125 2023-06-19 17:19:48,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=514194.0, ans=0.2 2023-06-19 17:20:08,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=514254.0, ans=0.07 2023-06-19 17:20:39,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=514374.0, ans=0.125 2023-06-19 17:20:55,371 INFO [train.py:996] (1/4) Epoch 3, batch 24750, loss[loss=0.2429, simple_loss=0.2873, pruned_loss=0.09923, over 21306.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3266, pruned_loss=0.1015, over 4258367.14 frames. ], batch size: 144, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:22:03,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-19 17:22:10,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=514614.0, ans=0.125 2023-06-19 17:22:36,149 INFO [train.py:996] (1/4) Epoch 3, batch 24800, loss[loss=0.2966, simple_loss=0.3416, pruned_loss=0.1258, over 21633.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3225, pruned_loss=0.1016, over 4266593.38 frames. ], batch size: 473, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:22:39,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.753e+02 3.144e+02 3.669e+02 5.851e+02, threshold=6.289e+02, percent-clipped=0.0 2023-06-19 17:23:21,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514794.0, ans=0.0 2023-06-19 17:24:19,454 INFO [train.py:996] (1/4) Epoch 3, batch 24850, loss[loss=0.2374, simple_loss=0.3024, pruned_loss=0.08614, over 21646.00 frames. 
], tot_loss[loss=0.2644, simple_loss=0.3232, pruned_loss=0.1028, over 4275143.29 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:25:07,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=515154.0, ans=0.2 2023-06-19 17:25:07,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-19 17:25:16,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515154.0, ans=0.1 2023-06-19 17:25:20,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-19 17:25:25,154 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:25:41,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=515214.0, ans=0.1 2023-06-19 17:26:02,428 INFO [train.py:996] (1/4) Epoch 3, batch 24900, loss[loss=0.2277, simple_loss=0.281, pruned_loss=0.08719, over 21853.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3291, pruned_loss=0.1049, over 4281722.90 frames. ], batch size: 102, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:26:11,121 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.873e+02 3.627e+02 4.468e+02 7.935e+02, threshold=7.253e+02, percent-clipped=5.0 2023-06-19 17:26:33,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=515394.0, ans=0.0 2023-06-19 17:27:10,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=515454.0, ans=0.0 2023-06-19 17:27:14,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=515514.0, ans=0.0 2023-06-19 17:27:32,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=515574.0, ans=0.2 2023-06-19 17:27:57,499 INFO [train.py:996] (1/4) Epoch 3, batch 24950, loss[loss=0.3359, simple_loss=0.4027, pruned_loss=0.1345, over 21839.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3372, pruned_loss=0.1088, over 4281939.61 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:28:06,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=515634.0, ans=0.0 2023-06-19 17:28:07,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=515634.0, ans=0.0 2023-06-19 17:28:14,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=515634.0, ans=0.2 2023-06-19 17:28:16,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=515634.0, ans=0.2 2023-06-19 17:28:31,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. 
limit=12.0 2023-06-19 17:29:07,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=515814.0, ans=0.1 2023-06-19 17:29:46,102 INFO [train.py:996] (1/4) Epoch 3, batch 25000, loss[loss=0.3342, simple_loss=0.3924, pruned_loss=0.138, over 21323.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3436, pruned_loss=0.1115, over 4283935.14 frames. ], batch size: 549, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:29:48,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=515934.0, ans=0.125 2023-06-19 17:29:49,478 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.951e+02 3.694e+02 4.326e+02 9.045e+02, threshold=7.388e+02, percent-clipped=1.0 2023-06-19 17:31:16,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=516174.0, ans=0.125 2023-06-19 17:31:28,963 INFO [train.py:996] (1/4) Epoch 3, batch 25050, loss[loss=0.2537, simple_loss=0.3077, pruned_loss=0.09983, over 21687.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3362, pruned_loss=0.1099, over 4287613.91 frames. ], batch size: 333, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:31:37,835 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.19 vs. limit=6.0 2023-06-19 17:31:42,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=516234.0, ans=0.1 2023-06-19 17:32:38,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=516414.0, ans=0.5 2023-06-19 17:32:48,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=516474.0, ans=0.0 2023-06-19 17:33:10,994 INFO [train.py:996] (1/4) Epoch 3, batch 25100, loss[loss=0.244, simple_loss=0.2951, pruned_loss=0.09651, over 21548.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3293, pruned_loss=0.1073, over 4286804.35 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:33:13,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.071e+02 3.524e+02 4.196e+02 8.233e+02, threshold=7.049e+02, percent-clipped=3.0 2023-06-19 17:33:18,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=516534.0, ans=0.0 2023-06-19 17:33:29,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=516594.0, ans=0.0 2023-06-19 17:33:29,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.66 vs. limit=22.5 2023-06-19 17:33:45,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=516594.0, ans=0.0 2023-06-19 17:34:39,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=516774.0, ans=0.02 2023-06-19 17:34:46,308 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:34:47,465 INFO [train.py:996] (1/4) Epoch 3, batch 25150, loss[loss=0.2438, simple_loss=0.3246, pruned_loss=0.0815, over 21059.00 frames. 
], tot_loss[loss=0.271, simple_loss=0.333, pruned_loss=0.1045, over 4278700.52 frames. ], batch size: 608, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:35:25,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-19 17:36:11,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=517074.0, ans=0.125 2023-06-19 17:36:24,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=517074.0, ans=0.125 2023-06-19 17:36:29,174 INFO [train.py:996] (1/4) Epoch 3, batch 25200, loss[loss=0.2475, simple_loss=0.3411, pruned_loss=0.07694, over 21815.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3312, pruned_loss=0.1015, over 4270099.15 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:36:32,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.662e+02 3.153e+02 4.538e+02 8.599e+02, threshold=6.306e+02, percent-clipped=6.0 2023-06-19 17:36:59,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=517194.0, ans=0.1 2023-06-19 17:37:33,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-19 17:38:08,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=517434.0, ans=0.125 2023-06-19 17:38:10,283 INFO [train.py:996] (1/4) Epoch 3, batch 25250, loss[loss=0.2797, simple_loss=0.3256, pruned_loss=0.1169, over 21443.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3286, pruned_loss=0.1, over 4269655.34 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:39:00,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=517554.0, ans=0.125 2023-06-19 17:39:56,950 INFO [train.py:996] (1/4) Epoch 3, batch 25300, loss[loss=0.2748, simple_loss=0.3542, pruned_loss=0.09768, over 21252.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3274, pruned_loss=0.1, over 4261006.41 frames. ], batch size: 548, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:40:00,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.138e+02 3.706e+02 4.437e+02 8.805e+02, threshold=7.413e+02, percent-clipped=6.0 2023-06-19 17:40:00,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=517734.0, ans=0.125 2023-06-19 17:40:01,188 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-19 17:40:36,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=517854.0, ans=0.125 2023-06-19 17:40:38,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=517854.0, ans=0.125 2023-06-19 17:40:59,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=15.0 2023-06-19 17:41:40,338 INFO [train.py:996] (1/4) Epoch 3, batch 25350, loss[loss=0.2127, simple_loss=0.2957, pruned_loss=0.06482, over 21402.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3302, pruned_loss=0.09981, over 4251249.75 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:41:50,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=518034.0, ans=0.125 2023-06-19 17:42:25,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518154.0, ans=0.1 2023-06-19 17:43:21,807 INFO [train.py:996] (1/4) Epoch 3, batch 25400, loss[loss=0.2527, simple_loss=0.3038, pruned_loss=0.1008, over 21585.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3242, pruned_loss=0.09883, over 4242334.57 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:43:24,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.818e+02 3.409e+02 4.580e+02 8.063e+02, threshold=6.817e+02, percent-clipped=2.0 2023-06-19 17:43:43,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=12.0 2023-06-19 17:44:13,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=518514.0, ans=0.0 2023-06-19 17:44:24,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=518514.0, ans=0.1 2023-06-19 17:45:02,513 INFO [train.py:996] (1/4) Epoch 3, batch 25450, loss[loss=0.2746, simple_loss=0.3471, pruned_loss=0.101, over 21391.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3252, pruned_loss=0.1012, over 4244267.52 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:45:07,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=518634.0, ans=0.1 2023-06-19 17:45:11,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=518634.0, ans=0.125 2023-06-19 17:45:11,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=518634.0, ans=0.0 2023-06-19 17:45:21,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-19 17:45:22,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=518694.0, ans=0.125 2023-06-19 17:45:40,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=518754.0, ans=0.0 2023-06-19 17:46:46,181 INFO [train.py:996] (1/4) Epoch 3, batch 25500, loss[loss=0.2041, simple_loss=0.2979, pruned_loss=0.05514, over 21592.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3246, pruned_loss=0.09711, over 4242185.61 frames. 
], batch size: 263, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:46:49,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.642e+02 3.063e+02 3.580e+02 7.751e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-19 17:46:54,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=518934.0, ans=0.125 2023-06-19 17:46:57,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=518934.0, ans=0.125 2023-06-19 17:47:22,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519054.0, ans=0.1 2023-06-19 17:47:34,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=519054.0, ans=0.125 2023-06-19 17:47:50,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=519114.0, ans=0.125 2023-06-19 17:47:57,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=519114.0, ans=0.2 2023-06-19 17:48:23,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519174.0, ans=0.1 2023-06-19 17:48:31,146 INFO [train.py:996] (1/4) Epoch 3, batch 25550, loss[loss=0.2675, simple_loss=0.3611, pruned_loss=0.08689, over 21879.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3319, pruned_loss=0.09764, over 4242509.69 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:49:11,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=519354.0, ans=0.125 2023-06-19 17:49:19,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=519354.0, ans=0.0 2023-06-19 17:50:07,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=519474.0, ans=0.125 2023-06-19 17:50:20,659 INFO [train.py:996] (1/4) Epoch 3, batch 25600, loss[loss=0.2887, simple_loss=0.3517, pruned_loss=0.1129, over 21501.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3361, pruned_loss=0.09796, over 4253221.27 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:50:23,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.720e+02 3.211e+02 3.853e+02 6.629e+02, threshold=6.421e+02, percent-clipped=1.0 2023-06-19 17:50:50,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519594.0, ans=0.1 2023-06-19 17:51:48,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=519774.0, ans=0.125 2023-06-19 17:51:57,679 INFO [train.py:996] (1/4) Epoch 3, batch 25650, loss[loss=0.2878, simple_loss=0.3337, pruned_loss=0.121, over 21800.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.336, pruned_loss=0.1009, over 4260037.95 frames. ], batch size: 371, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:52:39,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. 
limit=10.0 2023-06-19 17:53:02,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=520014.0, ans=0.0 2023-06-19 17:53:24,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=520074.0, ans=0.0 2023-06-19 17:53:33,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=520074.0, ans=0.125 2023-06-19 17:53:39,104 INFO [train.py:996] (1/4) Epoch 3, batch 25700, loss[loss=0.2597, simple_loss=0.3054, pruned_loss=0.107, over 21125.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.334, pruned_loss=0.1034, over 4267086.44 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:53:46,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.093e+02 3.779e+02 4.609e+02 9.934e+02, threshold=7.559e+02, percent-clipped=6.0 2023-06-19 17:54:54,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=520314.0, ans=0.09899494936611666 2023-06-19 17:54:56,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=520314.0, ans=0.125 2023-06-19 17:55:05,922 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.533e-03 2023-06-19 17:55:28,409 INFO [train.py:996] (1/4) Epoch 3, batch 25750, loss[loss=0.2694, simple_loss=0.3351, pruned_loss=0.1019, over 19998.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3416, pruned_loss=0.1072, over 4272935.13 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:55:41,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=520434.0, ans=0.0 2023-06-19 17:56:13,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=520494.0, ans=0.05 2023-06-19 17:56:21,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-19 17:57:00,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=520674.0, ans=0.0 2023-06-19 17:57:00,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=520674.0, ans=10.0 2023-06-19 17:57:14,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520734.0, ans=0.1 2023-06-19 17:57:15,531 INFO [train.py:996] (1/4) Epoch 3, batch 25800, loss[loss=0.3575, simple_loss=0.4047, pruned_loss=0.1551, over 21311.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3547, pruned_loss=0.1125, over 4272943.76 frames. 
], batch size: 507, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:57:25,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.639e+02 4.483e+02 6.036e+02 1.254e+03, threshold=8.967e+02, percent-clipped=11.0 2023-06-19 17:57:49,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=520794.0, ans=0.125 2023-06-19 17:58:30,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=520914.0, ans=0.125 2023-06-19 17:59:06,165 INFO [train.py:996] (1/4) Epoch 3, batch 25850, loss[loss=0.2994, simple_loss=0.3594, pruned_loss=0.1197, over 21836.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3558, pruned_loss=0.1118, over 4274259.44 frames. ], batch size: 107, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:59:17,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=521034.0, ans=15.0 2023-06-19 17:59:54,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=521154.0, ans=0.125 2023-06-19 18:00:56,487 INFO [train.py:996] (1/4) Epoch 3, batch 25900, loss[loss=0.3439, simple_loss=0.4173, pruned_loss=0.1353, over 21737.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3577, pruned_loss=0.1126, over 4279122.72 frames. ], batch size: 247, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:01:01,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 3.101e+02 3.467e+02 4.368e+02 8.294e+02, threshold=6.933e+02, percent-clipped=0.0 2023-06-19 18:01:05,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-06-19 18:02:36,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=521574.0, ans=0.0 2023-06-19 18:02:39,548 INFO [train.py:996] (1/4) Epoch 3, batch 25950, loss[loss=0.2917, simple_loss=0.3632, pruned_loss=0.1101, over 21712.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3621, pruned_loss=0.115, over 4281237.81 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:02:51,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=521634.0, ans=0.125 2023-06-19 18:03:26,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=521754.0, ans=0.2 2023-06-19 18:03:58,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521814.0, ans=0.125 2023-06-19 18:04:09,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-19 18:04:18,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-19 18:04:24,049 INFO [train.py:996] (1/4) Epoch 3, batch 26000, loss[loss=0.3368, simple_loss=0.389, pruned_loss=0.1423, over 21347.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3618, pruned_loss=0.1129, over 4282041.25 frames. 
], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:04:32,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=521934.0, ans=0.125 2023-06-19 18:04:35,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.072e+02 3.699e+02 4.692e+02 7.013e+02, threshold=7.398e+02, percent-clipped=1.0 2023-06-19 18:04:47,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-19 18:05:05,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=522054.0, ans=0.0 2023-06-19 18:05:39,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=522114.0, ans=0.125 2023-06-19 18:06:05,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=522234.0, ans=0.2 2023-06-19 18:06:06,926 INFO [train.py:996] (1/4) Epoch 3, batch 26050, loss[loss=0.294, simple_loss=0.3435, pruned_loss=0.1222, over 21960.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3608, pruned_loss=0.1139, over 4280519.88 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:06:26,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522294.0, ans=0.125 2023-06-19 18:06:36,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=522294.0, ans=0.125 2023-06-19 18:07:38,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=522474.0, ans=0.125 2023-06-19 18:07:49,530 INFO [train.py:996] (1/4) Epoch 3, batch 26100, loss[loss=0.2572, simple_loss=0.3141, pruned_loss=0.1002, over 21863.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3564, pruned_loss=0.1138, over 4286673.97 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:07:59,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=522534.0, ans=0.09899494936611666 2023-06-19 18:08:01,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.379e+02 4.537e+02 7.018e+02, threshold=6.758e+02, percent-clipped=0.0 2023-06-19 18:08:56,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=522654.0, ans=0.125 2023-06-19 18:09:07,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=22.5 2023-06-19 18:09:31,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=522774.0, ans=0.0 2023-06-19 18:09:39,636 INFO [train.py:996] (1/4) Epoch 3, batch 26150, loss[loss=0.298, simple_loss=0.3576, pruned_loss=0.1192, over 21321.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3522, pruned_loss=0.1132, over 4291023.58 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:10:00,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.34 vs. 
limit=12.0 2023-06-19 18:10:02,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-19 18:11:18,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=523074.0, ans=0.0 2023-06-19 18:11:23,925 INFO [train.py:996] (1/4) Epoch 3, batch 26200, loss[loss=0.2956, simple_loss=0.3864, pruned_loss=0.1024, over 21749.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3535, pruned_loss=0.1114, over 4288746.43 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:11:30,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.095e+02 3.569e+02 4.232e+02 6.752e+02, threshold=7.138e+02, percent-clipped=0.0 2023-06-19 18:12:39,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=523314.0, ans=0.0 2023-06-19 18:13:06,878 INFO [train.py:996] (1/4) Epoch 3, batch 26250, loss[loss=0.2535, simple_loss=0.3241, pruned_loss=0.09144, over 21842.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3565, pruned_loss=0.1091, over 4281677.71 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:13:12,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=523434.0, ans=0.0 2023-06-19 18:14:13,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=523614.0, ans=0.125 2023-06-19 18:14:23,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=523614.0, ans=0.0 2023-06-19 18:14:44,539 INFO [train.py:996] (1/4) Epoch 3, batch 26300, loss[loss=0.2797, simple_loss=0.3339, pruned_loss=0.1128, over 21628.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3533, pruned_loss=0.1102, over 4289173.25 frames. ], batch size: 212, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:14:46,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=523734.0, ans=0.0 2023-06-19 18:14:51,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.134e+02 3.781e+02 4.659e+02 7.680e+02, threshold=7.563e+02, percent-clipped=3.0 2023-06-19 18:15:29,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=523794.0, ans=0.1 2023-06-19 18:16:09,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=523914.0, ans=0.2 2023-06-19 18:16:39,669 INFO [train.py:996] (1/4) Epoch 3, batch 26350, loss[loss=0.3004, simple_loss=0.3588, pruned_loss=0.121, over 21618.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3517, pruned_loss=0.1114, over 4287962.88 frames. 
], batch size: 263, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:16:43,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=524034.0, ans=0.125 2023-06-19 18:17:33,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=524154.0, ans=0.1 2023-06-19 18:17:34,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=524154.0, ans=0.125 2023-06-19 18:18:01,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=524274.0, ans=0.04949747468305833 2023-06-19 18:18:05,338 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:18:16,283 INFO [train.py:996] (1/4) Epoch 3, batch 26400, loss[loss=0.2596, simple_loss=0.306, pruned_loss=0.1066, over 21730.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3459, pruned_loss=0.111, over 4270895.09 frames. ], batch size: 112, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:18:28,546 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.857e+02 3.384e+02 4.347e+02 8.285e+02, threshold=6.769e+02, percent-clipped=0.0 2023-06-19 18:18:30,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524334.0, ans=0.0 2023-06-19 18:18:32,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524334.0, ans=0.1 2023-06-19 18:18:40,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=524334.0, ans=0.125 2023-06-19 18:19:26,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=524514.0, ans=0.1 2023-06-19 18:20:12,385 INFO [train.py:996] (1/4) Epoch 3, batch 26450, loss[loss=0.3032, simple_loss=0.4216, pruned_loss=0.09235, over 20757.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3454, pruned_loss=0.1102, over 4270470.74 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:20:32,009 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-19 18:20:32,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=524694.0, ans=0.125 2023-06-19 18:20:51,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-19 18:21:03,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=524754.0, ans=0.125 2023-06-19 18:21:23,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=524814.0, ans=0.125 2023-06-19 18:21:56,760 INFO [train.py:996] (1/4) Epoch 3, batch 26500, loss[loss=0.3522, simple_loss=0.4128, pruned_loss=0.1458, over 21668.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3504, pruned_loss=0.1098, over 4264628.17 frames. 
], batch size: 441, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:22:04,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.354e+02 4.139e+02 5.566e+02 7.518e+02, threshold=8.277e+02, percent-clipped=7.0 2023-06-19 18:22:40,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=10.0 2023-06-19 18:23:39,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=525174.0, ans=0.1 2023-06-19 18:23:42,726 INFO [train.py:996] (1/4) Epoch 3, batch 26550, loss[loss=0.2112, simple_loss=0.273, pruned_loss=0.07468, over 21264.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3462, pruned_loss=0.1051, over 4262630.58 frames. ], batch size: 176, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:23:43,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=525234.0, ans=0.125 2023-06-19 18:24:03,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=12.0 2023-06-19 18:24:03,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=525294.0, ans=12.0 2023-06-19 18:25:22,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=525474.0, ans=0.0 2023-06-19 18:25:30,078 INFO [train.py:996] (1/4) Epoch 3, batch 26600, loss[loss=0.2657, simple_loss=0.3274, pruned_loss=0.102, over 21576.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3452, pruned_loss=0.1018, over 4262943.71 frames. ], batch size: 263, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:25:35,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=525534.0, ans=0.125 2023-06-19 18:25:38,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.044e+02 3.646e+02 4.264e+02 8.431e+02, threshold=7.292e+02, percent-clipped=1.0 2023-06-19 18:25:50,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=525594.0, ans=0.125 2023-06-19 18:27:08,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=525774.0, ans=0.125 2023-06-19 18:27:13,235 INFO [train.py:996] (1/4) Epoch 3, batch 26650, loss[loss=0.1959, simple_loss=0.2592, pruned_loss=0.0663, over 21127.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3374, pruned_loss=0.1003, over 4248186.62 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:27:43,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525894.0, ans=0.1 2023-06-19 18:27:51,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=525894.0, ans=0.07 2023-06-19 18:28:29,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=22.5 2023-06-19 18:28:48,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=526074.0, ans=0.1 2023-06-19 18:28:55,360 INFO [train.py:996] (1/4) Epoch 3, batch 26700, loss[loss=0.2948, simple_loss=0.35, pruned_loss=0.1198, over 21904.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3281, pruned_loss=0.096, over 4251365.11 frames. ], batch size: 107, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:29:00,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=526134.0, ans=0.0 2023-06-19 18:29:03,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 2.681e+02 3.249e+02 4.280e+02 9.861e+02, threshold=6.499e+02, percent-clipped=1.0 2023-06-19 18:29:58,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=526314.0, ans=0.1 2023-06-19 18:30:38,036 INFO [train.py:996] (1/4) Epoch 3, batch 26750, loss[loss=0.3012, simple_loss=0.3699, pruned_loss=0.1163, over 21322.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3273, pruned_loss=0.0953, over 4261553.11 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:30:41,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=526434.0, ans=0.0 2023-06-19 18:31:03,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=526434.0, ans=0.125 2023-06-19 18:31:32,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=526554.0, ans=0.0 2023-06-19 18:31:47,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=526554.0, ans=0.0 2023-06-19 18:32:06,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=526674.0, ans=0.125 2023-06-19 18:32:09,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=526674.0, ans=0.125 2023-06-19 18:32:35,828 INFO [train.py:996] (1/4) Epoch 3, batch 26800, loss[loss=0.3136, simple_loss=0.3707, pruned_loss=0.1283, over 20684.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3375, pruned_loss=0.1018, over 4268805.98 frames. ], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:32:49,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.066e+02 3.643e+02 4.361e+02 8.068e+02, threshold=7.286e+02, percent-clipped=5.0 2023-06-19 18:33:06,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=526794.0, ans=0.0 2023-06-19 18:33:40,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=526914.0, ans=0.125 2023-06-19 18:33:47,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=526914.0, ans=0.125 2023-06-19 18:33:49,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=526914.0, ans=0.125 2023-06-19 18:34:23,767 INFO [train.py:996] (1/4) Epoch 3, batch 26850, loss[loss=0.2181, simple_loss=0.2771, pruned_loss=0.07951, over 21641.00 frames. 
], tot_loss[loss=0.2737, simple_loss=0.3386, pruned_loss=0.1044, over 4270859.35 frames. ], batch size: 247, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:35:21,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=527214.0, ans=0.125 2023-06-19 18:35:29,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=527214.0, ans=0.125 2023-06-19 18:35:51,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=527274.0, ans=0.2 2023-06-19 18:35:55,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=527274.0, ans=0.125 2023-06-19 18:36:05,753 INFO [train.py:996] (1/4) Epoch 3, batch 26900, loss[loss=0.2997, simple_loss=0.3211, pruned_loss=0.1391, over 21517.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3303, pruned_loss=0.1031, over 4273667.33 frames. ], batch size: 512, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:36:11,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527334.0, ans=0.1 2023-06-19 18:36:14,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.946e+02 3.321e+02 4.106e+02 6.345e+02, threshold=6.642e+02, percent-clipped=0.0 2023-06-19 18:36:40,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=527454.0, ans=0.0 2023-06-19 18:36:51,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=527454.0, ans=0.125 2023-06-19 18:37:02,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=527514.0, ans=0.125 2023-06-19 18:37:25,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=527574.0, ans=0.0 2023-06-19 18:37:49,033 INFO [train.py:996] (1/4) Epoch 3, batch 26950, loss[loss=0.3004, simple_loss=0.3957, pruned_loss=0.1025, over 20803.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3299, pruned_loss=0.1035, over 4276813.48 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:37:49,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=527634.0, ans=0.125 2023-06-19 18:38:07,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-19 18:38:16,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-19 18:38:28,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=527754.0, ans=0.2 2023-06-19 18:38:38,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=527754.0, ans=0.125 2023-06-19 18:39:14,819 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:39:32,573 INFO [train.py:996] (1/4) Epoch 3, batch 27000, loss[loss=0.2669, simple_loss=0.3677, pruned_loss=0.08304, over 20760.00 frames. 
], tot_loss[loss=0.266, simple_loss=0.33, pruned_loss=0.1009, over 4266918.53 frames. ], batch size: 608, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:39:32,573 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 18:39:49,113 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.2602, simple_loss=0.3579, pruned_loss=0.0813, over 1796401.00 frames. 2023-06-19 18:39:49,113 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 18:39:59,035 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.939e+02 3.560e+02 4.603e+02 8.017e+02, threshold=7.120e+02, percent-clipped=5.0 2023-06-19 18:40:27,536 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:40:32,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=528054.0, ans=0.0 2023-06-19 18:40:44,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=528054.0, ans=0.2 2023-06-19 18:40:56,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=528114.0, ans=0.0 2023-06-19 18:41:33,636 INFO [train.py:996] (1/4) Epoch 3, batch 27050, loss[loss=0.2697, simple_loss=0.3385, pruned_loss=0.1005, over 21932.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3318, pruned_loss=0.09757, over 4271567.65 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:41:50,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=528234.0, ans=0.2 2023-06-19 18:42:04,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-19 18:42:08,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=528294.0, ans=0.015 2023-06-19 18:42:39,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=528414.0, ans=0.125 2023-06-19 18:43:16,425 INFO [train.py:996] (1/4) Epoch 3, batch 27100, loss[loss=0.2999, simple_loss=0.3825, pruned_loss=0.1087, over 21828.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3344, pruned_loss=0.09952, over 4272276.62 frames. 
], batch size: 371, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:43:21,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=528534.0, ans=0.125 2023-06-19 18:43:30,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.773e+02 3.198e+02 4.013e+02 8.418e+02, threshold=6.395e+02, percent-clipped=2.0 2023-06-19 18:43:40,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=528594.0, ans=0.0 2023-06-19 18:43:52,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=528594.0, ans=0.2 2023-06-19 18:44:04,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=528654.0, ans=0.0 2023-06-19 18:44:52,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=528774.0, ans=0.125 2023-06-19 18:45:00,951 INFO [train.py:996] (1/4) Epoch 3, batch 27150, loss[loss=0.2847, simple_loss=0.3709, pruned_loss=0.09923, over 21796.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.347, pruned_loss=0.1041, over 4278495.54 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:45:51,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=528954.0, ans=0.125 2023-06-19 18:46:10,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=529014.0, ans=0.0 2023-06-19 18:46:20,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-19 18:46:36,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=529074.0, ans=0.0 2023-06-19 18:46:49,405 INFO [train.py:996] (1/4) Epoch 3, batch 27200, loss[loss=0.258, simple_loss=0.3249, pruned_loss=0.09558, over 20058.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3546, pruned_loss=0.1063, over 4272452.62 frames. ], batch size: 703, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:46:59,294 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.363e+02 3.936e+02 4.684e+02 8.685e+02, threshold=7.872e+02, percent-clipped=10.0 2023-06-19 18:47:11,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=529194.0, ans=0.2 2023-06-19 18:47:55,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=529314.0, ans=0.125 2023-06-19 18:48:22,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=529374.0, ans=0.2 2023-06-19 18:48:31,953 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:48:33,207 INFO [train.py:996] (1/4) Epoch 3, batch 27250, loss[loss=0.2916, simple_loss=0.3516, pruned_loss=0.1158, over 21708.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3588, pruned_loss=0.1114, over 4278047.69 frames. 
], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:49:32,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=529554.0, ans=0.2 2023-06-19 18:49:55,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=529614.0, ans=0.125 2023-06-19 18:50:09,696 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-19 18:50:28,578 INFO [train.py:996] (1/4) Epoch 3, batch 27300, loss[loss=0.2486, simple_loss=0.3258, pruned_loss=0.08566, over 21497.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3601, pruned_loss=0.1116, over 4278380.46 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:50:32,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=529734.0, ans=0.1 2023-06-19 18:50:43,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.140e+02 3.530e+02 4.339e+02 7.752e+02, threshold=7.060e+02, percent-clipped=0.0 2023-06-19 18:50:52,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=529794.0, ans=0.125 2023-06-19 18:51:20,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-19 18:51:22,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=529854.0, ans=0.2 2023-06-19 18:51:25,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=529854.0, ans=0.125 2023-06-19 18:51:49,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=529974.0, ans=0.0 2023-06-19 18:51:50,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=529974.0, ans=0.125 2023-06-19 18:51:51,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=529974.0, ans=0.1 2023-06-19 18:52:17,057 INFO [train.py:996] (1/4) Epoch 3, batch 27350, loss[loss=0.2599, simple_loss=0.3388, pruned_loss=0.09054, over 21832.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3628, pruned_loss=0.1133, over 4276344.82 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:53:58,591 INFO [train.py:996] (1/4) Epoch 3, batch 27400, loss[loss=0.2635, simple_loss=0.3199, pruned_loss=0.1036, over 21774.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3592, pruned_loss=0.1123, over 4275999.80 frames. 
], batch size: 351, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:54:09,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.048e+02 3.444e+02 4.008e+02 7.916e+02, threshold=6.888e+02, percent-clipped=1.0 2023-06-19 18:54:28,276 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:54:34,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=530454.0, ans=0.0 2023-06-19 18:55:07,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=530514.0, ans=0.125 2023-06-19 18:55:30,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=530574.0, ans=0.1 2023-06-19 18:55:39,639 INFO [train.py:996] (1/4) Epoch 3, batch 27450, loss[loss=0.2616, simple_loss=0.3112, pruned_loss=0.106, over 22013.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3522, pruned_loss=0.1101, over 4278093.05 frames. ], batch size: 103, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:55:40,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=530634.0, ans=0.2 2023-06-19 18:56:14,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-19 18:56:20,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=530754.0, ans=0.125 2023-06-19 18:56:47,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=530814.0, ans=0.0 2023-06-19 18:56:49,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.96 vs. limit=22.5 2023-06-19 18:56:52,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530814.0, ans=0.1 2023-06-19 18:57:21,064 INFO [train.py:996] (1/4) Epoch 3, batch 27500, loss[loss=0.2682, simple_loss=0.3185, pruned_loss=0.1089, over 21850.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3507, pruned_loss=0.1106, over 4277914.50 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:57:32,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.096e+02 3.719e+02 4.715e+02 7.955e+02, threshold=7.439e+02, percent-clipped=2.0 2023-06-19 18:57:40,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-19 18:57:46,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=530994.0, ans=0.5 2023-06-19 18:58:00,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=531054.0, ans=0.0 2023-06-19 18:58:28,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531114.0, ans=0.125 2023-06-19 18:59:01,846 INFO [train.py:996] (1/4) Epoch 3, batch 27550, loss[loss=0.2166, simple_loss=0.2895, pruned_loss=0.07186, over 21586.00 frames. 
], tot_loss[loss=0.2777, simple_loss=0.3435, pruned_loss=0.1059, over 4282784.87 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:59:14,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-19 18:59:38,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-19 19:00:25,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=15.0 2023-06-19 19:00:42,095 INFO [train.py:996] (1/4) Epoch 3, batch 27600, loss[loss=0.3841, simple_loss=0.4729, pruned_loss=0.1476, over 19903.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3377, pruned_loss=0.105, over 4276591.29 frames. ], batch size: 702, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:00:53,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.652e+02 3.386e+02 4.273e+02 7.001e+02, threshold=6.773e+02, percent-clipped=0.0 2023-06-19 19:01:45,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=531714.0, ans=0.125 2023-06-19 19:01:51,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=531714.0, ans=0.1 2023-06-19 19:01:52,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=531714.0, ans=0.125 2023-06-19 19:02:03,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=531774.0, ans=0.0 2023-06-19 19:02:23,370 INFO [train.py:996] (1/4) Epoch 3, batch 27650, loss[loss=0.2436, simple_loss=0.2978, pruned_loss=0.09473, over 21662.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3311, pruned_loss=0.1043, over 4277847.57 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:03:06,104 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:03:11,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=531954.0, ans=0.125 2023-06-19 19:04:05,891 INFO [train.py:996] (1/4) Epoch 3, batch 27700, loss[loss=0.2936, simple_loss=0.3586, pruned_loss=0.1143, over 21733.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.331, pruned_loss=0.1022, over 4287118.57 frames. 
], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:04:17,179 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.929e+02 3.310e+02 4.348e+02 7.080e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-19 19:04:30,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=532194.0, ans=0.125 2023-06-19 19:04:38,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=532194.0, ans=0.125 2023-06-19 19:05:03,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=532314.0, ans=0.125 2023-06-19 19:05:10,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=532314.0, ans=0.0 2023-06-19 19:05:47,562 INFO [train.py:996] (1/4) Epoch 3, batch 27750, loss[loss=0.2493, simple_loss=0.3281, pruned_loss=0.0853, over 21772.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3342, pruned_loss=0.1019, over 4280325.01 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:05:51,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=532434.0, ans=0.125 2023-06-19 19:05:59,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=8.0 2023-06-19 19:06:02,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=532494.0, ans=0.125 2023-06-19 19:06:09,195 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:06:28,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=532554.0, ans=0.125 2023-06-19 19:06:45,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-19 19:07:29,249 INFO [train.py:996] (1/4) Epoch 3, batch 27800, loss[loss=0.2658, simple_loss=0.3217, pruned_loss=0.105, over 21863.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3319, pruned_loss=0.102, over 4287446.33 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:07:40,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.741e+02 3.231e+02 4.040e+02 7.271e+02, threshold=6.461e+02, percent-clipped=1.0 2023-06-19 19:08:37,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=532914.0, ans=0.0 2023-06-19 19:09:12,017 INFO [train.py:996] (1/4) Epoch 3, batch 27850, loss[loss=0.2555, simple_loss=0.3353, pruned_loss=0.08785, over 21454.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3314, pruned_loss=0.1031, over 4291869.44 frames. 
], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:09:19,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=533034.0, ans=0.125 2023-06-19 19:09:22,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=533034.0, ans=0.0 2023-06-19 19:09:32,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=533094.0, ans=0.0 2023-06-19 19:10:03,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-19 19:10:30,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=533214.0, ans=0.0 2023-06-19 19:10:58,474 INFO [train.py:996] (1/4) Epoch 3, batch 27900, loss[loss=0.3934, simple_loss=0.4642, pruned_loss=0.1613, over 21450.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3416, pruned_loss=0.1052, over 4294278.42 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:11:06,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=533334.0, ans=0.1 2023-06-19 19:11:16,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.014e+02 3.653e+02 4.966e+02 8.433e+02, threshold=7.306e+02, percent-clipped=7.0 2023-06-19 19:11:16,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=533334.0, ans=0.0 2023-06-19 19:11:31,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-19 19:11:50,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=533454.0, ans=0.125 2023-06-19 19:12:08,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=533514.0, ans=0.125 2023-06-19 19:12:47,270 INFO [train.py:996] (1/4) Epoch 3, batch 27950, loss[loss=0.3075, simple_loss=0.366, pruned_loss=0.1245, over 21365.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3418, pruned_loss=0.1009, over 4286176.91 frames. ], batch size: 131, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:14:09,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=533874.0, ans=0.0 2023-06-19 19:14:30,699 INFO [train.py:996] (1/4) Epoch 3, batch 28000, loss[loss=0.2588, simple_loss=0.3133, pruned_loss=0.1022, over 21449.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3386, pruned_loss=0.09817, over 4285807.55 frames. 
], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:14:48,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.713e+02 3.305e+02 4.071e+02 8.310e+02, threshold=6.609e+02, percent-clipped=2.0 2023-06-19 19:14:56,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=533994.0, ans=0.125 2023-06-19 19:15:34,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534114.0, ans=0.1 2023-06-19 19:15:49,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-19 19:15:52,197 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:15:59,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=534174.0, ans=10.0 2023-06-19 19:16:07,046 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:16:20,854 INFO [train.py:996] (1/4) Epoch 3, batch 28050, loss[loss=0.292, simple_loss=0.3658, pruned_loss=0.1091, over 21275.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3366, pruned_loss=0.09981, over 4279936.84 frames. ], batch size: 548, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:17:11,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=534354.0, ans=0.0 2023-06-19 19:17:22,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=534414.0, ans=0.0 2023-06-19 19:18:03,130 INFO [train.py:996] (1/4) Epoch 3, batch 28100, loss[loss=0.2747, simple_loss=0.318, pruned_loss=0.1157, over 21597.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3352, pruned_loss=0.09967, over 4275584.90 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:18:21,032 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.997e+02 3.623e+02 4.325e+02 7.130e+02, threshold=7.246e+02, percent-clipped=1.0 2023-06-19 19:19:01,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.10 vs. limit=15.0 2023-06-19 19:19:03,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-19 19:19:04,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=534714.0, ans=0.125 2023-06-19 19:19:29,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534774.0, ans=0.1 2023-06-19 19:19:44,119 INFO [train.py:996] (1/4) Epoch 3, batch 28150, loss[loss=0.2434, simple_loss=0.2968, pruned_loss=0.09507, over 21568.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3287, pruned_loss=0.1001, over 4270571.98 frames. ], batch size: 332, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:19:45,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-06-19 19:19:51,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=534834.0, ans=0.0 2023-06-19 19:20:56,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=535014.0, ans=0.2 2023-06-19 19:21:12,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=535074.0, ans=0.0 2023-06-19 19:21:20,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-19 19:21:24,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=535074.0, ans=0.5 2023-06-19 19:21:27,383 INFO [train.py:996] (1/4) Epoch 3, batch 28200, loss[loss=0.2573, simple_loss=0.3176, pruned_loss=0.0985, over 21424.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.327, pruned_loss=0.1024, over 4272602.03 frames. ], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:21:50,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.171e+02 3.933e+02 4.825e+02 1.002e+03, threshold=7.866e+02, percent-clipped=3.0 2023-06-19 19:22:07,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-19 19:22:54,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=535374.0, ans=0.0 2023-06-19 19:23:01,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=535374.0, ans=0.125 2023-06-19 19:23:19,288 INFO [train.py:996] (1/4) Epoch 3, batch 28250, loss[loss=0.2717, simple_loss=0.3304, pruned_loss=0.1065, over 21869.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3331, pruned_loss=0.1063, over 4261198.50 frames. ], batch size: 107, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:23:21,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=535434.0, ans=0.0 2023-06-19 19:23:23,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-19 19:23:44,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=535494.0, ans=0.0 2023-06-19 19:23:49,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=535494.0, ans=0.125 2023-06-19 19:23:58,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=535554.0, ans=0.0 2023-06-19 19:24:40,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535674.0, ans=0.1 2023-06-19 19:25:00,420 INFO [train.py:996] (1/4) Epoch 3, batch 28300, loss[loss=0.2882, simple_loss=0.3668, pruned_loss=0.1048, over 21521.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.331, pruned_loss=0.1039, over 4259392.41 frames. 
], batch size: 508, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:25:13,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.857e+02 3.337e+02 4.140e+02 8.167e+02, threshold=6.674e+02, percent-clipped=3.0 2023-06-19 19:26:11,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535914.0, ans=0.1 2023-06-19 19:26:43,615 INFO [train.py:996] (1/4) Epoch 3, batch 28350, loss[loss=0.223, simple_loss=0.2766, pruned_loss=0.08465, over 21243.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3264, pruned_loss=0.09721, over 4259277.14 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:26:50,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=536034.0, ans=0.125 2023-06-19 19:28:25,743 INFO [train.py:996] (1/4) Epoch 3, batch 28400, loss[loss=0.2712, simple_loss=0.3432, pruned_loss=0.09962, over 19967.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3238, pruned_loss=0.09707, over 4259516.33 frames. ], batch size: 703, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:28:29,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=536334.0, ans=0.125 2023-06-19 19:28:44,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.716e+02 3.452e+02 4.220e+02 6.740e+02, threshold=6.905e+02, percent-clipped=2.0 2023-06-19 19:28:48,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=536394.0, ans=0.125 2023-06-19 19:29:46,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=536574.0, ans=0.1 2023-06-19 19:30:02,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=536634.0, ans=0.04949747468305833 2023-06-19 19:30:03,697 INFO [train.py:996] (1/4) Epoch 3, batch 28450, loss[loss=0.3191, simple_loss=0.378, pruned_loss=0.1301, over 21343.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.331, pruned_loss=0.1026, over 4264663.97 frames. ], batch size: 549, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:30:13,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=536634.0, ans=0.125 2023-06-19 19:30:21,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=536634.0, ans=0.125 2023-06-19 19:30:52,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-19 19:31:42,083 INFO [train.py:996] (1/4) Epoch 3, batch 28500, loss[loss=0.3361, simple_loss=0.3886, pruned_loss=0.1418, over 21686.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3338, pruned_loss=0.1057, over 4266272.19 frames. 
], batch size: 389, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:31:57,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=536934.0, ans=0.125 2023-06-19 19:32:01,949 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.952e+02 3.630e+02 4.610e+02 9.107e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 19:32:03,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536994.0, ans=0.1 2023-06-19 19:32:08,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=536994.0, ans=0.125 2023-06-19 19:32:21,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=536994.0, ans=0.0 2023-06-19 19:33:05,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=537114.0, ans=0.125 2023-06-19 19:33:30,073 INFO [train.py:996] (1/4) Epoch 3, batch 28550, loss[loss=0.3821, simple_loss=0.4437, pruned_loss=0.1603, over 21393.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3439, pruned_loss=0.1105, over 4268655.80 frames. ], batch size: 507, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:33:56,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537294.0, ans=0.1 2023-06-19 19:34:01,210 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:34:26,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537354.0, ans=0.125 2023-06-19 19:35:12,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=537534.0, ans=0.125 2023-06-19 19:35:14,041 INFO [train.py:996] (1/4) Epoch 3, batch 28600, loss[loss=0.2824, simple_loss=0.3424, pruned_loss=0.1112, over 21831.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.351, pruned_loss=0.1124, over 4266287.12 frames. ], batch size: 282, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:35:38,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.072e+02 3.686e+02 4.724e+02 8.342e+02, threshold=7.372e+02, percent-clipped=3.0 2023-06-19 19:36:05,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=537654.0, ans=0.0 2023-06-19 19:36:16,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=537714.0, ans=0.2 2023-06-19 19:36:24,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=537714.0, ans=0.125 2023-06-19 19:36:27,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=537714.0, ans=0.0 2023-06-19 19:36:34,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=537714.0, ans=0.0 2023-06-19 19:37:00,661 INFO [train.py:996] (1/4) Epoch 3, batch 28650, loss[loss=0.2453, simple_loss=0.2951, pruned_loss=0.09776, over 21567.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3449, pruned_loss=0.1111, over 4266147.47 frames. 
], batch size: 298, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:37:04,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=537834.0, ans=0.125 2023-06-19 19:37:12,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537834.0, ans=0.125 2023-06-19 19:37:35,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=537894.0, ans=0.125 2023-06-19 19:37:49,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=537954.0, ans=0.125 2023-06-19 19:37:59,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=538014.0, ans=0.125 2023-06-19 19:38:04,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538014.0, ans=0.1 2023-06-19 19:38:15,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=538014.0, ans=0.125 2023-06-19 19:38:42,130 INFO [train.py:996] (1/4) Epoch 3, batch 28700, loss[loss=0.3196, simple_loss=0.3775, pruned_loss=0.1308, over 21849.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3443, pruned_loss=0.1119, over 4269879.66 frames. ], batch size: 124, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:38:57,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=538134.0, ans=0.0 2023-06-19 19:39:01,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.971e+02 3.318e+02 4.254e+02 6.959e+02, threshold=6.637e+02, percent-clipped=0.0 2023-06-19 19:39:02,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538194.0, ans=0.1 2023-06-19 19:39:05,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=538194.0, ans=0.025 2023-06-19 19:39:25,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=538254.0, ans=0.1 2023-06-19 19:39:36,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=538254.0, ans=0.2 2023-06-19 19:39:36,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538254.0, ans=0.0 2023-06-19 19:39:40,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=538314.0, ans=0.05 2023-06-19 19:40:01,667 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-19 19:40:06,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538374.0, ans=0.1 2023-06-19 19:40:24,347 INFO [train.py:996] (1/4) Epoch 3, batch 28750, loss[loss=0.2756, simple_loss=0.385, pruned_loss=0.08306, over 19833.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3444, pruned_loss=0.1119, over 4268800.80 frames. 
], batch size: 702, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:40:54,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=538494.0, ans=0.05 2023-06-19 19:41:02,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538494.0, ans=0.1 2023-06-19 19:41:29,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=538614.0, ans=0.0 2023-06-19 19:42:11,936 INFO [train.py:996] (1/4) Epoch 3, batch 28800, loss[loss=0.2767, simple_loss=0.3444, pruned_loss=0.1045, over 21291.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3465, pruned_loss=0.1121, over 4266191.66 frames. ], batch size: 143, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:42:15,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-19 19:42:31,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.000e+02 3.700e+02 5.247e+02 1.056e+03, threshold=7.400e+02, percent-clipped=15.0 2023-06-19 19:43:18,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=538914.0, ans=0.125 2023-06-19 19:43:47,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=538974.0, ans=0.2 2023-06-19 19:43:59,882 INFO [train.py:996] (1/4) Epoch 3, batch 28850, loss[loss=0.2989, simple_loss=0.3529, pruned_loss=0.1225, over 21416.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.35, pruned_loss=0.1149, over 4275744.49 frames. ], batch size: 176, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:44:22,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=8.0 2023-06-19 19:44:49,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=539154.0, ans=0.0 2023-06-19 19:45:43,210 INFO [train.py:996] (1/4) Epoch 3, batch 28900, loss[loss=0.3341, simple_loss=0.3876, pruned_loss=0.1403, over 21750.00 frames. ], tot_loss[loss=0.2931, simple_loss=0.353, pruned_loss=0.1166, over 4279720.76 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:45:43,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539334.0, ans=0.125 2023-06-19 19:45:45,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=539334.0, ans=0.0 2023-06-19 19:45:46,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=539334.0, ans=0.125 2023-06-19 19:45:58,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.259e+02 3.859e+02 4.928e+02 8.850e+02, threshold=7.718e+02, percent-clipped=4.0 2023-06-19 19:46:15,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.50 vs. 
limit=15.0 2023-06-19 19:46:28,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=539454.0, ans=0.125 2023-06-19 19:47:26,879 INFO [train.py:996] (1/4) Epoch 3, batch 28950, loss[loss=0.3458, simple_loss=0.4209, pruned_loss=0.1354, over 21445.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3507, pruned_loss=0.1142, over 4281509.23 frames. ], batch size: 507, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:47:48,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539694.0, ans=0.125 2023-06-19 19:47:50,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=539694.0, ans=0.125 2023-06-19 19:48:14,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=539754.0, ans=0.2 2023-06-19 19:48:44,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=539814.0, ans=0.2 2023-06-19 19:48:46,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539814.0, ans=0.125 2023-06-19 19:48:50,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-19 19:49:03,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.88 vs. limit=22.5 2023-06-19 19:49:13,084 INFO [train.py:996] (1/4) Epoch 3, batch 29000, loss[loss=0.3059, simple_loss=0.3684, pruned_loss=0.1217, over 21707.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3517, pruned_loss=0.1132, over 4280533.98 frames. ], batch size: 351, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:49:27,886 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.902e+02 3.366e+02 4.190e+02 7.172e+02, threshold=6.731e+02, percent-clipped=0.0 2023-06-19 19:50:19,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=540114.0, ans=0.125 2023-06-19 19:50:47,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=540174.0, ans=0.2 2023-06-19 19:50:55,865 INFO [train.py:996] (1/4) Epoch 3, batch 29050, loss[loss=0.3234, simple_loss=0.3589, pruned_loss=0.144, over 21786.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3525, pruned_loss=0.1141, over 4278267.46 frames. ], batch size: 508, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:51:27,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=540294.0, ans=0.0 2023-06-19 19:52:08,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=540414.0, ans=0.125 2023-06-19 19:52:08,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-19 19:52:38,109 INFO [train.py:996] (1/4) Epoch 3, batch 29100, loss[loss=0.2146, simple_loss=0.271, pruned_loss=0.07909, over 21632.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3439, pruned_loss=0.1114, over 4278648.03 frames. 
], batch size: 247, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:52:57,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.942e+02 3.636e+02 4.444e+02 9.761e+02, threshold=7.273e+02, percent-clipped=4.0 2023-06-19 19:53:23,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=540654.0, ans=0.1 2023-06-19 19:53:45,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=540714.0, ans=0.0 2023-06-19 19:54:07,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=540774.0, ans=0.0 2023-06-19 19:54:18,755 INFO [train.py:996] (1/4) Epoch 3, batch 29150, loss[loss=0.2459, simple_loss=0.3011, pruned_loss=0.09532, over 14940.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3395, pruned_loss=0.1082, over 4275854.96 frames. ], batch size: 60, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:54:33,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=540834.0, ans=0.125 2023-06-19 19:55:58,476 INFO [train.py:996] (1/4) Epoch 3, batch 29200, loss[loss=0.2955, simple_loss=0.3449, pruned_loss=0.1231, over 21397.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3363, pruned_loss=0.1076, over 4279364.50 frames. ], batch size: 508, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:55:59,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=541134.0, ans=0.125 2023-06-19 19:56:00,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=541134.0, ans=0.0 2023-06-19 19:56:05,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=541134.0, ans=0.0 2023-06-19 19:56:18,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.074e+02 3.815e+02 4.848e+02 9.248e+02, threshold=7.630e+02, percent-clipped=3.0 2023-06-19 19:56:19,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=541194.0, ans=0.04949747468305833 2023-06-19 19:56:19,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=541194.0, ans=0.0 2023-06-19 19:56:21,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=541194.0, ans=0.125 2023-06-19 19:57:41,213 INFO [train.py:996] (1/4) Epoch 3, batch 29250, loss[loss=0.2157, simple_loss=0.2915, pruned_loss=0.06989, over 21436.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3351, pruned_loss=0.1047, over 4280348.99 frames. ], batch size: 212, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:57:43,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=541434.0, ans=0.0 2023-06-19 19:58:43,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=541554.0, ans=0.0 2023-06-19 19:59:08,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=541674.0, ans=0.125 2023-06-19 19:59:28,961 INFO [train.py:996] (1/4) Epoch 3, batch 29300, loss[loss=0.2757, simple_loss=0.3553, pruned_loss=0.09799, over 21697.00 frames. 
], tot_loss[loss=0.272, simple_loss=0.3365, pruned_loss=0.1037, over 4277219.67 frames. ], batch size: 298, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:59:49,541 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.065e+02 3.693e+02 4.587e+02 7.138e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-19 20:00:53,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=541974.0, ans=0.1 2023-06-19 20:01:11,336 INFO [train.py:996] (1/4) Epoch 3, batch 29350, loss[loss=0.2394, simple_loss=0.3236, pruned_loss=0.0776, over 21706.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3327, pruned_loss=0.1025, over 4275200.79 frames. ], batch size: 298, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:01:13,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=542034.0, ans=0.125 2023-06-19 20:01:15,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=542034.0, ans=0.0 2023-06-19 20:01:27,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=542034.0, ans=0.125 2023-06-19 20:01:58,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=542154.0, ans=0.0 2023-06-19 20:02:59,325 INFO [train.py:996] (1/4) Epoch 3, batch 29400, loss[loss=0.234, simple_loss=0.3044, pruned_loss=0.08176, over 21772.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3306, pruned_loss=0.09896, over 4263916.92 frames. ], batch size: 333, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:03:01,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=542334.0, ans=0.2 2023-06-19 20:03:09,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=542334.0, ans=0.125 2023-06-19 20:03:21,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.918e+02 3.507e+02 4.489e+02 7.938e+02, threshold=7.015e+02, percent-clipped=2.0 2023-06-19 20:03:52,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-19 20:04:09,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=542514.0, ans=0.125 2023-06-19 20:04:10,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=542514.0, ans=0.0 2023-06-19 20:04:24,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=542574.0, ans=0.1 2023-06-19 20:04:43,521 INFO [train.py:996] (1/4) Epoch 3, batch 29450, loss[loss=0.344, simple_loss=0.3988, pruned_loss=0.1446, over 21394.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3297, pruned_loss=0.09854, over 4268223.75 frames. 
], batch size: 471, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:04:50,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=542634.0, ans=0.2 2023-06-19 20:05:00,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=542634.0, ans=0.0 2023-06-19 20:05:27,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=542754.0, ans=0.0 2023-06-19 20:06:04,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=12.0 2023-06-19 20:06:04,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=22.5 2023-06-19 20:06:29,857 INFO [train.py:996] (1/4) Epoch 3, batch 29500, loss[loss=0.2639, simple_loss=0.3159, pruned_loss=0.106, over 21315.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3343, pruned_loss=0.1024, over 4277631.18 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:06:39,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-19 20:06:45,767 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.087e+02 3.959e+02 5.251e+02 8.059e+02, threshold=7.918e+02, percent-clipped=6.0 2023-06-19 20:06:46,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-19 20:07:27,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=543114.0, ans=0.125 2023-06-19 20:07:39,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=543114.0, ans=0.0 2023-06-19 20:07:44,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=543174.0, ans=0.125 2023-06-19 20:07:57,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=543174.0, ans=0.2 2023-06-19 20:08:10,035 INFO [train.py:996] (1/4) Epoch 3, batch 29550, loss[loss=0.275, simple_loss=0.3273, pruned_loss=0.1113, over 21721.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3353, pruned_loss=0.1045, over 4279824.32 frames. ], batch size: 230, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:08:23,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=543234.0, ans=0.125 2023-06-19 20:08:37,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543294.0, ans=0.1 2023-06-19 20:08:39,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-19 20:09:54,118 INFO [train.py:996] (1/4) Epoch 3, batch 29600, loss[loss=0.302, simple_loss=0.3732, pruned_loss=0.1154, over 21633.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3437, pruned_loss=0.1077, over 4287617.83 frames. 
], batch size: 230, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:10:15,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.027e+02 3.599e+02 4.338e+02 7.072e+02, threshold=7.197e+02, percent-clipped=0.0 2023-06-19 20:10:56,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-19 20:11:34,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=543774.0, ans=0.1 2023-06-19 20:11:36,783 INFO [train.py:996] (1/4) Epoch 3, batch 29650, loss[loss=0.2366, simple_loss=0.3012, pruned_loss=0.08596, over 21516.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.34, pruned_loss=0.1035, over 4288895.01 frames. ], batch size: 195, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:11:37,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=543834.0, ans=0.0 2023-06-19 20:11:44,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-19 20:12:41,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=544014.0, ans=0.2 2023-06-19 20:13:12,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2023-06-19 20:13:17,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=544074.0, ans=0.0 2023-06-19 20:13:20,491 INFO [train.py:996] (1/4) Epoch 3, batch 29700, loss[loss=0.3072, simple_loss=0.408, pruned_loss=0.1032, over 21776.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3423, pruned_loss=0.104, over 4293038.65 frames. ], batch size: 332, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:13:41,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.649e+02 2.987e+02 3.970e+02 7.304e+02, threshold=5.973e+02, percent-clipped=1.0 2023-06-19 20:13:56,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-19 20:14:06,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=544254.0, ans=0.04949747468305833 2023-06-19 20:14:42,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=544374.0, ans=0.1 2023-06-19 20:14:59,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=544374.0, ans=0.0 2023-06-19 20:15:01,828 INFO [train.py:996] (1/4) Epoch 3, batch 29750, loss[loss=0.2609, simple_loss=0.339, pruned_loss=0.09143, over 21481.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3452, pruned_loss=0.1036, over 4290861.60 frames. 
], batch size: 211, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:15:14,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=544434.0, ans=0.1 2023-06-19 20:15:14,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=544434.0, ans=0.0 2023-06-19 20:15:27,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=544494.0, ans=0.125 2023-06-19 20:15:33,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=544494.0, ans=0.125 2023-06-19 20:15:48,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=544554.0, ans=0.0 2023-06-19 20:16:25,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=544674.0, ans=0.0 2023-06-19 20:16:47,594 INFO [train.py:996] (1/4) Epoch 3, batch 29800, loss[loss=0.273, simple_loss=0.3356, pruned_loss=0.1052, over 21911.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3468, pruned_loss=0.1044, over 4294169.03 frames. ], batch size: 371, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:17:05,117 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.342e+02 4.045e+02 4.978e+02 1.039e+03, threshold=8.090e+02, percent-clipped=10.0 2023-06-19 20:17:22,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=544854.0, ans=0.0 2023-06-19 20:18:13,115 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:18:22,249 INFO [train.py:996] (1/4) Epoch 3, batch 29850, loss[loss=0.2692, simple_loss=0.3204, pruned_loss=0.109, over 21548.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3428, pruned_loss=0.1023, over 4285397.19 frames. ], batch size: 548, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:19:16,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=545154.0, ans=0.125 2023-06-19 20:19:23,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-19 20:19:52,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-19 20:20:08,631 INFO [train.py:996] (1/4) Epoch 3, batch 29900, loss[loss=0.217, simple_loss=0.2494, pruned_loss=0.09233, over 20083.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.341, pruned_loss=0.1042, over 4288860.25 frames. ], batch size: 703, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:20:26,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.681e+02 3.110e+02 3.688e+02 5.256e+02, threshold=6.220e+02, percent-clipped=0.0 2023-06-19 20:20:49,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=545454.0, ans=0.125 2023-06-19 20:21:06,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. 
limit=22.5 2023-06-19 20:21:14,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=545514.0, ans=0.0 2023-06-19 20:21:17,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=545514.0, ans=0.035 2023-06-19 20:21:22,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=545514.0, ans=0.0 2023-06-19 20:21:24,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-19 20:21:40,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=545574.0, ans=0.1 2023-06-19 20:21:46,636 INFO [train.py:996] (1/4) Epoch 3, batch 29950, loss[loss=0.3055, simple_loss=0.3617, pruned_loss=0.1246, over 21442.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3443, pruned_loss=0.1082, over 4286123.29 frames. ], batch size: 211, lr: 9.99e-03, grad_scale: 16.0 2023-06-19 20:22:27,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=545754.0, ans=0.2 2023-06-19 20:22:59,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=545814.0, ans=0.125 2023-06-19 20:23:11,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=545874.0, ans=0.125 2023-06-19 20:23:29,415 INFO [train.py:996] (1/4) Epoch 3, batch 30000, loss[loss=0.28, simple_loss=0.3722, pruned_loss=0.0939, over 21476.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3475, pruned_loss=0.1091, over 4286841.72 frames. ], batch size: 471, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:23:29,416 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 20:23:43,843 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([0.5436, 1.2051, 1.7478, 1.4012, 1.1329, 1.8480, 1.7952, 1.1686], device='cuda:1') 2023-06-19 20:23:45,895 INFO [train.py:1028] (1/4) Epoch 3, validation: loss=0.254, simple_loss=0.3581, pruned_loss=0.075, over 1796401.00 frames. 2023-06-19 20:23:45,895 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 20:24:15,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.901e+02 3.447e+02 4.272e+02 9.118e+02, threshold=6.893e+02, percent-clipped=6.0 2023-06-19 20:25:02,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=546114.0, ans=0.2 2023-06-19 20:25:35,463 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:25:43,251 INFO [train.py:996] (1/4) Epoch 3, batch 30050, loss[loss=0.2892, simple_loss=0.4128, pruned_loss=0.0828, over 20805.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3507, pruned_loss=0.1061, over 4279914.18 frames. 
], batch size: 607, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:26:32,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=546354.0, ans=0.0 2023-06-19 20:27:00,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=546414.0, ans=0.0 2023-06-19 20:27:22,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=546534.0, ans=0.125 2023-06-19 20:27:23,637 INFO [train.py:996] (1/4) Epoch 3, batch 30100, loss[loss=0.3052, simple_loss=0.3373, pruned_loss=0.1366, over 21479.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3506, pruned_loss=0.1058, over 4272106.21 frames. ], batch size: 441, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:27:36,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-19 20:27:46,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.970e+02 3.475e+02 4.229e+02 7.609e+02, threshold=6.950e+02, percent-clipped=3.0 2023-06-19 20:27:59,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=546594.0, ans=0.1 2023-06-19 20:29:11,560 INFO [train.py:996] (1/4) Epoch 3, batch 30150, loss[loss=0.2761, simple_loss=0.3376, pruned_loss=0.1073, over 21666.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3454, pruned_loss=0.107, over 4266013.73 frames. ], batch size: 351, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:29:35,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=546894.0, ans=0.125 2023-06-19 20:29:35,820 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:30:26,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=547014.0, ans=0.125 2023-06-19 20:30:32,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-19 20:30:39,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547074.0, ans=0.1 2023-06-19 20:30:47,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=547074.0, ans=0.1 2023-06-19 20:31:01,064 INFO [train.py:996] (1/4) Epoch 3, batch 30200, loss[loss=0.258, simple_loss=0.3371, pruned_loss=0.08943, over 21733.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.346, pruned_loss=0.1055, over 4264322.15 frames. 
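In each training entry, loss[...] is the loss on the current batch while tot_loss[... over N frames] is a running, frame-weighted aggregate, which is why the frame count next to tot_loss keeps growing into the millions. A small sketch of such an aggregator (field names are illustrative, not the project's metrics tracker):

class RunningLoss:
    # Accumulates frame-weighted sums so that value() returns per-frame averages.
    def __init__(self):
        self.sums = {}
        self.frames = 0.0

    def update(self, losses, num_frames):
        # `losses` holds per-frame averages for one batch, e.g.
        # {"loss": 0.2785, "simple_loss": 0.346, "pruned_loss": 0.1055}.
        for k, v in losses.items():
            self.sums[k] = self.sums.get(k, 0.0) + v * num_frames
        self.frames += num_frames

    def value(self):
        return {k: s / self.frames for k, s in self.sums.items()}

Tracked this way, a single noisy batch moves tot_loss only in proportion to its frame count, which matches the smooth tot_loss values seen between consecutive logged batches.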
], batch size: 247, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:31:17,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=547194.0, ans=0.125 2023-06-19 20:31:20,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.884e+02 3.477e+02 4.360e+02 6.992e+02, threshold=6.953e+02, percent-clipped=1.0 2023-06-19 20:31:38,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=547194.0, ans=0.125 2023-06-19 20:32:45,957 INFO [train.py:996] (1/4) Epoch 3, batch 30250, loss[loss=0.321, simple_loss=0.4156, pruned_loss=0.1132, over 21919.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3558, pruned_loss=0.1091, over 4261040.87 frames. ], batch size: 372, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:32:56,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-19 20:34:17,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=547674.0, ans=0.125 2023-06-19 20:34:29,298 INFO [train.py:996] (1/4) Epoch 3, batch 30300, loss[loss=0.2454, simple_loss=0.2926, pruned_loss=0.09907, over 21239.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3533, pruned_loss=0.1083, over 4261229.22 frames. ], batch size: 159, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:34:52,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.189e+02 3.746e+02 4.977e+02 8.102e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-19 20:35:54,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=547914.0, ans=0.125 2023-06-19 20:35:59,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=547974.0, ans=0.2 2023-06-19 20:36:13,941 INFO [train.py:996] (1/4) Epoch 3, batch 30350, loss[loss=0.2477, simple_loss=0.3131, pruned_loss=0.09119, over 21666.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3534, pruned_loss=0.1093, over 4268966.39 frames. ], batch size: 247, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:36:40,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=548094.0, ans=0.0 2023-06-19 20:37:21,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=548214.0, ans=0.125 2023-06-19 20:37:43,098 INFO [train.py:996] (1/4) Epoch 3, batch 30400, loss[loss=0.2536, simple_loss=0.2999, pruned_loss=0.1036, over 20321.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3479, pruned_loss=0.1073, over 4258710.72 frames. 
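The optim.py:471 lines report quartiles of recently observed gradient norms, the clipping threshold currently in force, and the fraction of recent updates that exceeded it (percent-clipped). The real logic lives inside the project's optimizer; the snippet below is only a generic sketch of deriving a clipping threshold from a window of past norms, with an invented window size and multiplier.

import torch
from collections import deque

class GradNormClipper:
    def __init__(self, window=128, quantile=0.75, multiplier=2.0):
        self.history = deque(maxlen=window)   # recent total gradient norms
        self.quantile = quantile
        self.multiplier = multiplier
        self.clipped = 0
        self.seen = 0

    def clip_(self, parameters):
        parameters = [p for p in parameters if p.grad is not None]
        norm = torch.cat([p.grad.detach().flatten() for p in parameters]).norm().item()
        self.history.append(norm)
        threshold = self.multiplier * torch.quantile(
            torch.tensor(list(self.history)), self.quantile).item()
        self.seen += 1
        if norm > threshold:
            self.clipped += 1
            for p in parameters:
                p.grad.mul_(threshold / norm)   # rescale so the total norm equals the threshold
        return norm, threshold, 100.0 * self.clipped / self.seen

The logged percent-clipped is the analogous running percentage over recent batches.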
], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:37:46,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=548334.0, ans=0.125 2023-06-19 20:37:59,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.466e+02 4.166e+02 5.135e+02 9.055e+02, threshold=8.331e+02, percent-clipped=4.0 2023-06-19 20:38:00,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=548394.0, ans=0.2 2023-06-19 20:38:27,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=548514.0, ans=0.0 2023-06-19 20:38:54,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=548574.0, ans=0.2 2023-06-19 20:39:02,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=548634.0, ans=0.95 2023-06-19 20:39:04,605 INFO [train.py:996] (1/4) Epoch 3, batch 30450, loss[loss=0.3531, simple_loss=0.4431, pruned_loss=0.1315, over 19922.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3495, pruned_loss=0.1077, over 4200055.26 frames. ], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:41:58,168 INFO [train.py:996] (1/4) Epoch 4, batch 0, loss[loss=0.2773, simple_loss=0.3261, pruned_loss=0.1143, over 21634.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3261, pruned_loss=0.1143, over 21634.00 frames. ], batch size: 196, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:41:58,169 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 20:42:15,972 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3698, pruned_loss=0.07632, over 1796401.00 frames. 2023-06-19 20:42:15,973 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 20:42:45,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 5.518e+02 8.293e+02 1.240e+03 3.012e+03, threshold=1.659e+03, percent-clipped=49.0 2023-06-19 20:43:27,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=549084.0, ans=0.125 2023-06-19 20:43:47,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=549144.0, ans=0.125 2023-06-19 20:43:52,588 INFO [train.py:996] (1/4) Epoch 4, batch 50, loss[loss=0.3204, simple_loss=0.3971, pruned_loss=0.1219, over 21780.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.348, pruned_loss=0.1054, over 956243.80 frames. ], batch size: 351, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:44:12,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549264.0, ans=0.125 2023-06-19 20:45:04,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=549384.0, ans=0.125 2023-06-19 20:45:19,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-19 20:45:33,214 INFO [train.py:996] (1/4) Epoch 4, batch 100, loss[loss=0.4484, simple_loss=0.4729, pruned_loss=0.2119, over 21346.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3664, pruned_loss=0.1104, over 1699144.33 frames. 
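Every batch line also carries grad_scale (32.0 and 16.0 in the entries above), the dynamic loss-scaling factor used for fp16/mixed-precision training: it is grown while updates stay finite and halved when an overflow is detected. The sketch below uses the standard torch.cuda.amp pattern to show where such a value comes from; it is not the project's training loop.

import torch

scaler = torch.cuda.amp.GradScaler()           # owns the dynamic grad_scale

def train_step(model, optimizer, batch, device):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass in fp16 where safe
        loss = model(batch["features"].to(device), batch["labels"])
    scaler.scale(loss).backward()              # scale up to avoid fp16 gradient underflow
    scaler.step(optimizer)                     # unscales grads, skips the step on inf/nan
    scaler.update()                            # grow the scale, or halve it after an overflow
    return loss.detach(), scaler.get_scale()   # second value is what the log calls grad_scale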
], batch size: 507, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:45:42,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=549504.0, ans=0.0 2023-06-19 20:45:49,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549504.0, ans=0.1 2023-06-19 20:46:08,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.893e+02 3.441e+02 3.943e+02 7.428e+02, threshold=6.883e+02, percent-clipped=0.0 2023-06-19 20:46:17,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=549624.0, ans=0.125 2023-06-19 20:46:19,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549624.0, ans=0.125 2023-06-19 20:46:28,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=549624.0, ans=0.2 2023-06-19 20:46:45,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=549684.0, ans=0.125 2023-06-19 20:46:47,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549684.0, ans=0.125 2023-06-19 20:47:01,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=549744.0, ans=0.125 2023-06-19 20:47:07,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549744.0, ans=0.125 2023-06-19 20:47:13,439 INFO [train.py:996] (1/4) Epoch 4, batch 150, loss[loss=0.2723, simple_loss=0.3496, pruned_loss=0.09753, over 21398.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3636, pruned_loss=0.1075, over 2265572.05 frames. ], batch size: 211, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:47:13,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=549804.0, ans=0.125 2023-06-19 20:47:51,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=549864.0, ans=0.1 2023-06-19 20:47:56,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=549924.0, ans=0.125 2023-06-19 20:48:37,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550044.0, ans=0.125 2023-06-19 20:48:40,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=550044.0, ans=0.2 2023-06-19 20:48:44,212 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:48:46,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=550044.0, ans=0.0 2023-06-19 20:48:53,409 INFO [train.py:996] (1/4) Epoch 4, batch 200, loss[loss=0.3031, simple_loss=0.384, pruned_loss=0.1111, over 21725.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3625, pruned_loss=0.1089, over 2710479.15 frames. 
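The scaling.py:182 "ScheduledFloat: name=..., batch_count=..., ans=..." lines report the current value (ans) of hyperparameters such as dropout probabilities, skip rates and balancer limits that are scheduled against the global batch count rather than fixed. A toy version of such a schedule, as piecewise-linear interpolation over (batch_count, value) breakpoints, is shown below; the breakpoints are invented for illustration.

class PiecewiseLinearSchedule:
    # A float whose value is linearly interpolated between (batch_count, value) breakpoints.
    def __init__(self, points):
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)

dropout_p = PiecewiseLinearSchedule([(0, 0.3), (20000, 0.1), (100000, 0.1)])
print(dropout_p(549504))   # -> 0.1, analogous to a logged "ans=0.1" at this batch_count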
], batch size: 351, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:49:18,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=550164.0, ans=0.125 2023-06-19 20:49:29,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.787e+02 3.303e+02 4.395e+02 6.398e+02, threshold=6.606e+02, percent-clipped=0.0 2023-06-19 20:49:30,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 20:50:35,690 INFO [train.py:996] (1/4) Epoch 4, batch 250, loss[loss=0.2521, simple_loss=0.3405, pruned_loss=0.0819, over 21830.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3583, pruned_loss=0.1088, over 3068620.77 frames. ], batch size: 332, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:51:37,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=550524.0, ans=0.125 2023-06-19 20:51:49,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=550584.0, ans=0.2 2023-06-19 20:52:19,234 INFO [train.py:996] (1/4) Epoch 4, batch 300, loss[loss=0.2285, simple_loss=0.2835, pruned_loss=0.0867, over 21522.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.352, pruned_loss=0.1074, over 3337787.27 frames. ], batch size: 247, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:52:30,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=550704.0, ans=0.0 2023-06-19 20:52:57,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-19 20:52:57,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.088e+02 3.665e+02 5.063e+02 1.079e+03, threshold=7.330e+02, percent-clipped=8.0 2023-06-19 20:53:01,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=550764.0, ans=0.125 2023-06-19 20:53:07,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=550824.0, ans=0.035 2023-06-19 20:54:05,653 INFO [train.py:996] (1/4) Epoch 4, batch 350, loss[loss=0.2277, simple_loss=0.2886, pruned_loss=0.08339, over 21753.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.346, pruned_loss=0.1063, over 3552521.71 frames. ], batch size: 317, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:54:35,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-19 20:55:07,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-19 20:55:54,932 INFO [train.py:996] (1/4) Epoch 4, batch 400, loss[loss=0.2679, simple_loss=0.3107, pruned_loss=0.1126, over 20119.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3399, pruned_loss=0.1042, over 3715956.87 frames. ], batch size: 703, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:56:21,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.08 vs. 
limit=15.0 2023-06-19 20:56:22,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=15.0 2023-06-19 20:56:26,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.883e+02 3.575e+02 4.503e+02 7.615e+02, threshold=7.149e+02, percent-clipped=2.0 2023-06-19 20:57:20,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=551544.0, ans=0.02 2023-06-19 20:57:32,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=551544.0, ans=0.125 2023-06-19 20:57:37,292 INFO [train.py:996] (1/4) Epoch 4, batch 450, loss[loss=0.2758, simple_loss=0.371, pruned_loss=0.09029, over 21851.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3351, pruned_loss=0.1019, over 3839379.69 frames. ], batch size: 316, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:57:50,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=551604.0, ans=0.125 2023-06-19 20:58:25,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=551724.0, ans=0.2 2023-06-19 20:59:16,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=551844.0, ans=0.125 2023-06-19 20:59:19,178 INFO [train.py:996] (1/4) Epoch 4, batch 500, loss[loss=0.2522, simple_loss=0.3205, pruned_loss=0.09194, over 21261.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3341, pruned_loss=0.09948, over 3942189.60 frames. ], batch size: 159, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:59:44,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. limit=6.0 2023-06-19 20:59:53,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 2.948e+02 3.424e+02 4.506e+02 6.960e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-19 21:01:02,281 INFO [train.py:996] (1/4) Epoch 4, batch 550, loss[loss=0.2545, simple_loss=0.3499, pruned_loss=0.07951, over 21405.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.339, pruned_loss=0.1006, over 4010209.58 frames. ], batch size: 211, lr: 8.58e-03, grad_scale: 16.0 2023-06-19 21:01:14,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=552204.0, ans=0.0 2023-06-19 21:02:08,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552384.0, ans=0.1 2023-06-19 21:02:10,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-06-19 21:02:45,448 INFO [train.py:996] (1/4) Epoch 4, batch 600, loss[loss=0.2423, simple_loss=0.3154, pruned_loss=0.08458, over 21653.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3442, pruned_loss=0.1016, over 4066106.99 frames. 
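The scaling.py:962 "Whitening: ... metric=X vs. limit=Y" lines are diagnostics on how far a module's output covariance is from being white (isotropic), compared against an allowed limit. The exact metric is defined in the project's scaling.py; purely for illustration, one simple whiteness measure is the ratio of the mean squared eigenvalue of the feature covariance to the squared mean eigenvalue, which equals 1.0 for perfectly whitened features and grows when variance concentrates in a few directions.

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations; returns >= 1.0, and 1.0 means white.
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]            # (C, C) covariance estimate
    eigs = torch.linalg.eigvalsh(cov)       # eigenvalues of the symmetric covariance
    return ((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)).item()

Read this way, an entry such as "metric=14.50 vs. limit=15.0" says the features are close to, but still under, the allowed degree of non-whiteness.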
], batch size: 247, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:02:50,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=552504.0, ans=0.125 2023-06-19 21:03:17,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.276e+02 3.981e+02 4.951e+02 8.718e+02, threshold=7.962e+02, percent-clipped=3.0 2023-06-19 21:03:39,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552624.0, ans=0.1 2023-06-19 21:03:47,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 21:03:48,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-19 21:03:58,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=552684.0, ans=0.125 2023-06-19 21:04:28,072 INFO [train.py:996] (1/4) Epoch 4, batch 650, loss[loss=0.2694, simple_loss=0.3791, pruned_loss=0.07989, over 21666.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3407, pruned_loss=0.09999, over 4120325.74 frames. ], batch size: 389, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:04:56,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=552864.0, ans=0.0 2023-06-19 21:04:58,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=552864.0, ans=0.07 2023-06-19 21:04:59,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=552864.0, ans=0.2 2023-06-19 21:05:23,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=552924.0, ans=0.125 2023-06-19 21:05:26,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=552924.0, ans=0.125 2023-06-19 21:05:50,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-19 21:06:10,750 INFO [train.py:996] (1/4) Epoch 4, batch 700, loss[loss=0.2664, simple_loss=0.3287, pruned_loss=0.102, over 21807.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3412, pruned_loss=0.1007, over 4165701.38 frames. ], batch size: 118, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:06:43,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.407e+02 4.015e+02 5.310e+02 1.031e+03, threshold=8.030e+02, percent-clipped=3.0 2023-06-19 21:07:06,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=553224.0, ans=0.125 2023-06-19 21:07:48,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553344.0, ans=0.125 2023-06-19 21:07:52,966 INFO [train.py:996] (1/4) Epoch 4, batch 750, loss[loss=0.3168, simple_loss=0.429, pruned_loss=0.1023, over 20751.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3407, pruned_loss=0.1011, over 4196499.09 frames. 
], batch size: 607, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:08:37,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=553524.0, ans=0.0 2023-06-19 21:09:02,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=553584.0, ans=0.125 2023-06-19 21:09:30,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=553644.0, ans=0.125 2023-06-19 21:09:34,447 INFO [train.py:996] (1/4) Epoch 4, batch 800, loss[loss=0.2245, simple_loss=0.2825, pruned_loss=0.08326, over 21622.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3371, pruned_loss=0.1014, over 4218316.81 frames. ], batch size: 247, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:09:46,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=553704.0, ans=0.125 2023-06-19 21:10:07,066 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.089e+02 3.541e+02 4.418e+02 8.046e+02, threshold=7.083e+02, percent-clipped=1.0 2023-06-19 21:10:23,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=553824.0, ans=0.0 2023-06-19 21:10:34,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-19 21:10:40,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=553884.0, ans=0.0 2023-06-19 21:10:45,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553884.0, ans=0.1 2023-06-19 21:10:49,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-19 21:10:58,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=553944.0, ans=0.035 2023-06-19 21:11:01,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0 2023-06-19 21:11:17,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-19 21:11:18,242 INFO [train.py:996] (1/4) Epoch 4, batch 850, loss[loss=0.2308, simple_loss=0.2947, pruned_loss=0.08343, over 21514.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3352, pruned_loss=0.1019, over 4239170.41 frames. ], batch size: 212, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:12:53,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=554244.0, ans=0.04949747468305833 2023-06-19 21:13:02,642 INFO [train.py:996] (1/4) Epoch 4, batch 900, loss[loss=0.299, simple_loss=0.3504, pruned_loss=0.1238, over 21730.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3339, pruned_loss=0.1014, over 4248347.75 frames. 
], batch size: 507, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:13:17,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554304.0, ans=0.125 2023-06-19 21:13:35,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=554364.0, ans=0.125 2023-06-19 21:13:40,686 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.017e+02 3.559e+02 4.118e+02 8.031e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-19 21:14:06,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=12.0 2023-06-19 21:14:45,087 INFO [train.py:996] (1/4) Epoch 4, batch 950, loss[loss=0.2322, simple_loss=0.2864, pruned_loss=0.08902, over 21272.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3316, pruned_loss=0.09993, over 4260283.32 frames. ], batch size: 176, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:15:36,869 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-19 21:15:52,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=554784.0, ans=0.2 2023-06-19 21:15:58,847 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:16:27,596 INFO [train.py:996] (1/4) Epoch 4, batch 1000, loss[loss=0.21, simple_loss=0.3008, pruned_loss=0.05962, over 21724.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3307, pruned_loss=0.09979, over 4270825.67 frames. ], batch size: 351, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:17:12,744 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.951e+02 3.502e+02 4.133e+02 7.133e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-19 21:17:36,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555084.0, ans=0.1 2023-06-19 21:17:45,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-06-19 21:18:15,439 INFO [train.py:996] (1/4) Epoch 4, batch 1050, loss[loss=0.2688, simple_loss=0.3406, pruned_loss=0.09845, over 21801.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3321, pruned_loss=0.1003, over 4275275.44 frames. ], batch size: 298, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:18:53,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555264.0, ans=0.125 2023-06-19 21:19:57,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=555504.0, ans=0.0 2023-06-19 21:19:58,899 INFO [train.py:996] (1/4) Epoch 4, batch 1100, loss[loss=0.2589, simple_loss=0.3216, pruned_loss=0.09811, over 21771.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3313, pruned_loss=0.0989, over 4283194.91 frames. ], batch size: 247, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:20:03,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. 
limit=15.0 2023-06-19 21:20:08,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-19 21:20:18,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=555504.0, ans=0.125 2023-06-19 21:20:39,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.086e+02 3.737e+02 4.742e+02 7.537e+02, threshold=7.473e+02, percent-clipped=2.0 2023-06-19 21:21:23,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=555744.0, ans=0.0 2023-06-19 21:21:43,848 INFO [train.py:996] (1/4) Epoch 4, batch 1150, loss[loss=0.2422, simple_loss=0.3207, pruned_loss=0.08191, over 21734.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3316, pruned_loss=0.09844, over 4283783.38 frames. ], batch size: 247, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:21:54,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=555804.0, ans=0.125 2023-06-19 21:22:01,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555804.0, ans=0.1 2023-06-19 21:22:31,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=555924.0, ans=0.125 2023-06-19 21:23:05,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=556044.0, ans=0.1 2023-06-19 21:23:26,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 21:23:33,587 INFO [train.py:996] (1/4) Epoch 4, batch 1200, loss[loss=0.3052, simple_loss=0.3761, pruned_loss=0.1171, over 21785.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3329, pruned_loss=0.09833, over 4288802.13 frames. ], batch size: 124, lr: 8.55e-03, grad_scale: 32.0 2023-06-19 21:23:59,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556164.0, ans=0.1 2023-06-19 21:24:08,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.755e+02 3.087e+02 3.854e+02 6.716e+02, threshold=6.173e+02, percent-clipped=0.0 2023-06-19 21:24:12,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=556224.0, ans=0.125 2023-06-19 21:25:17,515 INFO [train.py:996] (1/4) Epoch 4, batch 1250, loss[loss=0.2373, simple_loss=0.2868, pruned_loss=0.09387, over 20158.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3357, pruned_loss=0.0992, over 4282160.11 frames. ], batch size: 703, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:25:52,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=556464.0, ans=0.125 2023-06-19 21:25:59,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=556524.0, ans=0.125 2023-06-19 21:26:34,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.66 vs. 
limit=15.0 2023-06-19 21:27:02,097 INFO [train.py:996] (1/4) Epoch 4, batch 1300, loss[loss=0.2797, simple_loss=0.3648, pruned_loss=0.0973, over 21736.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3371, pruned_loss=0.1005, over 4291200.01 frames. ], batch size: 391, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:27:36,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.941e+02 3.345e+02 4.151e+02 1.109e+03, threshold=6.689e+02, percent-clipped=6.0 2023-06-19 21:27:42,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-19 21:28:44,692 INFO [train.py:996] (1/4) Epoch 4, batch 1350, loss[loss=0.246, simple_loss=0.3348, pruned_loss=0.0786, over 21426.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3373, pruned_loss=0.1006, over 4291122.72 frames. ], batch size: 211, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:29:02,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=557004.0, ans=0.2 2023-06-19 21:29:03,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=557004.0, ans=0.0 2023-06-19 21:29:50,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-19 21:29:53,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557184.0, ans=0.1 2023-06-19 21:30:10,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=557244.0, ans=0.125 2023-06-19 21:30:27,823 INFO [train.py:996] (1/4) Epoch 4, batch 1400, loss[loss=0.2117, simple_loss=0.2806, pruned_loss=0.07138, over 21843.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3332, pruned_loss=0.09956, over 4292094.34 frames. ], batch size: 118, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:30:28,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-19 21:30:43,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=557304.0, ans=0.125 2023-06-19 21:31:00,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-19 21:31:03,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.007e+02 3.409e+02 4.154e+02 6.851e+02, threshold=6.817e+02, percent-clipped=4.0 2023-06-19 21:31:39,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=557484.0, ans=0.0 2023-06-19 21:31:41,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=557484.0, ans=0.125 2023-06-19 21:31:42,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. 
limit=6.0 2023-06-19 21:31:48,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=557484.0, ans=0.125 2023-06-19 21:32:08,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-19 21:32:10,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=557544.0, ans=0.09899494936611666 2023-06-19 21:32:12,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=10.0 2023-06-19 21:32:18,681 INFO [train.py:996] (1/4) Epoch 4, batch 1450, loss[loss=0.2403, simple_loss=0.3299, pruned_loss=0.0753, over 19897.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3322, pruned_loss=0.09991, over 4291006.32 frames. ], batch size: 702, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:32:21,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=557604.0, ans=0.0 2023-06-19 21:32:51,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-19 21:34:02,928 INFO [train.py:996] (1/4) Epoch 4, batch 1500, loss[loss=0.2213, simple_loss=0.281, pruned_loss=0.08084, over 20145.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.336, pruned_loss=0.1029, over 4292693.60 frames. ], batch size: 702, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:34:21,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-19 21:34:31,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=557964.0, ans=0.1 2023-06-19 21:34:33,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.980e+02 3.543e+02 4.143e+02 6.339e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-19 21:35:17,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=558084.0, ans=0.125 2023-06-19 21:35:33,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-19 21:35:45,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=558144.0, ans=0.07 2023-06-19 21:35:48,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558204.0, ans=0.1 2023-06-19 21:35:49,402 INFO [train.py:996] (1/4) Epoch 4, batch 1550, loss[loss=0.2686, simple_loss=0.3606, pruned_loss=0.08828, over 21661.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3338, pruned_loss=0.1008, over 4293100.73 frames. 
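The lr field shrinks slowly within an epoch (8.54e-03 down to 8.53e-03 across these batches) and steps down more visibly between epochs (1.00e-02 late in epoch 3 versus 8.60e-03 at the start of epoch 4), i.e. the schedule depends on both the batch index and the epoch index. The scheduler itself is part of the project's optimizer code; the function below is only one generic inverse-power decay with that qualitative shape, and its constants are made up.

def example_lr(base_lr, batch, epoch, decay_batches=5000.0, decay_epochs=2.0):
    # Illustrative smooth decay in both batch and epoch (not the project's scheduler).
    batch_factor = ((batch ** 2 + decay_batches ** 2) / decay_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + decay_epochs ** 2) / decay_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor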
], batch size: 414, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:36:39,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558324.0, ans=0.1 2023-06-19 21:37:27,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=558444.0, ans=0.125 2023-06-19 21:37:34,540 INFO [train.py:996] (1/4) Epoch 4, batch 1600, loss[loss=0.3206, simple_loss=0.3829, pruned_loss=0.1292, over 21847.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3337, pruned_loss=0.1005, over 4286544.31 frames. ], batch size: 372, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:38:15,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 2.993e+02 3.386e+02 4.443e+02 8.016e+02, threshold=6.773e+02, percent-clipped=2.0 2023-06-19 21:38:59,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=558684.0, ans=0.0 2023-06-19 21:38:59,879 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:39:07,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.46 vs. limit=6.0 2023-06-19 21:39:19,159 INFO [train.py:996] (1/4) Epoch 4, batch 1650, loss[loss=0.2849, simple_loss=0.3576, pruned_loss=0.1061, over 21765.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3331, pruned_loss=0.09988, over 4283976.67 frames. ], batch size: 332, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:39:40,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=558864.0, ans=0.2 2023-06-19 21:40:11,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558924.0, ans=0.1 2023-06-19 21:41:05,513 INFO [train.py:996] (1/4) Epoch 4, batch 1700, loss[loss=0.2822, simple_loss=0.368, pruned_loss=0.09816, over 21736.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3375, pruned_loss=0.1025, over 4285623.36 frames. ], batch size: 332, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:41:10,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-19 21:41:10,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=559104.0, ans=0.035 2023-06-19 21:41:32,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559164.0, ans=0.125 2023-06-19 21:41:47,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=559164.0, ans=0.2 2023-06-19 21:41:53,218 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.877e+02 3.357e+02 4.119e+02 6.244e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 21:42:24,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-19 21:42:56,581 INFO [train.py:996] (1/4) Epoch 4, batch 1750, loss[loss=0.2222, simple_loss=0.3063, pruned_loss=0.0691, over 21717.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3367, pruned_loss=0.09992, over 4283903.32 frames. 
], batch size: 298, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:43:21,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=559464.0, ans=0.125 2023-06-19 21:43:51,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-19 21:44:44,122 INFO [train.py:996] (1/4) Epoch 4, batch 1800, loss[loss=0.1701, simple_loss=0.2308, pruned_loss=0.05468, over 21312.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3352, pruned_loss=0.09783, over 4270536.81 frames. ], batch size: 131, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:45:02,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-19 21:45:05,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=559764.0, ans=0.0 2023-06-19 21:45:19,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=559764.0, ans=0.1 2023-06-19 21:45:27,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 3.069e+02 3.500e+02 4.481e+02 7.550e+02, threshold=6.999e+02, percent-clipped=2.0 2023-06-19 21:46:16,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.74 vs. limit=10.0 2023-06-19 21:46:34,182 INFO [train.py:996] (1/4) Epoch 4, batch 1850, loss[loss=0.3028, simple_loss=0.3753, pruned_loss=0.1151, over 21764.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3356, pruned_loss=0.09588, over 4263047.34 frames. ], batch size: 414, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:46:36,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=560004.0, ans=0.125 2023-06-19 21:46:40,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=560004.0, ans=0.125 2023-06-19 21:47:01,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=560064.0, ans=0.1 2023-06-19 21:47:06,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=560064.0, ans=0.0 2023-06-19 21:47:15,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=560124.0, ans=0.0 2023-06-19 21:48:17,776 INFO [train.py:996] (1/4) Epoch 4, batch 1900, loss[loss=0.2171, simple_loss=0.2953, pruned_loss=0.06946, over 21782.00 frames. ], tot_loss[loss=0.266, simple_loss=0.337, pruned_loss=0.09747, over 4264790.84 frames. ], batch size: 247, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:48:26,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. 
limit=15.0 2023-06-19 21:48:47,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=560364.0, ans=0.2 2023-06-19 21:48:53,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.385e+02 4.219e+02 8.098e+02, threshold=6.770e+02, percent-clipped=2.0 2023-06-19 21:49:27,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=560484.0, ans=0.125 2023-06-19 21:49:33,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560484.0, ans=0.1 2023-06-19 21:49:42,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=560544.0, ans=0.05 2023-06-19 21:49:59,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=560544.0, ans=0.2 2023-06-19 21:50:02,250 INFO [train.py:996] (1/4) Epoch 4, batch 1950, loss[loss=0.2344, simple_loss=0.2812, pruned_loss=0.09381, over 21605.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3348, pruned_loss=0.09708, over 4259306.91 frames. ], batch size: 231, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:50:05,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=560604.0, ans=0.1 2023-06-19 21:50:05,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=560604.0, ans=0.125 2023-06-19 21:50:47,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560724.0, ans=0.1 2023-06-19 21:50:56,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=560724.0, ans=0.125 2023-06-19 21:51:21,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=560784.0, ans=0.125 2023-06-19 21:51:39,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=560844.0, ans=0.125 2023-06-19 21:51:46,748 INFO [train.py:996] (1/4) Epoch 4, batch 2000, loss[loss=0.3739, simple_loss=0.4416, pruned_loss=0.1531, over 21453.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3292, pruned_loss=0.0956, over 4257681.58 frames. ], batch size: 471, lr: 8.51e-03, grad_scale: 32.0 2023-06-19 21:52:17,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=560964.0, ans=0.1 2023-06-19 21:52:24,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.002e+02 3.642e+02 4.364e+02 7.369e+02, threshold=7.284e+02, percent-clipped=1.0 2023-06-19 21:53:04,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=561084.0, ans=0.125 2023-06-19 21:53:30,405 INFO [train.py:996] (1/4) Epoch 4, batch 2050, loss[loss=0.2864, simple_loss=0.3375, pruned_loss=0.1177, over 21334.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3311, pruned_loss=0.09657, over 4266701.97 frames. 
], batch size: 159, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:53:49,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=561204.0, ans=0.2 2023-06-19 21:55:15,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-19 21:55:20,874 INFO [train.py:996] (1/4) Epoch 4, batch 2100, loss[loss=0.2279, simple_loss=0.291, pruned_loss=0.08237, over 21609.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3342, pruned_loss=0.09828, over 4267466.68 frames. ], batch size: 263, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:55:26,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=561504.0, ans=0.125 2023-06-19 21:55:33,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=561504.0, ans=0.1 2023-06-19 21:55:59,293 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.198e+02 3.847e+02 4.816e+02 7.420e+02, threshold=7.693e+02, percent-clipped=1.0 2023-06-19 21:56:28,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=561684.0, ans=0.2 2023-06-19 21:57:06,041 INFO [train.py:996] (1/4) Epoch 4, batch 2150, loss[loss=0.2102, simple_loss=0.2681, pruned_loss=0.07613, over 21539.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3326, pruned_loss=0.09958, over 4270968.29 frames. ], batch size: 263, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:57:30,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-19 21:57:52,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-19 21:58:07,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=561984.0, ans=0.125 2023-06-19 21:58:09,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=561984.0, ans=0.125 2023-06-19 21:58:27,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=562044.0, ans=0.125 2023-06-19 21:58:50,846 INFO [train.py:996] (1/4) Epoch 4, batch 2200, loss[loss=0.2301, simple_loss=0.2944, pruned_loss=0.08289, over 21289.00 frames. ], tot_loss[loss=0.269, simple_loss=0.336, pruned_loss=0.101, over 4277519.29 frames. ], batch size: 159, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:59:04,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=562104.0, ans=0.0 2023-06-19 21:59:16,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. 
limit=15.0 2023-06-19 21:59:28,563 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.061e+02 3.534e+02 4.711e+02 8.653e+02, threshold=7.068e+02, percent-clipped=2.0 2023-06-19 21:59:44,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=562224.0, ans=0.125 2023-06-19 21:59:46,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=562224.0, ans=0.125 2023-06-19 21:59:55,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=562284.0, ans=0.125 2023-06-19 22:00:18,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=562344.0, ans=0.125 2023-06-19 22:00:28,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=562404.0, ans=0.125 2023-06-19 22:00:29,279 INFO [train.py:996] (1/4) Epoch 4, batch 2250, loss[loss=0.2261, simple_loss=0.2859, pruned_loss=0.08315, over 21326.00 frames. ], tot_loss[loss=0.263, simple_loss=0.33, pruned_loss=0.09794, over 4273437.58 frames. ], batch size: 144, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:00:33,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=562404.0, ans=0.2 2023-06-19 22:00:53,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=562464.0, ans=0.0 2023-06-19 22:01:44,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-19 22:02:08,498 INFO [train.py:996] (1/4) Epoch 4, batch 2300, loss[loss=0.2475, simple_loss=0.3021, pruned_loss=0.09648, over 21717.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3262, pruned_loss=0.09746, over 4278337.62 frames. ], batch size: 283, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:02:20,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=562704.0, ans=0.125 2023-06-19 22:02:51,583 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.061e+02 3.548e+02 4.710e+02 1.046e+03, threshold=7.097e+02, percent-clipped=5.0 2023-06-19 22:03:05,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-06-19 22:03:44,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=562944.0, ans=0.1 2023-06-19 22:03:55,565 INFO [train.py:996] (1/4) Epoch 4, batch 2350, loss[loss=0.2234, simple_loss=0.2861, pruned_loss=0.08031, over 21397.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3238, pruned_loss=0.09745, over 4281511.93 frames. ], batch size: 211, lr: 8.49e-03, grad_scale: 16.0 2023-06-19 22:04:15,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=563064.0, ans=0.125 2023-06-19 22:05:39,325 INFO [train.py:996] (1/4) Epoch 4, batch 2400, loss[loss=0.2871, simple_loss=0.3544, pruned_loss=0.1099, over 21780.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3281, pruned_loss=0.09958, over 4279785.34 frames. 
], batch size: 124, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:06:19,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=563364.0, ans=0.125 2023-06-19 22:06:23,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.093e+02 3.486e+02 4.537e+02 7.539e+02, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:06:25,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=563424.0, ans=0.0 2023-06-19 22:07:09,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563544.0, ans=0.1 2023-06-19 22:07:19,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=563544.0, ans=0.04949747468305833 2023-06-19 22:07:22,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=563604.0, ans=0.125 2023-06-19 22:07:23,971 INFO [train.py:996] (1/4) Epoch 4, batch 2450, loss[loss=0.2839, simple_loss=0.3344, pruned_loss=0.1167, over 22010.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.335, pruned_loss=0.1018, over 4272383.80 frames. ], batch size: 103, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:08:54,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-19 22:08:56,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=563844.0, ans=0.125 2023-06-19 22:08:59,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563844.0, ans=0.1 2023-06-19 22:09:02,297 INFO [train.py:996] (1/4) Epoch 4, batch 2500, loss[loss=0.2358, simple_loss=0.3114, pruned_loss=0.08011, over 21721.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3317, pruned_loss=0.0999, over 4268752.60 frames. ], batch size: 112, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:09:10,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=563904.0, ans=0.2 2023-06-19 22:09:45,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.863e+02 3.660e+02 4.293e+02 8.660e+02, threshold=7.321e+02, percent-clipped=2.0 2023-06-19 22:10:34,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=564144.0, ans=0.0 2023-06-19 22:10:41,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=564144.0, ans=0.125 2023-06-19 22:10:45,485 INFO [train.py:996] (1/4) Epoch 4, batch 2550, loss[loss=0.3314, simple_loss=0.3949, pruned_loss=0.1339, over 21559.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3311, pruned_loss=0.09814, over 4266033.34 frames. 
], batch size: 471, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:10:50,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=564204.0, ans=0.2 2023-06-19 22:11:04,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=564204.0, ans=0.125 2023-06-19 22:11:13,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-19 22:12:26,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=564444.0, ans=0.125 2023-06-19 22:12:29,206 INFO [train.py:996] (1/4) Epoch 4, batch 2600, loss[loss=0.3038, simple_loss=0.3601, pruned_loss=0.1237, over 21423.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3348, pruned_loss=0.09999, over 4266029.20 frames. ], batch size: 194, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:12:41,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=564504.0, ans=0.125 2023-06-19 22:13:09,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=564564.0, ans=0.125 2023-06-19 22:13:12,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.048e+02 3.693e+02 4.515e+02 8.330e+02, threshold=7.386e+02, percent-clipped=1.0 2023-06-19 22:13:27,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=564684.0, ans=0.5 2023-06-19 22:14:11,619 INFO [train.py:996] (1/4) Epoch 4, batch 2650, loss[loss=0.2537, simple_loss=0.3033, pruned_loss=0.102, over 21408.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3359, pruned_loss=0.1018, over 4274113.54 frames. ], batch size: 476, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:14:11,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=564804.0, ans=0.0 2023-06-19 22:14:12,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=564804.0, ans=0.125 2023-06-19 22:14:58,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-19 22:15:05,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=564924.0, ans=0.0 2023-06-19 22:15:08,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=564924.0, ans=10.0 2023-06-19 22:15:20,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=564984.0, ans=0.0 2023-06-19 22:15:32,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=564984.0, ans=0.125 2023-06-19 22:15:56,997 INFO [train.py:996] (1/4) Epoch 4, batch 2700, loss[loss=0.1872, simple_loss=0.2462, pruned_loss=0.0641, over 21226.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3318, pruned_loss=0.1001, over 4277706.78 frames. 
], batch size: 176, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:16:39,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.006e+02 3.494e+02 4.497e+02 9.129e+02, threshold=6.988e+02, percent-clipped=4.0 2023-06-19 22:16:45,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=565224.0, ans=0.125 2023-06-19 22:17:37,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-19 22:17:40,803 INFO [train.py:996] (1/4) Epoch 4, batch 2750, loss[loss=0.2742, simple_loss=0.3351, pruned_loss=0.1066, over 21792.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.332, pruned_loss=0.1007, over 4281403.58 frames. ], batch size: 247, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:18:00,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=565404.0, ans=0.1 2023-06-19 22:18:34,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=565524.0, ans=0.125 2023-06-19 22:19:00,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=565584.0, ans=0.1 2023-06-19 22:19:32,207 INFO [train.py:996] (1/4) Epoch 4, batch 2800, loss[loss=0.2593, simple_loss=0.3138, pruned_loss=0.1024, over 21362.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3372, pruned_loss=0.1024, over 4288284.33 frames. ], batch size: 131, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:20:17,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.042e+02 3.463e+02 4.341e+02 7.810e+02, threshold=6.926e+02, percent-clipped=4.0 2023-06-19 22:20:24,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-19 22:20:33,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=565884.0, ans=0.0 2023-06-19 22:21:16,493 INFO [train.py:996] (1/4) Epoch 4, batch 2850, loss[loss=0.3391, simple_loss=0.3962, pruned_loss=0.141, over 21441.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3374, pruned_loss=0.1037, over 4284055.41 frames. ], batch size: 507, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:22:02,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-19 22:22:11,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=566124.0, ans=0.125 2023-06-19 22:22:35,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=566244.0, ans=0.0 2023-06-19 22:22:52,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566244.0, ans=0.1 2023-06-19 22:22:59,600 INFO [train.py:996] (1/4) Epoch 4, batch 2900, loss[loss=0.2938, simple_loss=0.3495, pruned_loss=0.119, over 21893.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3371, pruned_loss=0.1045, over 4293391.71 frames. 
], batch size: 107, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:23:12,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=566304.0, ans=0.125 2023-06-19 22:23:45,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.998e+02 3.695e+02 4.530e+02 8.664e+02, threshold=7.390e+02, percent-clipped=3.0 2023-06-19 22:24:02,599 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:24:02,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=566484.0, ans=0.125 2023-06-19 22:24:38,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=566544.0, ans=0.125 2023-06-19 22:24:42,824 INFO [train.py:996] (1/4) Epoch 4, batch 2950, loss[loss=0.2456, simple_loss=0.3208, pruned_loss=0.08518, over 21390.00 frames. ], tot_loss[loss=0.274, simple_loss=0.3388, pruned_loss=0.1046, over 4295004.68 frames. ], batch size: 131, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:24:49,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566604.0, ans=0.1 2023-06-19 22:25:32,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=566724.0, ans=0.125 2023-06-19 22:26:22,895 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:26:25,947 INFO [train.py:996] (1/4) Epoch 4, batch 3000, loss[loss=0.3255, simple_loss=0.3867, pruned_loss=0.1321, over 21714.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3431, pruned_loss=0.1051, over 4295685.84 frames. ], batch size: 332, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:26:25,947 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-19 22:26:43,395 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486, over 1796401.00 frames. 2023-06-19 22:26:43,395 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-19 22:27:29,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.065e+02 3.685e+02 4.308e+02 7.209e+02, threshold=7.369e+02, percent-clipped=0.0 2023-06-19 22:28:11,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=567144.0, ans=0.0 2023-06-19 22:28:14,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=567144.0, ans=0.125 2023-06-19 22:28:27,657 INFO [train.py:996] (1/4) Epoch 4, batch 3050, loss[loss=0.2448, simple_loss=0.313, pruned_loss=0.08825, over 21791.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3427, pruned_loss=0.1033, over 4295952.63 frames. 
], batch size: 247, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:28:58,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=567264.0, ans=0.125 2023-06-19 22:29:11,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567324.0, ans=0.1 2023-06-19 22:29:19,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=567324.0, ans=0.04949747468305833 2023-06-19 22:30:12,636 INFO [train.py:996] (1/4) Epoch 4, batch 3100, loss[loss=0.2162, simple_loss=0.3014, pruned_loss=0.0655, over 21550.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3401, pruned_loss=0.1012, over 4296465.68 frames. ], batch size: 212, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:30:36,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=567564.0, ans=0.2 2023-06-19 22:30:52,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 3.250e+02 3.985e+02 4.690e+02 7.522e+02, threshold=7.970e+02, percent-clipped=1.0 2023-06-19 22:31:03,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=567624.0, ans=0.125 2023-06-19 22:31:27,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=567684.0, ans=0.125 2023-06-19 22:31:27,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=567684.0, ans=0.2 2023-06-19 22:32:03,243 INFO [train.py:996] (1/4) Epoch 4, batch 3150, loss[loss=0.319, simple_loss=0.3635, pruned_loss=0.1372, over 21655.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3416, pruned_loss=0.101, over 4295474.47 frames. ], batch size: 471, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:32:08,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=567804.0, ans=0.035 2023-06-19 22:32:52,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=567924.0, ans=0.0 2023-06-19 22:33:00,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=567924.0, ans=0.125 2023-06-19 22:33:34,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=568044.0, ans=0.05 2023-06-19 22:33:48,448 INFO [train.py:996] (1/4) Epoch 4, batch 3200, loss[loss=0.2661, simple_loss=0.3409, pruned_loss=0.09567, over 21953.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3441, pruned_loss=0.1021, over 4286382.93 frames. ], batch size: 317, lr: 8.46e-03, grad_scale: 32.0 2023-06-19 22:34:34,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.110e+02 3.486e+02 4.566e+02 1.016e+03, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:34:39,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. 
limit=22.5 2023-06-19 22:35:15,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=568344.0, ans=0.125 2023-06-19 22:35:27,844 INFO [train.py:996] (1/4) Epoch 4, batch 3250, loss[loss=0.3025, simple_loss=0.3533, pruned_loss=0.1258, over 21506.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3433, pruned_loss=0.1035, over 4283260.42 frames. ], batch size: 389, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:35:52,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=568464.0, ans=0.09899494936611666 2023-06-19 22:36:00,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=568464.0, ans=0.0 2023-06-19 22:36:17,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=568524.0, ans=0.015 2023-06-19 22:36:39,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=568584.0, ans=0.1 2023-06-19 22:37:02,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=568644.0, ans=0.125 2023-06-19 22:37:12,285 INFO [train.py:996] (1/4) Epoch 4, batch 3300, loss[loss=0.243, simple_loss=0.3179, pruned_loss=0.08406, over 21210.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.338, pruned_loss=0.1033, over 4278351.97 frames. ], batch size: 176, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:37:34,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=568704.0, ans=0.0 2023-06-19 22:37:36,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=568704.0, ans=0.125 2023-06-19 22:37:51,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=568764.0, ans=0.2 2023-06-19 22:37:57,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.879e+02 3.455e+02 4.524e+02 7.307e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-19 22:38:07,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-19 22:38:26,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=568884.0, ans=0.125 2023-06-19 22:38:43,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=568944.0, ans=0.0 2023-06-19 22:38:55,624 INFO [train.py:996] (1/4) Epoch 4, batch 3350, loss[loss=0.2613, simple_loss=0.3221, pruned_loss=0.1002, over 21481.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3406, pruned_loss=0.1041, over 4276658.42 frames. 
], batch size: 194, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:39:58,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=569124.0, ans=0.0 2023-06-19 22:40:00,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=569124.0, ans=0.025 2023-06-19 22:40:15,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=569184.0, ans=0.5 2023-06-19 22:40:20,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-19 22:40:50,337 INFO [train.py:996] (1/4) Epoch 4, batch 3400, loss[loss=0.261, simple_loss=0.3455, pruned_loss=0.08821, over 20911.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3413, pruned_loss=0.1047, over 4286643.50 frames. ], batch size: 607, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:41:10,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=569364.0, ans=0.125 2023-06-19 22:41:36,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 3.071e+02 3.735e+02 4.641e+02 6.693e+02, threshold=7.470e+02, percent-clipped=0.0 2023-06-19 22:41:37,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569424.0, ans=0.125 2023-06-19 22:41:53,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.66 vs. limit=5.0 2023-06-19 22:42:29,553 INFO [train.py:996] (1/4) Epoch 4, batch 3450, loss[loss=0.2187, simple_loss=0.2771, pruned_loss=0.08017, over 21509.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3368, pruned_loss=0.1036, over 4278552.53 frames. ], batch size: 195, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:42:38,155 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:43:06,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=569664.0, ans=0.125 2023-06-19 22:43:14,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=569724.0, ans=0.125 2023-06-19 22:43:23,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=569724.0, ans=0.0 2023-06-19 22:43:34,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=569784.0, ans=0.125 2023-06-19 22:44:15,171 INFO [train.py:996] (1/4) Epoch 4, batch 3500, loss[loss=0.3025, simple_loss=0.3573, pruned_loss=0.1238, over 21360.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3441, pruned_loss=0.1074, over 4279495.95 frames. 
], batch size: 471, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:44:48,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=569964.0, ans=0.125 2023-06-19 22:45:03,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 3.084e+02 3.677e+02 4.361e+02 8.360e+02, threshold=7.354e+02, percent-clipped=5.0 2023-06-19 22:45:37,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-19 22:45:55,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=570144.0, ans=0.125 2023-06-19 22:45:57,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=570144.0, ans=0.125 2023-06-19 22:46:00,030 INFO [train.py:996] (1/4) Epoch 4, batch 3550, loss[loss=0.2805, simple_loss=0.348, pruned_loss=0.1065, over 19946.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3485, pruned_loss=0.1092, over 4277933.75 frames. ], batch size: 703, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:46:45,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-19 22:46:51,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-19 22:47:23,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-19 22:47:51,323 INFO [train.py:996] (1/4) Epoch 4, batch 3600, loss[loss=0.3166, simple_loss=0.3976, pruned_loss=0.1178, over 20726.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3432, pruned_loss=0.1082, over 4269821.12 frames. ], batch size: 607, lr: 8.44e-03, grad_scale: 32.0 2023-06-19 22:48:07,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-19 22:48:29,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.242e+02 3.839e+02 4.789e+02 9.292e+02, threshold=7.677e+02, percent-clipped=2.0 2023-06-19 22:49:04,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=570684.0, ans=0.0 2023-06-19 22:49:34,886 INFO [train.py:996] (1/4) Epoch 4, batch 3650, loss[loss=0.2541, simple_loss=0.3122, pruned_loss=0.09797, over 21250.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3446, pruned_loss=0.108, over 4264802.04 frames. ], batch size: 159, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:50:01,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-19 22:51:05,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=571044.0, ans=0.1 2023-06-19 22:51:14,657 INFO [train.py:996] (1/4) Epoch 4, batch 3700, loss[loss=0.2862, simple_loss=0.3838, pruned_loss=0.09427, over 19705.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.344, pruned_loss=0.1075, over 4278203.08 frames. 
], batch size: 702, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:51:51,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=571224.0, ans=0.125 2023-06-19 22:51:52,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.752e+02 3.200e+02 3.601e+02 6.077e+02, threshold=6.399e+02, percent-clipped=0.0 2023-06-19 22:52:12,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=571284.0, ans=0.0 2023-06-19 22:52:19,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=571284.0, ans=0.125 2023-06-19 22:52:42,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=571344.0, ans=0.1 2023-06-19 22:52:56,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-19 22:52:57,316 INFO [train.py:996] (1/4) Epoch 4, batch 3750, loss[loss=0.2335, simple_loss=0.3139, pruned_loss=0.0766, over 21771.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3424, pruned_loss=0.1068, over 4283153.58 frames. ], batch size: 351, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:53:54,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-19 22:54:29,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=571644.0, ans=0.0 2023-06-19 22:54:40,563 INFO [train.py:996] (1/4) Epoch 4, batch 3800, loss[loss=0.3247, simple_loss=0.3774, pruned_loss=0.136, over 21705.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3387, pruned_loss=0.1049, over 4279195.89 frames. ], batch size: 441, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:54:52,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=571704.0, ans=0.125 2023-06-19 22:55:20,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=571764.0, ans=0.125 2023-06-19 22:55:27,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.805e+02 3.314e+02 3.828e+02 7.886e+02, threshold=6.628e+02, percent-clipped=5.0 2023-06-19 22:56:23,665 INFO [train.py:996] (1/4) Epoch 4, batch 3850, loss[loss=0.2714, simple_loss=0.3195, pruned_loss=0.1117, over 21849.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3353, pruned_loss=0.1048, over 4271502.85 frames. ], batch size: 373, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:56:33,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572004.0, ans=0.1 2023-06-19 22:56:46,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=572064.0, ans=0.2 2023-06-19 22:58:06,833 INFO [train.py:996] (1/4) Epoch 4, batch 3900, loss[loss=0.2602, simple_loss=0.3448, pruned_loss=0.08785, over 20787.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3321, pruned_loss=0.1049, over 4274870.78 frames. 
], batch size: 608, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:58:54,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=572424.0, ans=0.0 2023-06-19 22:58:55,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.958e+02 3.677e+02 4.804e+02 9.279e+02, threshold=7.354e+02, percent-clipped=7.0 2023-06-19 22:59:05,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=572484.0, ans=0.125 2023-06-19 22:59:51,707 INFO [train.py:996] (1/4) Epoch 4, batch 3950, loss[loss=0.2136, simple_loss=0.2788, pruned_loss=0.07424, over 21994.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3333, pruned_loss=0.1041, over 4278111.67 frames. ], batch size: 103, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:00:05,420 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-19 23:01:34,281 INFO [train.py:996] (1/4) Epoch 4, batch 4000, loss[loss=0.2457, simple_loss=0.2964, pruned_loss=0.09754, over 21444.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3275, pruned_loss=0.1006, over 4276265.83 frames. ], batch size: 212, lr: 8.42e-03, grad_scale: 32.0 2023-06-19 23:02:22,415 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.603e+02 3.194e+02 3.964e+02 9.151e+02, threshold=6.387e+02, percent-clipped=1.0 2023-06-19 23:02:45,198 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:02:54,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=573144.0, ans=0.035 2023-06-19 23:03:03,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=573144.0, ans=0.2 2023-06-19 23:03:08,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=573144.0, ans=0.125 2023-06-19 23:03:18,124 INFO [train.py:996] (1/4) Epoch 4, batch 4050, loss[loss=0.2443, simple_loss=0.322, pruned_loss=0.08329, over 21740.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3267, pruned_loss=0.09833, over 4280615.64 frames. ], batch size: 247, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:04:25,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=573384.0, ans=0.125 2023-06-19 23:04:27,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=573384.0, ans=0.2 2023-06-19 23:04:37,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=573384.0, ans=10.0 2023-06-19 23:04:47,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=573444.0, ans=0.015 2023-06-19 23:04:57,137 INFO [train.py:996] (1/4) Epoch 4, batch 4100, loss[loss=0.2689, simple_loss=0.3272, pruned_loss=0.1053, over 21897.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.328, pruned_loss=0.09816, over 4283764.35 frames. 
], batch size: 316, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:05:04,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=573504.0, ans=0.1 2023-06-19 23:05:16,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=573504.0, ans=0.1 2023-06-19 23:05:46,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.845e+02 3.334e+02 4.002e+02 7.963e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-19 23:06:40,756 INFO [train.py:996] (1/4) Epoch 4, batch 4150, loss[loss=0.2465, simple_loss=0.3147, pruned_loss=0.08913, over 21317.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3269, pruned_loss=0.09452, over 4280936.47 frames. ], batch size: 131, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:07:22,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-19 23:07:28,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=573924.0, ans=0.04949747468305833 2023-06-19 23:07:58,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=573984.0, ans=0.125 2023-06-19 23:08:21,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=574044.0, ans=0.125 2023-06-19 23:08:25,537 INFO [train.py:996] (1/4) Epoch 4, batch 4200, loss[loss=0.2822, simple_loss=0.3562, pruned_loss=0.1041, over 21693.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3265, pruned_loss=0.09375, over 4281125.07 frames. ], batch size: 332, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:08:52,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=574164.0, ans=0.0 2023-06-19 23:09:26,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.689e+02 3.288e+02 4.795e+02 7.055e+02, threshold=6.577e+02, percent-clipped=3.0 2023-06-19 23:09:45,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-19 23:09:51,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-19 23:09:54,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=574344.0, ans=0.125 2023-06-19 23:09:54,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=574344.0, ans=0.125 2023-06-19 23:10:19,724 INFO [train.py:996] (1/4) Epoch 4, batch 4250, loss[loss=0.3346, simple_loss=0.3938, pruned_loss=0.1378, over 21721.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3337, pruned_loss=0.09662, over 4275714.03 frames. 
], batch size: 351, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:10:20,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=574404.0, ans=0.125 2023-06-19 23:10:28,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=574404.0, ans=0.125 2023-06-19 23:11:15,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=574524.0, ans=0.125 2023-06-19 23:11:18,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=574524.0, ans=0.125 2023-06-19 23:11:22,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=574584.0, ans=0.125 2023-06-19 23:11:59,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=574644.0, ans=0.0 2023-06-19 23:12:06,299 INFO [train.py:996] (1/4) Epoch 4, batch 4300, loss[loss=0.2732, simple_loss=0.3045, pruned_loss=0.121, over 20121.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3415, pruned_loss=0.1006, over 4270057.03 frames. ], batch size: 707, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:12:19,183 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-19 23:12:33,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-19 23:12:39,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=574764.0, ans=0.0 2023-06-19 23:12:57,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.886e+02 3.415e+02 4.755e+02 8.316e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 23:12:59,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=574824.0, ans=0.0 2023-06-19 23:13:37,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=574944.0, ans=0.125 2023-06-19 23:13:38,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=574944.0, ans=0.035 2023-06-19 23:14:00,193 INFO [train.py:996] (1/4) Epoch 4, batch 4350, loss[loss=0.318, simple_loss=0.3706, pruned_loss=0.1327, over 20780.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3398, pruned_loss=0.09953, over 4265293.42 frames. ], batch size: 611, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:14:02,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=575004.0, ans=0.05 2023-06-19 23:14:05,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=575004.0, ans=0.2 2023-06-19 23:15:20,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575244.0, ans=0.1 2023-06-19 23:15:40,573 INFO [train.py:996] (1/4) Epoch 4, batch 4400, loss[loss=0.2217, simple_loss=0.294, pruned_loss=0.07466, over 21074.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3375, pruned_loss=0.09935, over 4264355.10 frames. 
], batch size: 143, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:15:55,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=575364.0, ans=0.125 2023-06-19 23:16:22,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=575424.0, ans=0.5 2023-06-19 23:16:24,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-19 23:16:26,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.820e+02 3.325e+02 4.010e+02 7.079e+02, threshold=6.649e+02, percent-clipped=1.0 2023-06-19 23:16:41,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-19 23:16:50,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=575484.0, ans=0.0 2023-06-19 23:17:22,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=575544.0, ans=0.1 2023-06-19 23:17:25,256 INFO [train.py:996] (1/4) Epoch 4, batch 4450, loss[loss=0.2744, simple_loss=0.3542, pruned_loss=0.09733, over 21734.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3441, pruned_loss=0.1001, over 4266035.52 frames. ], batch size: 247, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:17:42,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-19 23:18:01,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=575724.0, ans=0.1 2023-06-19 23:18:55,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=575844.0, ans=0.2 2023-06-19 23:19:07,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=575904.0, ans=0.125 2023-06-19 23:19:08,157 INFO [train.py:996] (1/4) Epoch 4, batch 4500, loss[loss=0.3459, simple_loss=0.4087, pruned_loss=0.1415, over 21576.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3466, pruned_loss=0.1021, over 4275625.14 frames. ], batch size: 471, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:19:25,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=575964.0, ans=0.0 2023-06-19 23:19:51,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576024.0, ans=0.1 2023-06-19 23:20:01,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.953e+02 3.681e+02 4.394e+02 8.500e+02, threshold=7.362e+02, percent-clipped=5.0 2023-06-19 23:20:28,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=576084.0, ans=0.125 2023-06-19 23:20:32,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=576084.0, ans=0.125 2023-06-19 23:20:53,465 INFO [train.py:996] (1/4) Epoch 4, batch 4550, loss[loss=0.2962, simple_loss=0.3725, pruned_loss=0.1099, over 21486.00 frames. 
], tot_loss[loss=0.2783, simple_loss=0.3505, pruned_loss=0.1031, over 4276707.82 frames. ], batch size: 131, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:21:25,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=576264.0, ans=0.1 2023-06-19 23:22:17,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=576384.0, ans=0.2 2023-06-19 23:22:38,808 INFO [train.py:996] (1/4) Epoch 4, batch 4600, loss[loss=0.2417, simple_loss=0.3188, pruned_loss=0.08232, over 21885.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3515, pruned_loss=0.1049, over 4274211.82 frames. ], batch size: 107, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:23:22,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=576624.0, ans=0.0 2023-06-19 23:23:34,652 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.974e+02 3.353e+02 4.220e+02 8.842e+02, threshold=6.706e+02, percent-clipped=3.0 2023-06-19 23:24:03,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=576744.0, ans=0.0 2023-06-19 23:24:21,961 INFO [train.py:996] (1/4) Epoch 4, batch 4650, loss[loss=0.2765, simple_loss=0.3693, pruned_loss=0.0919, over 20984.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3441, pruned_loss=0.1024, over 4276318.94 frames. ], batch size: 608, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:24:29,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=576804.0, ans=0.2 2023-06-19 23:24:36,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-19 23:24:58,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=576924.0, ans=0.125 2023-06-19 23:25:12,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=576924.0, ans=0.125 2023-06-19 23:25:12,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=576924.0, ans=0.125 2023-06-19 23:25:27,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=576984.0, ans=0.035 2023-06-19 23:26:00,225 INFO [train.py:996] (1/4) Epoch 4, batch 4700, loss[loss=0.2363, simple_loss=0.3349, pruned_loss=0.0689, over 20832.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3363, pruned_loss=0.1005, over 4278442.91 frames. ], batch size: 608, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:26:06,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=577104.0, ans=0.125 2023-06-19 23:26:56,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.084e+02 3.825e+02 4.515e+02 8.128e+02, threshold=7.651e+02, percent-clipped=5.0 2023-06-19 23:27:29,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=577344.0, ans=0.0 2023-06-19 23:27:42,039 INFO [train.py:996] (1/4) Epoch 4, batch 4750, loss[loss=0.3109, simple_loss=0.3552, pruned_loss=0.1333, over 21871.00 frames. 
], tot_loss[loss=0.2648, simple_loss=0.3298, pruned_loss=0.09983, over 4280878.18 frames. ], batch size: 118, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:28:12,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-19 23:28:33,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=577524.0, ans=0.0 2023-06-19 23:28:41,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=577524.0, ans=0.1 2023-06-19 23:28:49,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=577584.0, ans=0.0 2023-06-19 23:29:15,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=577644.0, ans=0.95 2023-06-19 23:29:15,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-19 23:29:16,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=577644.0, ans=0.125 2023-06-19 23:29:27,849 INFO [train.py:996] (1/4) Epoch 4, batch 4800, loss[loss=0.2522, simple_loss=0.3098, pruned_loss=0.09727, over 21188.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.328, pruned_loss=0.09952, over 4290508.92 frames. ], batch size: 548, lr: 8.39e-03, grad_scale: 32.0 2023-06-19 23:29:41,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=577704.0, ans=10.0 2023-06-19 23:29:49,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=577764.0, ans=0.125 2023-06-19 23:29:54,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=577764.0, ans=0.125 2023-06-19 23:30:05,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=577764.0, ans=0.125 2023-06-19 23:30:08,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=577764.0, ans=0.2 2023-06-19 23:30:25,429 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.016e+02 3.604e+02 4.520e+02 9.140e+02, threshold=7.207e+02, percent-clipped=2.0 2023-06-19 23:30:30,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=577824.0, ans=0.125 2023-06-19 23:30:48,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=577944.0, ans=0.1 2023-06-19 23:31:11,085 INFO [train.py:996] (1/4) Epoch 4, batch 4850, loss[loss=0.2307, simple_loss=0.3018, pruned_loss=0.07976, over 21400.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3277, pruned_loss=0.09938, over 4289417.41 frames. 
], batch size: 131, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:31:35,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=578064.0, ans=0.125 2023-06-19 23:31:37,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=578064.0, ans=0.125 2023-06-19 23:31:39,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=578064.0, ans=0.125 2023-06-19 23:32:21,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=578184.0, ans=0.2 2023-06-19 23:32:26,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-19 23:32:27,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=578184.0, ans=0.125 2023-06-19 23:32:53,775 INFO [train.py:996] (1/4) Epoch 4, batch 4900, loss[loss=0.3085, simple_loss=0.4005, pruned_loss=0.1083, over 20922.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3303, pruned_loss=0.1016, over 4287676.38 frames. ], batch size: 608, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:33:45,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=578424.0, ans=0.2 2023-06-19 23:33:50,453 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 3.075e+02 3.679e+02 4.552e+02 8.349e+02, threshold=7.359e+02, percent-clipped=3.0 2023-06-19 23:33:59,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578484.0, ans=0.1 2023-06-19 23:34:37,040 INFO [train.py:996] (1/4) Epoch 4, batch 4950, loss[loss=0.236, simple_loss=0.328, pruned_loss=0.07194, over 21732.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3345, pruned_loss=0.09922, over 4288608.42 frames. ], batch size: 332, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:35:16,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=578664.0, ans=0.125 2023-06-19 23:35:54,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=578784.0, ans=0.05 2023-06-19 23:35:55,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=578784.0, ans=0.2 2023-06-19 23:36:19,051 INFO [train.py:996] (1/4) Epoch 4, batch 5000, loss[loss=0.258, simple_loss=0.3215, pruned_loss=0.09726, over 21373.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3331, pruned_loss=0.09532, over 4287607.54 frames. ], batch size: 159, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:36:45,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-19 23:36:56,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.53 vs. 
limit=15.0 2023-06-19 23:36:57,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=578964.0, ans=0.035 2023-06-19 23:36:59,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=578964.0, ans=0.05 2023-06-19 23:37:15,611 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.766e+02 3.352e+02 4.422e+02 7.725e+02, threshold=6.703e+02, percent-clipped=2.0 2023-06-19 23:37:21,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=579024.0, ans=0.04949747468305833 2023-06-19 23:37:46,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=579144.0, ans=0.125 2023-06-19 23:37:53,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=579144.0, ans=0.125 2023-06-19 23:38:01,087 INFO [train.py:996] (1/4) Epoch 4, batch 5050, loss[loss=0.2567, simple_loss=0.3235, pruned_loss=0.09499, over 21896.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3349, pruned_loss=0.09795, over 4291164.78 frames. ], batch size: 351, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:38:11,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=579204.0, ans=0.0 2023-06-19 23:38:11,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.60 vs. limit=10.0 2023-06-19 23:38:30,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=579264.0, ans=0.2 2023-06-19 23:39:28,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=579444.0, ans=0.125 2023-06-19 23:39:43,693 INFO [train.py:996] (1/4) Epoch 4, batch 5100, loss[loss=0.2543, simple_loss=0.3098, pruned_loss=0.09944, over 21854.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3322, pruned_loss=0.09862, over 4298361.76 frames. ], batch size: 118, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:39:45,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=579504.0, ans=0.2 2023-06-19 23:40:17,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-19 23:40:39,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.860e+02 3.323e+02 3.950e+02 6.797e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 23:41:14,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-19 23:41:26,660 INFO [train.py:996] (1/4) Epoch 4, batch 5150, loss[loss=0.2526, simple_loss=0.313, pruned_loss=0.09613, over 21861.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3309, pruned_loss=0.09934, over 4298189.58 frames. 
], batch size: 124, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:41:32,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=579804.0, ans=0.125 2023-06-19 23:42:16,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=579924.0, ans=0.125 2023-06-19 23:42:25,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.90 vs. limit=15.0 2023-06-19 23:42:35,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=579984.0, ans=0.125 2023-06-19 23:43:16,511 INFO [train.py:996] (1/4) Epoch 4, batch 5200, loss[loss=0.2877, simple_loss=0.3759, pruned_loss=0.09969, over 21752.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3305, pruned_loss=0.09871, over 4288124.28 frames. ], batch size: 351, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:43:20,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=580104.0, ans=0.125 2023-06-19 23:43:22,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=580104.0, ans=0.125 2023-06-19 23:43:49,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=580164.0, ans=0.125 2023-06-19 23:44:10,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.847e+02 3.708e+02 4.367e+02 7.934e+02, threshold=7.417e+02, percent-clipped=2.0 2023-06-19 23:44:16,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=580224.0, ans=0.07 2023-06-19 23:45:01,059 INFO [train.py:996] (1/4) Epoch 4, batch 5250, loss[loss=0.2738, simple_loss=0.3656, pruned_loss=0.09097, over 21269.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3354, pruned_loss=0.09701, over 4286358.70 frames. ], batch size: 548, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:45:16,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-19 23:45:26,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.10 vs. limit=15.0 2023-06-19 23:45:40,467 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:46:17,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=580584.0, ans=0.125 2023-06-19 23:46:28,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=580644.0, ans=0.1 2023-06-19 23:46:41,710 INFO [train.py:996] (1/4) Epoch 4, batch 5300, loss[loss=0.2442, simple_loss=0.3091, pruned_loss=0.08962, over 21616.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3371, pruned_loss=0.0988, over 4283385.86 frames. ], batch size: 263, lr: 8.36e-03, grad_scale: 32.0 2023-06-19 23:47:02,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. 
limit=15.0 2023-06-19 23:47:06,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-19 23:47:18,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=580764.0, ans=0.125 2023-06-19 23:47:26,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=580824.0, ans=22.5 2023-06-19 23:47:34,108 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.860e+02 3.383e+02 4.031e+02 8.552e+02, threshold=6.767e+02, percent-clipped=2.0 2023-06-19 23:48:04,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=580944.0, ans=0.2 2023-06-19 23:48:07,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=580944.0, ans=0.0 2023-06-19 23:48:23,140 INFO [train.py:996] (1/4) Epoch 4, batch 5350, loss[loss=0.2614, simple_loss=0.3165, pruned_loss=0.1031, over 21368.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3382, pruned_loss=0.1017, over 4284663.72 frames. ], batch size: 159, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:49:03,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=581064.0, ans=0.125 2023-06-19 23:49:13,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-19 23:49:41,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=581184.0, ans=0.0 2023-06-19 23:50:10,525 INFO [train.py:996] (1/4) Epoch 4, batch 5400, loss[loss=0.2402, simple_loss=0.3181, pruned_loss=0.08116, over 21782.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3357, pruned_loss=0.1023, over 4284669.74 frames. ], batch size: 371, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:50:53,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=581424.0, ans=0.2 2023-06-19 23:51:04,826 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.139e+02 3.601e+02 4.345e+02 9.321e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-19 23:51:10,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=581484.0, ans=0.125 2023-06-19 23:51:10,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-19 23:51:26,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=581544.0, ans=0.0 2023-06-19 23:51:27,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-19 23:51:48,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581544.0, ans=0.1 2023-06-19 23:51:54,848 INFO [train.py:996] (1/4) Epoch 4, batch 5450, loss[loss=0.2498, simple_loss=0.3222, pruned_loss=0.08864, over 21717.00 frames. 
], tot_loss[loss=0.2665, simple_loss=0.3337, pruned_loss=0.09966, over 4284949.28 frames. ], batch size: 112, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:53:15,873 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:53:30,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581844.0, ans=0.1 2023-06-19 23:53:43,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=581904.0, ans=0.0 2023-06-19 23:53:44,840 INFO [train.py:996] (1/4) Epoch 4, batch 5500, loss[loss=0.3559, simple_loss=0.4288, pruned_loss=0.1415, over 21526.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3383, pruned_loss=0.09578, over 4283368.72 frames. ], batch size: 471, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:53:51,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=581904.0, ans=0.125 2023-06-19 23:53:54,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=581904.0, ans=0.125 2023-06-19 23:54:30,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=582024.0, ans=0.125 2023-06-19 23:54:33,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.703e+02 3.148e+02 3.931e+02 6.952e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-19 23:55:07,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=582144.0, ans=0.2 2023-06-19 23:55:30,366 INFO [train.py:996] (1/4) Epoch 4, batch 5550, loss[loss=0.1926, simple_loss=0.2777, pruned_loss=0.05374, over 21590.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.336, pruned_loss=0.09184, over 4281222.00 frames. ], batch size: 230, lr: 8.35e-03, grad_scale: 16.0 2023-06-19 23:55:51,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=582264.0, ans=0.0 2023-06-19 23:55:59,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=582264.0, ans=0.125 2023-06-19 23:57:19,412 INFO [train.py:996] (1/4) Epoch 4, batch 5600, loss[loss=0.2066, simple_loss=0.2838, pruned_loss=0.06467, over 21447.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3353, pruned_loss=0.09025, over 4274355.88 frames. ], batch size: 194, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:58:12,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.737e+02 3.310e+02 4.006e+02 7.274e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-19 23:58:29,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=582684.0, ans=0.125 2023-06-19 23:58:36,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=582684.0, ans=0.125 2023-06-19 23:59:01,245 INFO [train.py:996] (1/4) Epoch 4, batch 5650, loss[loss=0.2924, simple_loss=0.3665, pruned_loss=0.1091, over 21756.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3399, pruned_loss=0.09284, over 4276193.51 frames. 
], batch size: 414, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:59:06,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582804.0, ans=0.1 2023-06-19 23:59:20,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.18 vs. limit=6.0 2023-06-19 23:59:21,474 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:59:52,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=582924.0, ans=0.125 2023-06-20 00:00:08,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=582984.0, ans=0.125 2023-06-20 00:00:13,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=582984.0, ans=0.0 2023-06-20 00:00:26,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=583044.0, ans=0.0 2023-06-20 00:00:43,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=583104.0, ans=0.125 2023-06-20 00:00:44,940 INFO [train.py:996] (1/4) Epoch 4, batch 5700, loss[loss=0.26, simple_loss=0.3245, pruned_loss=0.09775, over 21457.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3392, pruned_loss=0.095, over 4279591.40 frames. ], batch size: 194, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:00:49,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 00:01:06,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=583164.0, ans=0.0 2023-06-20 00:01:08,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=583164.0, ans=0.0 2023-06-20 00:01:38,504 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.072e+02 3.794e+02 4.480e+02 7.487e+02, threshold=7.588e+02, percent-clipped=5.0 2023-06-20 00:01:40,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=583224.0, ans=0.0 2023-06-20 00:02:02,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=583284.0, ans=0.0 2023-06-20 00:02:29,586 INFO [train.py:996] (1/4) Epoch 4, batch 5750, loss[loss=0.2255, simple_loss=0.3092, pruned_loss=0.07085, over 21664.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3361, pruned_loss=0.09162, over 4275906.78 frames. ], batch size: 247, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:02:43,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=583404.0, ans=0.035 2023-06-20 00:04:13,588 INFO [train.py:996] (1/4) Epoch 4, batch 5800, loss[loss=0.2691, simple_loss=0.3498, pruned_loss=0.09418, over 21458.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3344, pruned_loss=0.0901, over 4274407.52 frames. 
], batch size: 211, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:04:17,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=583704.0, ans=0.0 2023-06-20 00:04:21,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=583704.0, ans=0.2 2023-06-20 00:04:40,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=583764.0, ans=0.0 2023-06-20 00:05:02,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 2.603e+02 3.108e+02 3.966e+02 5.463e+02, threshold=6.216e+02, percent-clipped=0.0 2023-06-20 00:05:53,785 INFO [train.py:996] (1/4) Epoch 4, batch 5850, loss[loss=0.1965, simple_loss=0.2784, pruned_loss=0.05726, over 21296.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3302, pruned_loss=0.08636, over 4259664.77 frames. ], batch size: 159, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:06:05,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=584004.0, ans=0.125 2023-06-20 00:06:09,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=584004.0, ans=15.0 2023-06-20 00:06:33,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=584064.0, ans=0.125 2023-06-20 00:06:34,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=584064.0, ans=0.0 2023-06-20 00:06:52,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584184.0, ans=0.125 2023-06-20 00:07:36,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=15.0 2023-06-20 00:07:36,916 INFO [train.py:996] (1/4) Epoch 4, batch 5900, loss[loss=0.2339, simple_loss=0.3052, pruned_loss=0.08123, over 21399.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3219, pruned_loss=0.08052, over 4259392.21 frames. ], batch size: 131, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:08:01,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=584304.0, ans=0.125 2023-06-20 00:08:25,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=584424.0, ans=0.125 2023-06-20 00:08:26,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=584424.0, ans=0.0 2023-06-20 00:08:29,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 2.549e+02 3.049e+02 3.679e+02 6.495e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 00:08:32,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.63 vs. 
limit=15.0 2023-06-20 00:08:35,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584484.0, ans=0.1 2023-06-20 00:08:46,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=584484.0, ans=0.2 2023-06-20 00:08:54,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=584484.0, ans=0.2 2023-06-20 00:09:09,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584544.0, ans=0.125 2023-06-20 00:09:09,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.56 vs. limit=10.0 2023-06-20 00:09:28,638 INFO [train.py:996] (1/4) Epoch 4, batch 5950, loss[loss=0.2563, simple_loss=0.3101, pruned_loss=0.1013, over 21934.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3212, pruned_loss=0.08364, over 4262358.11 frames. ], batch size: 373, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:10:06,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=584724.0, ans=0.0 2023-06-20 00:10:16,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=584784.0, ans=0.1 2023-06-20 00:10:18,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=584784.0, ans=0.2 2023-06-20 00:11:04,354 INFO [train.py:996] (1/4) Epoch 4, batch 6000, loss[loss=0.2679, simple_loss=0.3138, pruned_loss=0.111, over 21693.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3178, pruned_loss=0.08742, over 4260691.53 frames. ], batch size: 124, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:11:04,354 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 00:11:16,129 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.4353, 2.2596, 3.5963, 3.7597], device='cuda:1') 2023-06-20 00:11:26,257 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2686, simple_loss=0.3646, pruned_loss=0.08628, over 1796401.00 frames. 2023-06-20 00:11:26,258 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 00:11:32,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=584904.0, ans=0.0 2023-06-20 00:11:48,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. 
limit=12.0 2023-06-20 00:12:19,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.849e+02 3.273e+02 3.960e+02 7.085e+02, threshold=6.546e+02, percent-clipped=4.0 2023-06-20 00:12:28,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=585084.0, ans=0.2 2023-06-20 00:12:31,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=585084.0, ans=0.125 2023-06-20 00:12:43,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=585084.0, ans=0.125 2023-06-20 00:13:09,979 INFO [train.py:996] (1/4) Epoch 4, batch 6050, loss[loss=0.2413, simple_loss=0.3062, pruned_loss=0.08824, over 21873.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3146, pruned_loss=0.08872, over 4269753.38 frames. ], batch size: 373, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:14:03,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=585324.0, ans=0.125 2023-06-20 00:14:08,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=585384.0, ans=0.125 2023-06-20 00:14:50,518 INFO [train.py:996] (1/4) Epoch 4, batch 6100, loss[loss=0.2271, simple_loss=0.3351, pruned_loss=0.05961, over 19747.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3145, pruned_loss=0.08747, over 4263210.41 frames. ], batch size: 703, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:14:57,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=585504.0, ans=0.125 2023-06-20 00:15:43,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.539e+02 3.084e+02 3.751e+02 6.044e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:15:43,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=585624.0, ans=0.125 2023-06-20 00:16:32,399 INFO [train.py:996] (1/4) Epoch 4, batch 6150, loss[loss=0.2632, simple_loss=0.3273, pruned_loss=0.09956, over 21749.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3173, pruned_loss=0.09072, over 4273380.42 frames. ], batch size: 316, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:16:48,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=585864.0, ans=0.125 2023-06-20 00:16:58,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=585864.0, ans=0.0 2023-06-20 00:17:57,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=586044.0, ans=0.125 2023-06-20 00:18:01,174 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:18:04,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=586044.0, ans=0.0 2023-06-20 00:18:14,441 INFO [train.py:996] (1/4) Epoch 4, batch 6200, loss[loss=0.2912, simple_loss=0.3941, pruned_loss=0.09413, over 21290.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3203, pruned_loss=0.09182, over 4281618.71 frames. 
], batch size: 548, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:18:33,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-20 00:18:34,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=586164.0, ans=0.125 2023-06-20 00:19:05,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=586224.0, ans=0.125 2023-06-20 00:19:08,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.711e+02 3.284e+02 3.994e+02 6.399e+02, threshold=6.568e+02, percent-clipped=2.0 2023-06-20 00:19:08,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=586224.0, ans=0.125 2023-06-20 00:19:36,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-20 00:19:47,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=586344.0, ans=0.125 2023-06-20 00:19:50,780 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:19:59,684 INFO [train.py:996] (1/4) Epoch 4, batch 6250, loss[loss=0.2484, simple_loss=0.3407, pruned_loss=0.07808, over 21600.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3248, pruned_loss=0.09112, over 4285363.21 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:20:02,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-20 00:21:35,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-20 00:21:43,499 INFO [train.py:996] (1/4) Epoch 4, batch 6300, loss[loss=0.2414, simple_loss=0.3504, pruned_loss=0.06618, over 20786.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3275, pruned_loss=0.08933, over 4288018.17 frames. ], batch size: 607, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:22:43,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=586824.0, ans=0.0 2023-06-20 00:22:45,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.688e+02 3.149e+02 3.968e+02 6.842e+02, threshold=6.299e+02, percent-clipped=2.0 2023-06-20 00:23:03,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586884.0, ans=0.1 2023-06-20 00:23:26,097 INFO [train.py:996] (1/4) Epoch 4, batch 6350, loss[loss=0.3763, simple_loss=0.4113, pruned_loss=0.1707, over 21345.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3333, pruned_loss=0.09482, over 4290224.48 frames. 
], batch size: 507, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:23:45,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587004.0, ans=0.125 2023-06-20 00:23:48,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587064.0, ans=0.1 2023-06-20 00:25:16,734 INFO [train.py:996] (1/4) Epoch 4, batch 6400, loss[loss=0.3053, simple_loss=0.3679, pruned_loss=0.1214, over 21424.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3421, pruned_loss=0.09931, over 4285141.93 frames. ], batch size: 159, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:25:37,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=587364.0, ans=0.125 2023-06-20 00:26:03,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=587424.0, ans=0.2 2023-06-20 00:26:11,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.295e+02 3.771e+02 4.525e+02 8.192e+02, threshold=7.543e+02, percent-clipped=2.0 2023-06-20 00:26:22,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=587484.0, ans=0.125 2023-06-20 00:26:34,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=587544.0, ans=10.0 2023-06-20 00:26:52,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=587544.0, ans=0.125 2023-06-20 00:27:04,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=587604.0, ans=0.125 2023-06-20 00:27:05,302 INFO [train.py:996] (1/4) Epoch 4, batch 6450, loss[loss=0.2402, simple_loss=0.3213, pruned_loss=0.07959, over 21376.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3439, pruned_loss=0.09979, over 4287650.77 frames. ], batch size: 211, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:27:27,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=587664.0, ans=0.125 2023-06-20 00:27:48,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=587724.0, ans=0.05 2023-06-20 00:27:50,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=587724.0, ans=0.025 2023-06-20 00:27:50,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=587724.0, ans=0.125 2023-06-20 00:27:56,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=587724.0, ans=0.2 2023-06-20 00:28:22,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=587844.0, ans=0.125 2023-06-20 00:28:48,562 INFO [train.py:996] (1/4) Epoch 4, batch 6500, loss[loss=0.3178, simple_loss=0.3714, pruned_loss=0.1321, over 21315.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3353, pruned_loss=0.09811, over 4285704.05 frames. 
], batch size: 471, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:28:52,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=12.0 2023-06-20 00:28:53,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=587904.0, ans=0.0 2023-06-20 00:28:55,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=587904.0, ans=0.1 2023-06-20 00:28:56,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=587904.0, ans=0.2 2023-06-20 00:29:17,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-20 00:29:26,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=588024.0, ans=0.0 2023-06-20 00:29:29,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=588024.0, ans=0.0 2023-06-20 00:29:31,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=588024.0, ans=0.2 2023-06-20 00:29:35,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.671e+02 3.231e+02 3.777e+02 5.375e+02, threshold=6.462e+02, percent-clipped=0.0 2023-06-20 00:30:04,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=588144.0, ans=0.0 2023-06-20 00:30:30,232 INFO [train.py:996] (1/4) Epoch 4, batch 6550, loss[loss=0.2773, simple_loss=0.3362, pruned_loss=0.1092, over 21840.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3343, pruned_loss=0.09699, over 4292646.15 frames. ], batch size: 351, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:30:32,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=588204.0, ans=0.2 2023-06-20 00:30:44,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=588204.0, ans=0.0 2023-06-20 00:30:55,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=588264.0, ans=0.0 2023-06-20 00:30:57,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=22.5 2023-06-20 00:31:11,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588324.0, ans=0.1 2023-06-20 00:31:32,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=588384.0, ans=0.0 2023-06-20 00:32:11,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=588504.0, ans=0.0 2023-06-20 00:32:13,055 INFO [train.py:996] (1/4) Epoch 4, batch 6600, loss[loss=0.2895, simple_loss=0.3809, pruned_loss=0.09905, over 19865.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3286, pruned_loss=0.0968, over 4286734.01 frames. 
], batch size: 703, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:32:18,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=588504.0, ans=0.125 2023-06-20 00:32:40,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-20 00:32:40,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-20 00:32:53,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=588624.0, ans=0.1 2023-06-20 00:33:01,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.719e+02 3.222e+02 3.782e+02 6.837e+02, threshold=6.444e+02, percent-clipped=2.0 2023-06-20 00:33:02,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=588624.0, ans=0.0 2023-06-20 00:33:48,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=588744.0, ans=0.125 2023-06-20 00:33:54,724 INFO [train.py:996] (1/4) Epoch 4, batch 6650, loss[loss=0.2366, simple_loss=0.2912, pruned_loss=0.09097, over 21487.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3227, pruned_loss=0.09362, over 4288393.56 frames. ], batch size: 212, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:33:55,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=588804.0, ans=0.0 2023-06-20 00:34:04,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=588804.0, ans=0.125 2023-06-20 00:35:37,426 INFO [train.py:996] (1/4) Epoch 4, batch 6700, loss[loss=0.2123, simple_loss=0.3472, pruned_loss=0.0387, over 20795.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3183, pruned_loss=0.09257, over 4284223.19 frames. ], batch size: 607, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:35:49,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589104.0, ans=0.1 2023-06-20 00:35:57,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=589164.0, ans=0.2 2023-06-20 00:36:26,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.795e+02 3.323e+02 4.034e+02 6.039e+02, threshold=6.647e+02, percent-clipped=0.0 2023-06-20 00:37:18,640 INFO [train.py:996] (1/4) Epoch 4, batch 6750, loss[loss=0.2441, simple_loss=0.3022, pruned_loss=0.09302, over 21547.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3164, pruned_loss=0.09257, over 4288703.55 frames. ], batch size: 212, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:37:43,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=589464.0, ans=0.5 2023-06-20 00:38:54,402 INFO [train.py:996] (1/4) Epoch 4, batch 6800, loss[loss=0.2196, simple_loss=0.2748, pruned_loss=0.08222, over 21639.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3178, pruned_loss=0.09527, over 4295943.18 frames. 
], batch size: 247, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:38:56,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=589704.0, ans=0.0 2023-06-20 00:39:01,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=589704.0, ans=0.09899494936611666 2023-06-20 00:39:27,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=589764.0, ans=0.125 2023-06-20 00:39:41,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-20 00:39:43,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.844e+02 3.168e+02 3.952e+02 7.008e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-20 00:39:52,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.06 vs. limit=15.0 2023-06-20 00:40:35,906 INFO [train.py:996] (1/4) Epoch 4, batch 6850, loss[loss=0.2898, simple_loss=0.3329, pruned_loss=0.1234, over 21357.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3169, pruned_loss=0.09817, over 4291641.13 frames. ], batch size: 144, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:40:39,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=590004.0, ans=0.125 2023-06-20 00:41:37,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=590184.0, ans=0.0 2023-06-20 00:41:44,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-20 00:42:20,510 INFO [train.py:996] (1/4) Epoch 4, batch 6900, loss[loss=0.3307, simple_loss=0.3716, pruned_loss=0.1449, over 21725.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3187, pruned_loss=0.09882, over 4292455.52 frames. ], batch size: 508, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:43:12,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=590424.0, ans=0.0 2023-06-20 00:43:22,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.145e+02 3.688e+02 5.056e+02 7.443e+02, threshold=7.376e+02, percent-clipped=5.0 2023-06-20 00:43:26,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590484.0, ans=0.1 2023-06-20 00:44:03,274 INFO [train.py:996] (1/4) Epoch 4, batch 6950, loss[loss=0.2597, simple_loss=0.3297, pruned_loss=0.09483, over 21826.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3193, pruned_loss=0.095, over 4293423.53 frames. 
], batch size: 282, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:44:32,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=590664.0, ans=0.2 2023-06-20 00:44:35,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590664.0, ans=0.1 2023-06-20 00:45:11,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=590784.0, ans=0.0 2023-06-20 00:45:32,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=590844.0, ans=0.125 2023-06-20 00:45:50,573 INFO [train.py:996] (1/4) Epoch 4, batch 7000, loss[loss=0.2977, simple_loss=0.3764, pruned_loss=0.1095, over 21009.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3217, pruned_loss=0.0973, over 4291247.82 frames. ], batch size: 608, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:46:15,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590964.0, ans=0.1 2023-06-20 00:46:46,589 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.062e+02 3.465e+02 4.392e+02 8.171e+02, threshold=6.929e+02, percent-clipped=2.0 2023-06-20 00:47:33,152 INFO [train.py:996] (1/4) Epoch 4, batch 7050, loss[loss=0.2453, simple_loss=0.3398, pruned_loss=0.07542, over 21213.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3187, pruned_loss=0.09496, over 4292685.56 frames. ], batch size: 548, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:48:18,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=591324.0, ans=0.0 2023-06-20 00:48:40,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=591384.0, ans=0.05 2023-06-20 00:49:11,733 INFO [train.py:996] (1/4) Epoch 4, batch 7100, loss[loss=0.2032, simple_loss=0.2758, pruned_loss=0.06532, over 21263.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3248, pruned_loss=0.09721, over 4287898.49 frames. ], batch size: 159, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:49:30,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=591564.0, ans=0.125 2023-06-20 00:49:47,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=591564.0, ans=0.0 2023-06-20 00:50:07,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.326e+02 4.324e+02 6.991e+02, threshold=6.652e+02, percent-clipped=1.0 2023-06-20 00:50:53,173 INFO [train.py:996] (1/4) Epoch 4, batch 7150, loss[loss=0.2634, simple_loss=0.3269, pruned_loss=0.09996, over 21423.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3232, pruned_loss=0.09492, over 4280862.44 frames. 
], batch size: 211, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:51:08,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=591864.0, ans=0.125 2023-06-20 00:52:10,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=591984.0, ans=0.0 2023-06-20 00:52:30,753 INFO [train.py:996] (1/4) Epoch 4, batch 7200, loss[loss=0.2665, simple_loss=0.321, pruned_loss=0.106, over 21731.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3268, pruned_loss=0.09814, over 4273767.34 frames. ], batch size: 351, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:52:31,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-20 00:52:56,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592164.0, ans=0.1 2023-06-20 00:53:31,847 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.719e+02 3.108e+02 3.931e+02 6.174e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 00:53:46,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.16 vs. limit=6.0 2023-06-20 00:54:12,850 INFO [train.py:996] (1/4) Epoch 4, batch 7250, loss[loss=0.2341, simple_loss=0.2875, pruned_loss=0.09033, over 21592.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3234, pruned_loss=0.09764, over 4275738.15 frames. ], batch size: 415, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:54:34,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=592404.0, ans=0.2 2023-06-20 00:54:36,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=592404.0, ans=0.0 2023-06-20 00:54:50,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-06-20 00:55:29,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=592584.0, ans=0.125 2023-06-20 00:55:46,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=592644.0, ans=0.0 2023-06-20 00:55:51,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=592644.0, ans=0.125 2023-06-20 00:55:55,952 INFO [train.py:996] (1/4) Epoch 4, batch 7300, loss[loss=0.2031, simple_loss=0.2792, pruned_loss=0.06349, over 20825.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3172, pruned_loss=0.0954, over 4275665.23 frames. 
], batch size: 607, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:56:06,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=592704.0, ans=0.0 2023-06-20 00:56:37,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=592764.0, ans=0.0 2023-06-20 00:56:52,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=592824.0, ans=0.125 2023-06-20 00:56:52,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=592824.0, ans=0.125 2023-06-20 00:56:58,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.938e+02 3.597e+02 4.532e+02 8.618e+02, threshold=7.193e+02, percent-clipped=4.0 2023-06-20 00:57:38,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=592944.0, ans=0.07 2023-06-20 00:57:45,775 INFO [train.py:996] (1/4) Epoch 4, batch 7350, loss[loss=0.2633, simple_loss=0.3187, pruned_loss=0.104, over 21735.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3144, pruned_loss=0.09619, over 4270053.47 frames. ], batch size: 282, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:59:31,732 INFO [train.py:996] (1/4) Epoch 4, batch 7400, loss[loss=0.2476, simple_loss=0.2988, pruned_loss=0.09818, over 21138.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3216, pruned_loss=0.09826, over 4266484.38 frames. ], batch size: 143, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:59:32,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-20 00:59:53,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=593364.0, ans=0.125 2023-06-20 01:00:28,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.006e+02 3.626e+02 4.126e+02 7.462e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-20 01:00:38,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=593484.0, ans=0.1 2023-06-20 01:00:58,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=593544.0, ans=0.0 2023-06-20 01:01:00,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=593544.0, ans=0.125 2023-06-20 01:01:05,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=593544.0, ans=0.2 2023-06-20 01:01:15,452 INFO [train.py:996] (1/4) Epoch 4, batch 7450, loss[loss=0.2125, simple_loss=0.2736, pruned_loss=0.07577, over 21678.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3202, pruned_loss=0.09738, over 4274783.53 frames. 
], batch size: 282, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:02:09,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=593724.0, ans=0.125 2023-06-20 01:02:17,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=593784.0, ans=0.125 2023-06-20 01:02:32,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593784.0, ans=0.1 2023-06-20 01:03:05,763 INFO [train.py:996] (1/4) Epoch 4, batch 7500, loss[loss=0.2735, simple_loss=0.3488, pruned_loss=0.09908, over 21755.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3251, pruned_loss=0.09894, over 4271044.32 frames. ], batch size: 124, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:03:25,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593904.0, ans=0.1 2023-06-20 01:03:35,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=593964.0, ans=0.125 2023-06-20 01:04:09,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.163e+02 3.635e+02 4.766e+02 7.864e+02, threshold=7.270e+02, percent-clipped=2.0 2023-06-20 01:04:19,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.44 vs. limit=6.0 2023-06-20 01:04:43,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=594144.0, ans=0.0 2023-06-20 01:04:51,492 INFO [train.py:996] (1/4) Epoch 4, batch 7550, loss[loss=0.3411, simple_loss=0.4079, pruned_loss=0.1372, over 21452.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3331, pruned_loss=0.09806, over 4268088.19 frames. ], batch size: 507, lr: 8.27e-03, grad_scale: 16.0 2023-06-20 01:05:18,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=594264.0, ans=0.125 2023-06-20 01:05:26,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=594264.0, ans=0.09899494936611666 2023-06-20 01:05:40,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594324.0, ans=0.1 2023-06-20 01:06:33,659 INFO [train.py:996] (1/4) Epoch 4, batch 7600, loss[loss=0.2341, simple_loss=0.3106, pruned_loss=0.07882, over 21696.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3319, pruned_loss=0.09704, over 4279520.65 frames. ], batch size: 263, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:07:27,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.700e+02 3.107e+02 3.746e+02 5.626e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-20 01:08:11,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594744.0, ans=0.1 2023-06-20 01:08:17,604 INFO [train.py:996] (1/4) Epoch 4, batch 7650, loss[loss=0.2872, simple_loss=0.3275, pruned_loss=0.1234, over 20109.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3299, pruned_loss=0.09905, over 4285545.89 frames. 
], batch size: 703, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:08:27,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=594804.0, ans=0.125 2023-06-20 01:08:52,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=594864.0, ans=0.125 2023-06-20 01:09:10,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=594924.0, ans=0.025 2023-06-20 01:09:25,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=594984.0, ans=0.125 2023-06-20 01:09:26,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-20 01:10:03,422 INFO [train.py:996] (1/4) Epoch 4, batch 7700, loss[loss=0.3092, simple_loss=0.3678, pruned_loss=0.1253, over 21842.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3335, pruned_loss=0.102, over 4284463.63 frames. ], batch size: 441, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:10:03,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=595104.0, ans=0.125 2023-06-20 01:10:16,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-20 01:11:08,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.825e+02 3.572e+02 4.383e+02 7.085e+02, threshold=7.144e+02, percent-clipped=3.0 2023-06-20 01:11:33,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=595344.0, ans=0.07 2023-06-20 01:11:42,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=595344.0, ans=0.125 2023-06-20 01:11:54,612 INFO [train.py:996] (1/4) Epoch 4, batch 7750, loss[loss=0.2303, simple_loss=0.3124, pruned_loss=0.07405, over 21278.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.337, pruned_loss=0.1018, over 4272381.39 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:12:11,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=22.5 2023-06-20 01:13:03,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=595584.0, ans=0.05 2023-06-20 01:13:40,999 INFO [train.py:996] (1/4) Epoch 4, batch 7800, loss[loss=0.191, simple_loss=0.2411, pruned_loss=0.07043, over 21722.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.339, pruned_loss=0.1027, over 4274608.69 frames. 
], batch size: 124, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:13:52,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=595704.0, ans=0.125 2023-06-20 01:13:59,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=595764.0, ans=0.125 2023-06-20 01:14:08,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=595764.0, ans=0.125 2023-06-20 01:14:18,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=595764.0, ans=0.125 2023-06-20 01:14:44,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.079e+02 3.630e+02 4.586e+02 7.709e+02, threshold=7.261e+02, percent-clipped=1.0 2023-06-20 01:14:46,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=595884.0, ans=0.125 2023-06-20 01:14:53,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-20 01:15:17,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-20 01:15:24,366 INFO [train.py:996] (1/4) Epoch 4, batch 7850, loss[loss=0.2498, simple_loss=0.309, pruned_loss=0.09533, over 21885.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3341, pruned_loss=0.1021, over 4277895.32 frames. ], batch size: 107, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:15:39,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=596064.0, ans=0.1 2023-06-20 01:16:00,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=596064.0, ans=0.125 2023-06-20 01:16:00,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-20 01:17:10,713 INFO [train.py:996] (1/4) Epoch 4, batch 7900, loss[loss=0.2922, simple_loss=0.3558, pruned_loss=0.1143, over 20630.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3296, pruned_loss=0.1001, over 4277138.77 frames. ], batch size: 607, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:17:31,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=596364.0, ans=0.125 2023-06-20 01:18:14,883 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.214e+02 3.696e+02 4.914e+02 8.338e+02, threshold=7.393e+02, percent-clipped=4.0 2023-06-20 01:18:27,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=596484.0, ans=0.0 2023-06-20 01:18:34,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=596484.0, ans=0.1 2023-06-20 01:18:56,155 INFO [train.py:996] (1/4) Epoch 4, batch 7950, loss[loss=0.3084, simple_loss=0.3761, pruned_loss=0.1204, over 21784.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3341, pruned_loss=0.1001, over 4267951.76 frames. 
], batch size: 441, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:19:43,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=596724.0, ans=0.125 2023-06-20 01:19:49,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=596724.0, ans=0.2 2023-06-20 01:19:54,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-20 01:20:17,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=596784.0, ans=0.0 2023-06-20 01:20:25,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=596844.0, ans=0.05 2023-06-20 01:20:50,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=596844.0, ans=0.0 2023-06-20 01:20:53,154 INFO [train.py:996] (1/4) Epoch 4, batch 8000, loss[loss=0.228, simple_loss=0.2821, pruned_loss=0.08698, over 20724.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3383, pruned_loss=0.1027, over 4256759.67 frames. ], batch size: 609, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:21:08,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=596904.0, ans=0.1 2023-06-20 01:21:19,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=596964.0, ans=0.125 2023-06-20 01:21:19,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=596964.0, ans=0.025 2023-06-20 01:21:55,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.970e+02 3.290e+02 4.047e+02 5.946e+02, threshold=6.580e+02, percent-clipped=0.0 2023-06-20 01:22:25,058 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:22:46,665 INFO [train.py:996] (1/4) Epoch 4, batch 8050, loss[loss=0.2827, simple_loss=0.3598, pruned_loss=0.1028, over 21874.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.339, pruned_loss=0.1014, over 4253792.58 frames. ], batch size: 372, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:23:25,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=597324.0, ans=0.0 2023-06-20 01:24:32,704 INFO [train.py:996] (1/4) Epoch 4, batch 8100, loss[loss=0.2521, simple_loss=0.3123, pruned_loss=0.09595, over 21238.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3368, pruned_loss=0.1017, over 4263545.92 frames. 
], batch size: 143, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:24:43,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=597504.0, ans=0.125 2023-06-20 01:25:24,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597624.0, ans=0.1 2023-06-20 01:25:36,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=597624.0, ans=0.125 2023-06-20 01:25:39,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.127e+02 3.761e+02 5.016e+02 1.103e+03, threshold=7.523e+02, percent-clipped=9.0 2023-06-20 01:25:55,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=597684.0, ans=0.0 2023-06-20 01:26:20,076 INFO [train.py:996] (1/4) Epoch 4, batch 8150, loss[loss=0.2439, simple_loss=0.3168, pruned_loss=0.08543, over 21568.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.344, pruned_loss=0.1027, over 4257976.35 frames. ], batch size: 230, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:26:20,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=597804.0, ans=0.2 2023-06-20 01:26:21,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=597804.0, ans=0.125 2023-06-20 01:27:08,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=597924.0, ans=0.04949747468305833 2023-06-20 01:28:10,453 INFO [train.py:996] (1/4) Epoch 4, batch 8200, loss[loss=0.2368, simple_loss=0.2931, pruned_loss=0.09027, over 21548.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3362, pruned_loss=0.09981, over 4255270.48 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:28:22,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=598104.0, ans=0.125 2023-06-20 01:28:26,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=15.0 2023-06-20 01:28:33,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598164.0, ans=0.1 2023-06-20 01:29:13,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.961e+02 3.420e+02 4.432e+02 7.003e+02, threshold=6.840e+02, percent-clipped=0.0 2023-06-20 01:29:20,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=598284.0, ans=0.07 2023-06-20 01:29:27,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=598284.0, ans=0.1 2023-06-20 01:29:53,761 INFO [train.py:996] (1/4) Epoch 4, batch 8250, loss[loss=0.2763, simple_loss=0.3642, pruned_loss=0.0942, over 21818.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3365, pruned_loss=0.09977, over 4251462.82 frames. 
], batch size: 371, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:29:59,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=598404.0, ans=0.2 2023-06-20 01:30:47,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=598524.0, ans=0.04949747468305833 2023-06-20 01:30:56,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-20 01:31:01,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-20 01:31:38,804 INFO [train.py:996] (1/4) Epoch 4, batch 8300, loss[loss=0.2454, simple_loss=0.3231, pruned_loss=0.08389, over 21768.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3344, pruned_loss=0.09688, over 4263260.96 frames. ], batch size: 316, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:31:51,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=598704.0, ans=0.125 2023-06-20 01:32:34,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-20 01:32:37,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=598824.0, ans=0.125 2023-06-20 01:32:45,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.814e+02 3.368e+02 3.938e+02 8.477e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-20 01:32:47,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=598884.0, ans=0.125 2023-06-20 01:33:07,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-20 01:33:20,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=598944.0, ans=0.0 2023-06-20 01:33:23,473 INFO [train.py:996] (1/4) Epoch 4, batch 8350, loss[loss=0.2321, simple_loss=0.3094, pruned_loss=0.07737, over 21640.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3311, pruned_loss=0.09466, over 4259159.18 frames. ], batch size: 298, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:33:39,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-20 01:33:53,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-20 01:34:39,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=599184.0, ans=0.0 2023-06-20 01:34:47,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=599244.0, ans=0.125 2023-06-20 01:34:55,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=599244.0, ans=0.5 2023-06-20 01:35:08,064 INFO [train.py:996] (1/4) Epoch 4, batch 8400, loss[loss=0.2071, simple_loss=0.2956, pruned_loss=0.05931, over 21688.00 frames. 
], tot_loss[loss=0.2573, simple_loss=0.329, pruned_loss=0.09285, over 4261490.56 frames. ], batch size: 247, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:35:31,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=599304.0, ans=0.125 2023-06-20 01:36:06,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=599424.0, ans=0.5 2023-06-20 01:36:14,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.656e+02 3.140e+02 3.908e+02 6.671e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-20 01:36:41,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=599544.0, ans=0.125 2023-06-20 01:36:50,915 INFO [train.py:996] (1/4) Epoch 4, batch 8450, loss[loss=0.2279, simple_loss=0.2958, pruned_loss=0.08003, over 21817.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3278, pruned_loss=0.0924, over 4268524.26 frames. ], batch size: 298, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:37:04,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=599604.0, ans=0.2 2023-06-20 01:37:20,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=599664.0, ans=0.0 2023-06-20 01:37:58,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-20 01:38:16,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=599844.0, ans=0.0 2023-06-20 01:38:28,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=15.0 2023-06-20 01:38:31,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=599844.0, ans=0.07 2023-06-20 01:38:33,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=599904.0, ans=0.125 2023-06-20 01:38:34,732 INFO [train.py:996] (1/4) Epoch 4, batch 8500, loss[loss=0.2689, simple_loss=0.3212, pruned_loss=0.1083, over 21382.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3249, pruned_loss=0.09352, over 4266681.11 frames. ], batch size: 194, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:39:33,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=600024.0, ans=0.125 2023-06-20 01:39:36,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=600024.0, ans=0.0 2023-06-20 01:39:44,103 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.032e+02 3.480e+02 4.088e+02 6.738e+02, threshold=6.960e+02, percent-clipped=1.0 2023-06-20 01:39:55,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.71 vs. 
limit=6.0 2023-06-20 01:40:19,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=600144.0, ans=0.0 2023-06-20 01:40:21,741 INFO [train.py:996] (1/4) Epoch 4, batch 8550, loss[loss=0.2846, simple_loss=0.3638, pruned_loss=0.1027, over 21816.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3285, pruned_loss=0.09616, over 4256287.94 frames. ], batch size: 282, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:41:21,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=600324.0, ans=0.125 2023-06-20 01:42:17,776 INFO [train.py:996] (1/4) Epoch 4, batch 8600, loss[loss=0.3495, simple_loss=0.4451, pruned_loss=0.1269, over 19843.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.339, pruned_loss=0.09996, over 4256113.75 frames. ], batch size: 702, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:42:35,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-20 01:43:05,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=600624.0, ans=0.0 2023-06-20 01:43:11,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-20 01:43:15,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.082e+02 3.829e+02 4.661e+02 1.059e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-20 01:43:29,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=600684.0, ans=0.0 2023-06-20 01:43:56,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=600744.0, ans=0.125 2023-06-20 01:44:00,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=600804.0, ans=0.125 2023-06-20 01:44:06,479 INFO [train.py:996] (1/4) Epoch 4, batch 8650, loss[loss=0.2129, simple_loss=0.2876, pruned_loss=0.06914, over 21814.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3443, pruned_loss=0.1005, over 4261519.80 frames. ], batch size: 107, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:44:08,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=600804.0, ans=0.04949747468305833 2023-06-20 01:44:08,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600804.0, ans=0.125 2023-06-20 01:44:08,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=600804.0, ans=0.0 2023-06-20 01:44:13,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=600804.0, ans=0.2 2023-06-20 01:44:27,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-20 01:44:51,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.62 vs. 
limit=10.0 2023-06-20 01:45:14,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600984.0, ans=0.125 2023-06-20 01:45:17,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.84 vs. limit=15.0 2023-06-20 01:45:28,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=601044.0, ans=0.05 2023-06-20 01:45:44,857 INFO [train.py:996] (1/4) Epoch 4, batch 8700, loss[loss=0.2209, simple_loss=0.2818, pruned_loss=0.07996, over 21657.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3357, pruned_loss=0.09624, over 4265128.92 frames. ], batch size: 264, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:45:58,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=601104.0, ans=0.125 2023-06-20 01:46:18,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.11 vs. limit=10.0 2023-06-20 01:46:31,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=8.0 2023-06-20 01:46:36,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=601224.0, ans=0.09899494936611666 2023-06-20 01:46:42,031 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.831e+02 3.437e+02 4.356e+02 1.035e+03, threshold=6.874e+02, percent-clipped=3.0 2023-06-20 01:46:49,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=601284.0, ans=0.0 2023-06-20 01:46:51,105 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:47:35,546 INFO [train.py:996] (1/4) Epoch 4, batch 8750, loss[loss=0.2389, simple_loss=0.2783, pruned_loss=0.09976, over 20336.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3332, pruned_loss=0.09819, over 4272761.56 frames. ], batch size: 703, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:47:43,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.83 vs. limit=6.0 2023-06-20 01:48:28,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=601524.0, ans=0.0 2023-06-20 01:49:11,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=601644.0, ans=0.0 2023-06-20 01:49:18,720 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:49:20,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=601644.0, ans=0.1 2023-06-20 01:49:22,810 INFO [train.py:996] (1/4) Epoch 4, batch 8800, loss[loss=0.4102, simple_loss=0.4546, pruned_loss=0.1829, over 21372.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.342, pruned_loss=0.1023, over 4281160.73 frames. 
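About the Whitening records: each reports a `metric` for a named activation, compared against a `limit`. One plausible measure of how far activations are from "white" is the ratio of the mean squared eigenvalue of the channel covariance to the square of its mean eigenvalue; it equals 1 for an isotropic covariance and grows as energy concentrates in a few directions. The sketch below computes that ratio; whether scaling.py uses exactly this formula is an assumption.

```python
# Hedged sketch of a "whitening" metric like the ones logged (metric=... vs.
# limit=...).  The ratio below equals 1.0 when the channel covariance is
# isotropic (perfectly "white") and grows as energy concentrates in a few
# directions.  Whether scaling.py uses exactly this ratio is an assumption;
# the point is to show what a metric-vs-limit check can look like.
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    # x: (num_frames, num_channels) activations
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    cov = torch.einsum("ngc,ngd->gcd", x, x) / num_frames  # per-group covariance
    c = cov.shape[-1]
    # mean(eigval^2) / mean(eigval)^2 == C * ||cov||_F^2 / trace(cov)^2
    num = c * (cov ** 2).sum(dim=(1, 2))
    den = torch.diagonal(cov, dim1=1, dim2=2).sum(dim=1) ** 2
    return (num / den.clamp(min=1e-20)).mean()

x = torch.randn(1000, 256)                    # toy activations
metric = whitening_metric(x, num_groups=1)
print(f"metric={metric:.2f} vs. limit=15.0")  # close to 1 for white noise
# a module would typically add a penalty to the loss only when metric > limit
```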
], batch size: 507, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:50:20,619 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.907e+02 3.416e+02 4.301e+02 7.142e+02, threshold=6.833e+02, percent-clipped=3.0 2023-06-20 01:50:53,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=601944.0, ans=0.2 2023-06-20 01:51:03,463 INFO [train.py:996] (1/4) Epoch 4, batch 8850, loss[loss=0.2616, simple_loss=0.3255, pruned_loss=0.09885, over 21547.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3506, pruned_loss=0.1055, over 4283260.98 frames. ], batch size: 263, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:51:34,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-20 01:51:44,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-20 01:51:49,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-20 01:51:54,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=602124.0, ans=0.125 2023-06-20 01:52:27,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-20 01:52:44,263 INFO [train.py:996] (1/4) Epoch 4, batch 8900, loss[loss=0.3682, simple_loss=0.4822, pruned_loss=0.1271, over 19801.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3475, pruned_loss=0.1043, over 4274273.47 frames. ], batch size: 702, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:52:53,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602304.0, ans=0.1 2023-06-20 01:52:58,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=602304.0, ans=0.0 2023-06-20 01:53:07,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=602364.0, ans=0.0 2023-06-20 01:53:17,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-20 01:53:18,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=602364.0, ans=0.0 2023-06-20 01:53:53,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=602424.0, ans=0.125 2023-06-20 01:54:00,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.914e+02 3.498e+02 4.067e+02 9.619e+02, threshold=6.997e+02, percent-clipped=2.0 2023-06-20 01:54:31,850 INFO [train.py:996] (1/4) Epoch 4, batch 8950, loss[loss=0.285, simple_loss=0.3462, pruned_loss=0.1119, over 21636.00 frames. ], tot_loss[loss=0.277, simple_loss=0.348, pruned_loss=0.103, over 4272306.03 frames. 
], batch size: 263, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:54:32,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=602604.0, ans=0.0 2023-06-20 01:54:40,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=602604.0, ans=0.0 2023-06-20 01:55:32,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=602724.0, ans=0.0 2023-06-20 01:55:37,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=602724.0, ans=0.05 2023-06-20 01:56:00,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=602844.0, ans=0.125 2023-06-20 01:56:15,318 INFO [train.py:996] (1/4) Epoch 4, batch 9000, loss[loss=0.2425, simple_loss=0.2983, pruned_loss=0.09336, over 21676.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3412, pruned_loss=0.1021, over 4264197.07 frames. ], batch size: 248, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:56:15,319 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 01:56:27,158 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.6686, 4.9623, 5.0827, 5.3698], device='cuda:1') 2023-06-20 01:56:37,868 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2701, simple_loss=0.3695, pruned_loss=0.08531, over 1796401.00 frames. 2023-06-20 01:56:37,869 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 01:57:34,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=603024.0, ans=0.0 2023-06-20 01:57:36,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=603024.0, ans=0.2 2023-06-20 01:57:40,794 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.934e+02 3.477e+02 4.426e+02 7.521e+02, threshold=6.955e+02, percent-clipped=2.0 2023-06-20 01:58:24,269 INFO [train.py:996] (1/4) Epoch 4, batch 9050, loss[loss=0.2315, simple_loss=0.3136, pruned_loss=0.07472, over 21734.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3364, pruned_loss=0.09809, over 4264338.07 frames. ], batch size: 332, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:58:43,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603204.0, ans=0.1 2023-06-20 01:58:43,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-20 01:58:57,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603264.0, ans=0.1 2023-06-20 01:59:05,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=603264.0, ans=0.0 2023-06-20 01:59:09,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-20 02:00:15,333 INFO [train.py:996] (1/4) Epoch 4, batch 9100, loss[loss=0.2553, simple_loss=0.3507, pruned_loss=0.07999, over 21700.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3398, pruned_loss=0.09867, over 4269840.61 frames. 
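The block above is the periodic validation pass: the log announces "Computing validation loss", prints the validation losses over 1796401 frames, and reports peak GPU memory. A minimal sketch of that flow follows; `model`, `valid_dl` and `compute_loss` are placeholders for the recipe's own objects, with `compute_loss` assumed to return a summed loss and the number of frames it covers.

```python
# Minimal sketch of the periodic validation step whose output appears above
# ("Computing validation loss", the validation losses, and the peak-memory
# line).  model, valid_dl and compute_loss are placeholders for the recipe's
# own objects; compute_loss is assumed to return the summed loss for a batch
# together with the number of frames it covers.
import logging

import torch

def run_validation(model, valid_dl, compute_loss, device):
    logging.info("Computing validation loss")
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += float(loss)
            tot_frames += float(num_frames)
    model.train()
    logging.info(f"validation: loss={tot_loss / tot_frames:.4g} "
                 f"over {tot_frames:.0f} frames")
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {peak_mb}MB")
```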
], batch size: 351, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 02:00:17,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=603504.0, ans=0.0 2023-06-20 02:00:25,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=603504.0, ans=0.1 2023-06-20 02:00:47,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603624.0, ans=0.1 2023-06-20 02:00:47,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=603624.0, ans=0.1 2023-06-20 02:01:08,273 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.683e+02 3.374e+02 4.242e+02 6.313e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-20 02:01:56,015 INFO [train.py:996] (1/4) Epoch 4, batch 9150, loss[loss=0.251, simple_loss=0.3362, pruned_loss=0.08291, over 21624.00 frames. ], tot_loss[loss=0.267, simple_loss=0.341, pruned_loss=0.09652, over 4273307.64 frames. ], batch size: 263, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:02:03,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=603804.0, ans=0.0 2023-06-20 02:02:16,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=603864.0, ans=0.0 2023-06-20 02:02:24,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=603864.0, ans=0.2 2023-06-20 02:02:32,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-20 02:02:41,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=603924.0, ans=0.125 2023-06-20 02:02:45,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=603984.0, ans=0.125 2023-06-20 02:03:41,426 INFO [train.py:996] (1/4) Epoch 4, batch 9200, loss[loss=0.3158, simple_loss=0.3801, pruned_loss=0.1257, over 21897.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3449, pruned_loss=0.09696, over 4270998.45 frames. ], batch size: 316, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:03:55,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=604104.0, ans=0.125 2023-06-20 02:04:44,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=604284.0, ans=0.05 2023-06-20 02:04:45,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.901e+02 3.630e+02 4.447e+02 7.984e+02, threshold=7.260e+02, percent-clipped=1.0 2023-06-20 02:05:24,863 INFO [train.py:996] (1/4) Epoch 4, batch 9250, loss[loss=0.2508, simple_loss=0.3056, pruned_loss=0.09801, over 21860.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3463, pruned_loss=0.1005, over 4272546.12 frames. 
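The validation dump a little earlier also prints an attention diagnostic, `attn_weights_entropy`, for one of the self-attention modules. A sketch of how a per-head attention entropy can be computed follows; the `(num_heads, tgt_len, src_len)` layout and the averaging over positions are assumptions.

```python
# Sketch of an attention-entropy diagnostic like attn_weights_entropy above:
# the entropy of each head's attention distribution, averaged over positions.
# The (num_heads, tgt_len, src_len) layout and the averaging are assumptions.
import torch

def attn_weights_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: (num_heads, tgt_len, src_len), each row summing to 1
    p = attn_weights.clamp(min=1e-20)
    ent = -(p * p.log()).sum(dim=-1)  # entropy per (head, position)
    return ent.mean(dim=-1)           # one value per head

weights = torch.softmax(torch.randn(4, 50, 50), dim=-1)
print(attn_weights_entropy(weights))  # a bit below log(50) ~ 3.9 for these near-uniform weights
```

Higher values mean a head spreads its attention broadly; a collapse toward zero would indicate heads locking onto single frames.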
], batch size: 98, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:06:41,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=604584.0, ans=0.1 2023-06-20 02:07:06,130 INFO [train.py:996] (1/4) Epoch 4, batch 9300, loss[loss=0.3378, simple_loss=0.3952, pruned_loss=0.1402, over 21348.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3429, pruned_loss=0.1007, over 4262414.25 frames. ], batch size: 507, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:07:08,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=604704.0, ans=0.0 2023-06-20 02:08:18,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.021e+02 3.641e+02 4.393e+02 8.139e+02, threshold=7.281e+02, percent-clipped=1.0 2023-06-20 02:08:26,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=604884.0, ans=0.125 2023-06-20 02:08:46,467 INFO [train.py:996] (1/4) Epoch 4, batch 9350, loss[loss=0.3051, simple_loss=0.3826, pruned_loss=0.1138, over 21772.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3474, pruned_loss=0.1019, over 4259254.10 frames. ], batch size: 441, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:09:51,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=605124.0, ans=0.0 2023-06-20 02:10:00,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-20 02:10:03,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=605184.0, ans=0.125 2023-06-20 02:10:30,854 INFO [train.py:996] (1/4) Epoch 4, batch 9400, loss[loss=0.2487, simple_loss=0.3234, pruned_loss=0.08701, over 21295.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3498, pruned_loss=0.1029, over 4259647.97 frames. ], batch size: 131, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:11:44,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.73 vs. limit=10.0 2023-06-20 02:11:44,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605484.0, ans=0.1 2023-06-20 02:11:46,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.026e+02 3.567e+02 4.359e+02 8.563e+02, threshold=7.134e+02, percent-clipped=2.0 2023-06-20 02:11:50,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=605484.0, ans=0.2 2023-06-20 02:12:13,922 INFO [train.py:996] (1/4) Epoch 4, batch 9450, loss[loss=0.2325, simple_loss=0.2871, pruned_loss=0.0889, over 21601.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3429, pruned_loss=0.1016, over 4246152.14 frames. ], batch size: 298, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:12:19,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=12.0 2023-06-20 02:12:36,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=605664.0, ans=0.07 2023-06-20 02:13:18,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=605784.0, ans=0.05 2023-06-20 02:13:26,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=605784.0, ans=0.125 2023-06-20 02:13:36,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605844.0, ans=0.1 2023-06-20 02:13:52,763 INFO [train.py:996] (1/4) Epoch 4, batch 9500, loss[loss=0.2618, simple_loss=0.329, pruned_loss=0.09735, over 21790.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3333, pruned_loss=0.09888, over 4248763.24 frames. ], batch size: 124, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:13:55,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-20 02:13:56,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605904.0, ans=0.1 2023-06-20 02:14:10,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=605904.0, ans=0.125 2023-06-20 02:15:07,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-20 02:15:09,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.876e+02 3.483e+02 4.277e+02 8.627e+02, threshold=6.965e+02, percent-clipped=2.0 2023-06-20 02:15:37,582 INFO [train.py:996] (1/4) Epoch 4, batch 9550, loss[loss=0.3182, simple_loss=0.3839, pruned_loss=0.1262, over 21836.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3371, pruned_loss=0.1005, over 4253480.77 frames. ], batch size: 118, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:15:53,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=606204.0, ans=0.125 2023-06-20 02:16:41,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=606324.0, ans=0.2 2023-06-20 02:16:56,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=606384.0, ans=0.04949747468305833 2023-06-20 02:17:11,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=606444.0, ans=0.0 2023-06-20 02:17:21,045 INFO [train.py:996] (1/4) Epoch 4, batch 9600, loss[loss=0.2481, simple_loss=0.3097, pruned_loss=0.09326, over 21393.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3383, pruned_loss=0.1021, over 4261405.81 frames. 
], batch size: 159, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 02:17:21,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=606504.0, ans=0.125 2023-06-20 02:17:57,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=606564.0, ans=0.1 2023-06-20 02:18:22,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=606624.0, ans=15.0 2023-06-20 02:18:25,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=606624.0, ans=0.125 2023-06-20 02:18:36,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.961e+02 3.442e+02 3.920e+02 7.478e+02, threshold=6.885e+02, percent-clipped=1.0 2023-06-20 02:18:44,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-20 02:18:58,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=606744.0, ans=0.1 2023-06-20 02:19:09,537 INFO [train.py:996] (1/4) Epoch 4, batch 9650, loss[loss=0.2788, simple_loss=0.3447, pruned_loss=0.1065, over 21331.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3387, pruned_loss=0.1025, over 4262973.18 frames. ], batch size: 159, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:19:13,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=606804.0, ans=0.125 2023-06-20 02:19:35,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=606864.0, ans=0.1 2023-06-20 02:19:55,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=606864.0, ans=0.2 2023-06-20 02:20:07,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=606924.0, ans=0.125 2023-06-20 02:20:54,111 INFO [train.py:996] (1/4) Epoch 4, batch 9700, loss[loss=0.2493, simple_loss=0.3224, pruned_loss=0.08807, over 16803.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3411, pruned_loss=0.1029, over 4269445.27 frames. ], batch size: 61, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:21:04,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-20 02:21:26,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=607164.0, ans=0.0 2023-06-20 02:22:07,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.962e+02 3.413e+02 3.970e+02 9.096e+02, threshold=6.826e+02, percent-clipped=3.0 2023-06-20 02:22:19,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=607344.0, ans=0.125 2023-06-20 02:22:21,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.61 vs. 
limit=6.0 2023-06-20 02:22:37,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=607404.0, ans=0.1 2023-06-20 02:22:38,371 INFO [train.py:996] (1/4) Epoch 4, batch 9750, loss[loss=0.2428, simple_loss=0.3048, pruned_loss=0.09038, over 15597.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.333, pruned_loss=0.1008, over 4266834.06 frames. ], batch size: 60, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:23:03,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=607464.0, ans=0.0 2023-06-20 02:23:04,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-20 02:23:41,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=607584.0, ans=0.04949747468305833 2023-06-20 02:23:56,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=607644.0, ans=0.0 2023-06-20 02:24:13,820 INFO [train.py:996] (1/4) Epoch 4, batch 9800, loss[loss=0.2609, simple_loss=0.3299, pruned_loss=0.09596, over 21754.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3353, pruned_loss=0.1022, over 4270208.10 frames. ], batch size: 112, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:24:27,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.20 vs. limit=22.5 2023-06-20 02:24:29,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=607704.0, ans=0.125 2023-06-20 02:25:26,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=607884.0, ans=0.07 2023-06-20 02:25:29,985 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.802e+02 3.178e+02 3.680e+02 5.783e+02, threshold=6.355e+02, percent-clipped=0.0 2023-06-20 02:25:56,196 INFO [train.py:996] (1/4) Epoch 4, batch 9850, loss[loss=0.2541, simple_loss=0.3191, pruned_loss=0.09454, over 21871.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3317, pruned_loss=0.1011, over 4249155.88 frames. ], batch size: 371, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:26:03,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=608004.0, ans=0.04949747468305833 2023-06-20 02:27:27,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=608244.0, ans=0.125 2023-06-20 02:27:41,043 INFO [train.py:996] (1/4) Epoch 4, batch 9900, loss[loss=0.2706, simple_loss=0.3323, pruned_loss=0.1044, over 21450.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3275, pruned_loss=0.1006, over 4246492.19 frames. 
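Each training record above reports two sets of numbers: the loss of the current batch and a running `tot_loss` "over N frames", that is, a frame-weighted average across recent batches. A small sketch of such a tracker follows; the class name and the unbounded accumulation are stand-ins for illustration, not the recipe's actual bookkeeping.

```python
# Sketch of the frame-weighted running average behind the "tot_loss[..., over
# N frames]" figures: each batch's per-frame losses are weighted by how many
# frames the batch contains.  The class name and the unbounded accumulation
# are stand-ins for illustration, not the recipe's actual bookkeeping.
class MetricsTracker(dict):
    def add(self, name, value, frames):
        s, f = self.get(name, (0.0, 0.0))
        self[name] = (s + float(value) * frames, f + frames)

    def summary(self):
        return {name: s / f for name, (s, f) in self.items()}

tracker = MetricsTracker()
tracker.add("loss", 0.2609, 21754)  # batch 9800 above
tracker.add("loss", 0.2541, 21871)  # batch 9850 above
print(tracker.summary())            # {'loss': ~0.257}, weighted by frame count
```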
], batch size: 194, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:28:03,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=608304.0, ans=0.0 2023-06-20 02:28:38,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=608424.0, ans=0.0 2023-06-20 02:29:00,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.957e+02 3.478e+02 4.825e+02 8.249e+02, threshold=6.956e+02, percent-clipped=2.0 2023-06-20 02:29:31,823 INFO [train.py:996] (1/4) Epoch 4, batch 9950, loss[loss=0.2748, simple_loss=0.3351, pruned_loss=0.1073, over 19876.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3308, pruned_loss=0.1026, over 4254012.88 frames. ], batch size: 702, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:29:32,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-20 02:30:10,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=608664.0, ans=0.2 2023-06-20 02:30:58,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=608844.0, ans=0.125 2023-06-20 02:31:23,718 INFO [train.py:996] (1/4) Epoch 4, batch 10000, loss[loss=0.2266, simple_loss=0.2858, pruned_loss=0.08371, over 21345.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3258, pruned_loss=0.1007, over 4256471.16 frames. ], batch size: 176, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:31:44,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-20 02:31:53,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=608964.0, ans=0.125 2023-06-20 02:32:32,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.674e+02 3.222e+02 3.735e+02 9.123e+02, threshold=6.443e+02, percent-clipped=2.0 2023-06-20 02:32:34,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=609084.0, ans=0.125 2023-06-20 02:32:43,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-20 02:33:00,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=609144.0, ans=0.0 2023-06-20 02:33:14,568 INFO [train.py:996] (1/4) Epoch 4, batch 10050, loss[loss=0.2197, simple_loss=0.2777, pruned_loss=0.08082, over 21866.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3267, pruned_loss=0.1011, over 4254157.76 frames. 
], batch size: 98, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:33:29,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=609204.0, ans=0.015 2023-06-20 02:33:41,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=609264.0, ans=0.125 2023-06-20 02:33:56,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=609324.0, ans=0.035 2023-06-20 02:34:05,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=609324.0, ans=0.125 2023-06-20 02:34:30,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=609384.0, ans=0.0 2023-06-20 02:35:07,520 INFO [train.py:996] (1/4) Epoch 4, batch 10100, loss[loss=0.265, simple_loss=0.3308, pruned_loss=0.0996, over 21623.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3227, pruned_loss=0.09797, over 4256623.60 frames. ], batch size: 263, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:35:17,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=609504.0, ans=0.125 2023-06-20 02:35:48,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=609624.0, ans=0.025 2023-06-20 02:36:11,674 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.027e+02 3.628e+02 4.360e+02 7.943e+02, threshold=7.256e+02, percent-clipped=2.0 2023-06-20 02:36:50,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=609744.0, ans=0.0 2023-06-20 02:36:53,328 INFO [train.py:996] (1/4) Epoch 4, batch 10150, loss[loss=0.2643, simple_loss=0.3436, pruned_loss=0.09254, over 21840.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3293, pruned_loss=0.1005, over 4258143.51 frames. ], batch size: 316, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:37:20,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=609864.0, ans=0.125 2023-06-20 02:37:30,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=609924.0, ans=0.1 2023-06-20 02:37:42,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=609984.0, ans=0.2 2023-06-20 02:38:15,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=610044.0, ans=0.5 2023-06-20 02:38:18,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=610044.0, ans=0.04949747468305833 2023-06-20 02:38:35,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=610044.0, ans=0.0 2023-06-20 02:38:38,437 INFO [train.py:996] (1/4) Epoch 4, batch 10200, loss[loss=0.2708, simple_loss=0.3241, pruned_loss=0.1088, over 21870.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.328, pruned_loss=0.09752, over 4258837.89 frames. 
], batch size: 107, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:38:45,994 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:38:47,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=610104.0, ans=0.125 2023-06-20 02:39:02,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=610164.0, ans=0.125 2023-06-20 02:39:52,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.503e+02 3.181e+02 4.095e+02 8.895e+02, threshold=6.363e+02, percent-clipped=3.0 2023-06-20 02:39:53,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=610284.0, ans=0.125 2023-06-20 02:40:00,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=610344.0, ans=0.125 2023-06-20 02:40:16,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=610344.0, ans=0.0 2023-06-20 02:40:23,067 INFO [train.py:996] (1/4) Epoch 4, batch 10250, loss[loss=0.2094, simple_loss=0.2981, pruned_loss=0.06037, over 21635.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3235, pruned_loss=0.09178, over 4263780.86 frames. ], batch size: 230, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:40:34,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=610404.0, ans=0.125 2023-06-20 02:40:57,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-20 02:42:00,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=610644.0, ans=0.125 2023-06-20 02:42:09,912 INFO [train.py:996] (1/4) Epoch 4, batch 10300, loss[loss=0.2762, simple_loss=0.3464, pruned_loss=0.1029, over 21726.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3254, pruned_loss=0.09105, over 4265163.03 frames. ], batch size: 298, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:42:11,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=610704.0, ans=0.0 2023-06-20 02:42:30,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=610764.0, ans=0.0 2023-06-20 02:42:34,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=610764.0, ans=0.2 2023-06-20 02:42:38,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-20 02:43:02,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=15.0 2023-06-20 02:43:12,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=610884.0, ans=0.125 2023-06-20 02:43:24,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=610884.0, ans=0.125 2023-06-20 02:43:25,980 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 3.082e+02 3.756e+02 4.509e+02 8.129e+02, threshold=7.512e+02, percent-clipped=5.0 2023-06-20 02:43:51,313 INFO [train.py:996] (1/4) Epoch 4, batch 10350, loss[loss=0.158, simple_loss=0.1943, pruned_loss=0.06088, over 16692.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3291, pruned_loss=0.09205, over 4266514.43 frames. ], batch size: 60, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:43:55,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=611004.0, ans=0.125 2023-06-20 02:44:06,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=611064.0, ans=0.0 2023-06-20 02:44:45,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=611124.0, ans=0.0 2023-06-20 02:44:56,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=611124.0, ans=0.125 2023-06-20 02:45:35,477 INFO [train.py:996] (1/4) Epoch 4, batch 10400, loss[loss=0.3099, simple_loss=0.4059, pruned_loss=0.1069, over 21238.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3196, pruned_loss=0.08994, over 4265998.71 frames. ], batch size: 549, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:45:35,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=611304.0, ans=0.125 2023-06-20 02:45:44,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=611304.0, ans=0.125 2023-06-20 02:46:05,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=611364.0, ans=0.2 2023-06-20 02:46:09,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=611364.0, ans=0.2 2023-06-20 02:46:35,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=611424.0, ans=0.125 2023-06-20 02:46:45,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=611424.0, ans=0.1 2023-06-20 02:46:56,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.060e+02 3.521e+02 4.304e+02 7.584e+02, threshold=7.042e+02, percent-clipped=1.0 2023-06-20 02:47:21,557 INFO [train.py:996] (1/4) Epoch 4, batch 10450, loss[loss=0.2744, simple_loss=0.3537, pruned_loss=0.09756, over 21764.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3229, pruned_loss=0.09325, over 4266728.76 frames. ], batch size: 332, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:48:45,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611784.0, ans=0.1 2023-06-20 02:49:17,067 INFO [train.py:996] (1/4) Epoch 4, batch 10500, loss[loss=0.2633, simple_loss=0.3172, pruned_loss=0.1048, over 21835.00 frames. 
], tot_loss[loss=0.2545, simple_loss=0.3239, pruned_loss=0.09259, over 4268676.61 frames. ], batch size: 102, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:50:00,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=612024.0, ans=0.125 2023-06-20 02:50:18,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=612084.0, ans=0.0 2023-06-20 02:50:25,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.864e+02 3.538e+02 4.749e+02 1.100e+03, threshold=7.075e+02, percent-clipped=4.0 2023-06-20 02:50:33,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=612144.0, ans=0.0 2023-06-20 02:50:47,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=612144.0, ans=0.125 2023-06-20 02:50:51,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612144.0, ans=0.125 2023-06-20 02:51:01,149 INFO [train.py:996] (1/4) Epoch 4, batch 10550, loss[loss=0.2352, simple_loss=0.2844, pruned_loss=0.09301, over 21598.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3197, pruned_loss=0.09309, over 4256298.69 frames. ], batch size: 231, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:51:29,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0 2023-06-20 02:51:47,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-20 02:52:05,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=612384.0, ans=0.125 2023-06-20 02:52:06,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=612384.0, ans=0.0 2023-06-20 02:52:15,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=612444.0, ans=0.09899494936611666 2023-06-20 02:52:41,237 INFO [train.py:996] (1/4) Epoch 4, batch 10600, loss[loss=0.2315, simple_loss=0.2848, pruned_loss=0.08913, over 21558.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3164, pruned_loss=0.09158, over 4258308.35 frames. 
], batch size: 231, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:53:00,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=612504.0, ans=0.125 2023-06-20 02:53:12,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612564.0, ans=0.1 2023-06-20 02:53:26,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=612624.0, ans=0.04949747468305833 2023-06-20 02:53:48,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=612684.0, ans=0.0 2023-06-20 02:53:50,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=612684.0, ans=0.125 2023-06-20 02:53:53,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.887e+02 3.753e+02 5.124e+02 8.898e+02, threshold=7.506e+02, percent-clipped=7.0 2023-06-20 02:54:23,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=612744.0, ans=0.125 2023-06-20 02:54:28,414 INFO [train.py:996] (1/4) Epoch 4, batch 10650, loss[loss=0.2987, simple_loss=0.3659, pruned_loss=0.1158, over 21445.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3162, pruned_loss=0.08909, over 4260647.00 frames. ], batch size: 471, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:55:01,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=612864.0, ans=0.0 2023-06-20 02:55:05,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-20 02:55:13,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612924.0, ans=0.1 2023-06-20 02:55:20,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.40 vs. limit=10.0 2023-06-20 02:55:39,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=12.0 2023-06-20 02:56:03,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=613044.0, ans=0.1 2023-06-20 02:56:26,505 INFO [train.py:996] (1/4) Epoch 4, batch 10700, loss[loss=0.2847, simple_loss=0.3575, pruned_loss=0.106, over 21620.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3165, pruned_loss=0.08971, over 4258303.58 frames. ], batch size: 415, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:57:22,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=613284.0, ans=0.125 2023-06-20 02:57:36,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.228e+02 3.667e+02 4.550e+02 7.977e+02, threshold=7.334e+02, percent-clipped=1.0 2023-06-20 02:57:58,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=12.0 2023-06-20 02:58:11,812 INFO [train.py:996] (1/4) Epoch 4, batch 10750, loss[loss=0.2794, simple_loss=0.3471, pruned_loss=0.1058, over 21432.00 frames. 
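The lr column in these records decays very slowly (8.25e-03 at batch 8000 down to roughly 8.14e-03 here). A schedule with that general shape, similar in spirit to the Eden-style schedules used in Zipformer recipes, can be written as the product of a batch-dependent and an epoch-dependent factor, as sketched below; the constants are placeholders rather than values read from this run, and the recipe's own scheduler should be consulted for the exact formula.

```python
# Hedged sketch of a learning-rate schedule with the slowly decaying shape seen
# in the lr column: a product of a batch-dependent and an epoch-dependent
# factor.  base_lr, lr_batches and lr_epochs are placeholder constants, not
# values taken from this run.
def scheduled_lr(base_lr, batch, epoch, lr_batches=5000.0, lr_epochs=2.0):
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

for step in (0, 10_000, 100_000, 600_000):
    print(step, f"{scheduled_lr(0.05, step, epoch=4):.3e}")  # slow, smooth decay
```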
], tot_loss[loss=0.2594, simple_loss=0.3283, pruned_loss=0.09519, over 4260423.40 frames. ], batch size: 194, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:58:47,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=613524.0, ans=0.125 2023-06-20 02:59:58,068 INFO [train.py:996] (1/4) Epoch 4, batch 10800, loss[loss=0.3709, simple_loss=0.4185, pruned_loss=0.1617, over 21358.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3345, pruned_loss=0.09604, over 4268789.08 frames. ], batch size: 507, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:00:40,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.95 vs. limit=15.0 2023-06-20 03:00:54,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-20 03:01:05,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=613884.0, ans=0.0 2023-06-20 03:01:16,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.918e+02 3.193e+02 3.822e+02 6.360e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-20 03:01:30,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=613944.0, ans=0.125 2023-06-20 03:01:35,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=613944.0, ans=0.0 2023-06-20 03:01:41,159 INFO [train.py:996] (1/4) Epoch 4, batch 10850, loss[loss=0.2773, simple_loss=0.3662, pruned_loss=0.09414, over 20839.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3347, pruned_loss=0.09637, over 4262845.91 frames. ], batch size: 609, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:02:06,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=12.0 2023-06-20 03:02:49,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=614184.0, ans=0.125 2023-06-20 03:03:06,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614244.0, ans=0.1 2023-06-20 03:03:24,888 INFO [train.py:996] (1/4) Epoch 4, batch 10900, loss[loss=0.283, simple_loss=0.3478, pruned_loss=0.1091, over 20718.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3282, pruned_loss=0.09464, over 4260795.43 frames. ], batch size: 607, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:03:59,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. 
limit=15.0 2023-06-20 03:04:02,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=614364.0, ans=0.1 2023-06-20 03:04:06,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=614424.0, ans=0.0 2023-06-20 03:04:06,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=614424.0, ans=0.2 2023-06-20 03:04:22,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=614424.0, ans=0.0 2023-06-20 03:04:43,226 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.644e+02 3.102e+02 4.157e+02 6.653e+02, threshold=6.203e+02, percent-clipped=4.0 2023-06-20 03:04:46,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=614484.0, ans=0.125 2023-06-20 03:05:07,343 INFO [train.py:996] (1/4) Epoch 4, batch 10950, loss[loss=0.2254, simple_loss=0.2795, pruned_loss=0.08565, over 21108.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3242, pruned_loss=0.09276, over 4265615.79 frames. ], batch size: 143, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:05:56,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=614724.0, ans=0.125 2023-06-20 03:06:49,623 INFO [train.py:996] (1/4) Epoch 4, batch 11000, loss[loss=0.2505, simple_loss=0.3073, pruned_loss=0.09681, over 21287.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3215, pruned_loss=0.09324, over 4252235.20 frames. ], batch size: 176, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:06:57,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0 2023-06-20 03:07:34,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=615024.0, ans=0.125 2023-06-20 03:08:07,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.675e+02 3.019e+02 3.528e+02 7.831e+02, threshold=6.038e+02, percent-clipped=1.0 2023-06-20 03:08:14,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=615144.0, ans=0.1 2023-06-20 03:08:15,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=615144.0, ans=0.2 2023-06-20 03:08:17,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=615144.0, ans=0.125 2023-06-20 03:08:18,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-20 03:08:22,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=615144.0, ans=0.0 2023-06-20 03:08:31,483 INFO [train.py:996] (1/4) Epoch 4, batch 11050, loss[loss=0.2604, simple_loss=0.3113, pruned_loss=0.1047, over 21791.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3204, pruned_loss=0.09497, over 4256569.87 frames. 
], batch size: 98, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:08:33,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=615204.0, ans=0.1 2023-06-20 03:09:38,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-20 03:10:13,743 INFO [train.py:996] (1/4) Epoch 4, batch 11100, loss[loss=0.2509, simple_loss=0.3118, pruned_loss=0.095, over 21902.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3196, pruned_loss=0.09514, over 4261695.31 frames. ], batch size: 107, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 03:10:44,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=615564.0, ans=0.07 2023-06-20 03:11:33,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.025e+02 3.574e+02 4.736e+02 7.982e+02, threshold=7.148e+02, percent-clipped=11.0 2023-06-20 03:11:56,224 INFO [train.py:996] (1/4) Epoch 4, batch 11150, loss[loss=0.2488, simple_loss=0.3262, pruned_loss=0.08576, over 21409.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3189, pruned_loss=0.09477, over 4252118.40 frames. ], batch size: 194, lr: 8.12e-03, grad_scale: 16.0 2023-06-20 03:12:06,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=615804.0, ans=0.125 2023-06-20 03:12:42,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=615924.0, ans=0.125 2023-06-20 03:13:24,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=616044.0, ans=0.125 2023-06-20 03:13:33,707 INFO [train.py:996] (1/4) Epoch 4, batch 11200, loss[loss=0.2456, simple_loss=0.3015, pruned_loss=0.09489, over 21474.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3163, pruned_loss=0.09501, over 4248534.38 frames. ], batch size: 441, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:13:42,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=616104.0, ans=0.0 2023-06-20 03:14:19,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=616224.0, ans=0.0 2023-06-20 03:14:39,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=616284.0, ans=0.125 2023-06-20 03:14:54,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.789e+02 3.395e+02 4.282e+02 6.875e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-20 03:15:17,015 INFO [train.py:996] (1/4) Epoch 4, batch 11250, loss[loss=0.2817, simple_loss=0.3452, pruned_loss=0.1091, over 21838.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3169, pruned_loss=0.09585, over 4248952.29 frames. 
], batch size: 107, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:15:30,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=616404.0, ans=0.125 2023-06-20 03:15:38,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=616464.0, ans=0.125 2023-06-20 03:16:07,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-20 03:16:10,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=616524.0, ans=0.0 2023-06-20 03:16:10,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=616524.0, ans=0.2 2023-06-20 03:16:59,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=616704.0, ans=0.125 2023-06-20 03:17:00,604 INFO [train.py:996] (1/4) Epoch 4, batch 11300, loss[loss=0.3018, simple_loss=0.3504, pruned_loss=0.1267, over 21936.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3201, pruned_loss=0.09555, over 4257009.89 frames. ], batch size: 414, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:17:01,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.43 vs. limit=15.0 2023-06-20 03:17:15,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=616764.0, ans=0.125 2023-06-20 03:17:51,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=616824.0, ans=0.125 2023-06-20 03:18:21,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.889e+02 3.588e+02 4.478e+02 6.707e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-20 03:18:29,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=616944.0, ans=0.125 2023-06-20 03:18:45,388 INFO [train.py:996] (1/4) Epoch 4, batch 11350, loss[loss=0.2831, simple_loss=0.3539, pruned_loss=0.1062, over 21739.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3209, pruned_loss=0.09459, over 4254996.87 frames. ], batch size: 332, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:19:42,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-20 03:20:08,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-20 03:20:15,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=617244.0, ans=0.0 2023-06-20 03:20:41,069 INFO [train.py:996] (1/4) Epoch 4, batch 11400, loss[loss=0.2466, simple_loss=0.3166, pruned_loss=0.08826, over 21440.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.327, pruned_loss=0.09771, over 4260332.45 frames. 
], batch size: 194, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:20:41,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=617304.0, ans=0.035 2023-06-20 03:20:43,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=617304.0, ans=0.0 2023-06-20 03:20:57,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=617364.0, ans=0.125 2023-06-20 03:21:14,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=617364.0, ans=0.125 2023-06-20 03:21:36,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=617424.0, ans=0.0 2023-06-20 03:21:38,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=617424.0, ans=0.1 2023-06-20 03:21:45,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-20 03:21:51,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=617484.0, ans=10.0 2023-06-20 03:21:53,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.300e+02 4.007e+02 5.041e+02 8.202e+02, threshold=8.013e+02, percent-clipped=6.0 2023-06-20 03:22:17,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=617544.0, ans=0.125 2023-06-20 03:22:21,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=617544.0, ans=0.0 2023-06-20 03:22:26,648 INFO [train.py:996] (1/4) Epoch 4, batch 11450, loss[loss=0.2915, simple_loss=0.3473, pruned_loss=0.1178, over 21422.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3299, pruned_loss=0.09758, over 4259879.00 frames. ], batch size: 211, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:23:44,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=617844.0, ans=0.125 2023-06-20 03:24:05,925 INFO [train.py:996] (1/4) Epoch 4, batch 11500, loss[loss=0.2565, simple_loss=0.3641, pruned_loss=0.07447, over 19783.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3335, pruned_loss=0.09837, over 4258350.77 frames. ], batch size: 703, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:24:06,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=617904.0, ans=0.125 2023-06-20 03:24:53,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. 
limit=15.0 2023-06-20 03:25:02,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=618024.0, ans=0.0 2023-06-20 03:25:06,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=618084.0, ans=0.1 2023-06-20 03:25:19,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.106e+02 3.529e+02 4.777e+02 9.700e+02, threshold=7.057e+02, percent-clipped=3.0 2023-06-20 03:25:32,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=618144.0, ans=0.02 2023-06-20 03:25:47,089 INFO [train.py:996] (1/4) Epoch 4, batch 11550, loss[loss=0.2792, simple_loss=0.3516, pruned_loss=0.1035, over 19962.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3365, pruned_loss=0.09741, over 4261530.61 frames. ], batch size: 703, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:26:00,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=12.0 2023-06-20 03:27:21,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=618444.0, ans=0.125 2023-06-20 03:27:37,871 INFO [train.py:996] (1/4) Epoch 4, batch 11600, loss[loss=0.2711, simple_loss=0.3643, pruned_loss=0.0889, over 21714.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3515, pruned_loss=0.09885, over 4250479.88 frames. ], batch size: 247, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:27:53,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=618504.0, ans=0.125 2023-06-20 03:28:29,825 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.52 vs. limit=15.0 2023-06-20 03:28:40,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=618624.0, ans=0.0 2023-06-20 03:28:55,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-20 03:28:55,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.978e+02 3.550e+02 4.282e+02 6.763e+02, threshold=7.099e+02, percent-clipped=1.0 2023-06-20 03:29:15,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-20 03:29:22,204 INFO [train.py:996] (1/4) Epoch 4, batch 11650, loss[loss=0.2463, simple_loss=0.3063, pruned_loss=0.09314, over 20079.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3594, pruned_loss=0.1006, over 4253580.28 frames. ], batch size: 704, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:29:33,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. 
limit=15.0 2023-06-20 03:29:43,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=618864.0, ans=0.0 2023-06-20 03:29:59,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=618924.0, ans=0.0 2023-06-20 03:30:35,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618984.0, ans=0.1 2023-06-20 03:30:59,664 INFO [train.py:996] (1/4) Epoch 4, batch 11700, loss[loss=0.2349, simple_loss=0.2945, pruned_loss=0.08763, over 21147.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.3492, pruned_loss=0.1002, over 4247341.14 frames. ], batch size: 176, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:31:08,551 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:31:43,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=619224.0, ans=0.2 2023-06-20 03:32:17,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.962e+02 3.618e+02 4.622e+02 7.851e+02, threshold=7.236e+02, percent-clipped=2.0 2023-06-20 03:32:35,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-20 03:32:49,956 INFO [train.py:996] (1/4) Epoch 4, batch 11750, loss[loss=0.2509, simple_loss=0.3154, pruned_loss=0.09324, over 21683.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3408, pruned_loss=0.09943, over 4255987.76 frames. ], batch size: 247, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:32:58,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=619404.0, ans=0.0 2023-06-20 03:33:01,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=22.5 2023-06-20 03:33:11,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=619464.0, ans=0.125 2023-06-20 03:34:35,051 INFO [train.py:996] (1/4) Epoch 4, batch 11800, loss[loss=0.2076, simple_loss=0.2653, pruned_loss=0.07493, over 21448.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3415, pruned_loss=0.1012, over 4261203.01 frames. ], batch size: 212, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:34:41,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-20 03:34:44,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=619704.0, ans=0.09899494936611666 2023-06-20 03:34:47,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.72 vs. 
limit=15.0 2023-06-20 03:34:53,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=619764.0, ans=0.2 2023-06-20 03:34:58,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=619764.0, ans=0.0 2023-06-20 03:35:23,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=619824.0, ans=0.1 2023-06-20 03:35:24,114 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-20 03:35:28,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=619824.0, ans=0.125 2023-06-20 03:35:45,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=619884.0, ans=0.07 2023-06-20 03:35:51,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.018e+02 3.704e+02 4.604e+02 8.251e+02, threshold=7.407e+02, percent-clipped=4.0 2023-06-20 03:36:10,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=619944.0, ans=0.125 2023-06-20 03:36:19,792 INFO [train.py:996] (1/4) Epoch 4, batch 11850, loss[loss=0.2359, simple_loss=0.3138, pruned_loss=0.07907, over 21391.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3421, pruned_loss=0.1006, over 4266463.65 frames. ], batch size: 176, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:36:33,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=620004.0, ans=0.125 2023-06-20 03:36:42,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=620064.0, ans=0.05 2023-06-20 03:36:52,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=620064.0, ans=0.04949747468305833 2023-06-20 03:37:09,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=620124.0, ans=0.125 2023-06-20 03:37:46,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=620244.0, ans=0.0 2023-06-20 03:37:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=620244.0, ans=0.125 2023-06-20 03:37:59,364 INFO [train.py:996] (1/4) Epoch 4, batch 11900, loss[loss=0.2847, simple_loss=0.3503, pruned_loss=0.1096, over 20642.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3408, pruned_loss=0.09781, over 4270017.93 frames. ], batch size: 607, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:38:03,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=620304.0, ans=0.125 2023-06-20 03:38:12,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-20 03:38:17,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. 
limit=15.0 2023-06-20 03:38:25,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-20 03:38:35,477 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:38:37,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=620364.0, ans=0.0 2023-06-20 03:39:05,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=620484.0, ans=0.125 2023-06-20 03:39:17,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-20 03:39:19,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=620484.0, ans=0.0 2023-06-20 03:39:22,712 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.597e+02 3.084e+02 3.622e+02 5.304e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-20 03:39:24,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=620484.0, ans=0.2 2023-06-20 03:39:44,827 INFO [train.py:996] (1/4) Epoch 4, batch 11950, loss[loss=0.2263, simple_loss=0.2994, pruned_loss=0.07663, over 21583.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3398, pruned_loss=0.09373, over 4270941.58 frames. ], batch size: 230, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:40:08,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=620604.0, ans=0.125 2023-06-20 03:40:24,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=620664.0, ans=0.125 2023-06-20 03:40:31,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620724.0, ans=0.1 2023-06-20 03:40:55,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=620784.0, ans=0.1 2023-06-20 03:41:21,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-20 03:41:26,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-20 03:41:29,994 INFO [train.py:996] (1/4) Epoch 4, batch 12000, loss[loss=0.2179, simple_loss=0.2848, pruned_loss=0.07554, over 21676.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3343, pruned_loss=0.09163, over 4274490.16 frames. ], batch size: 282, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:41:29,994 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 03:41:51,443 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2681, simple_loss=0.3653, pruned_loss=0.08549, over 1796401.00 frames. 
2023-06-20 03:41:51,444 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 03:42:43,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=621024.0, ans=15.0 2023-06-20 03:42:53,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=621084.0, ans=0.125 2023-06-20 03:42:56,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=621084.0, ans=15.0 2023-06-20 03:43:01,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=621084.0, ans=0.0 2023-06-20 03:43:04,283 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.836e+02 3.434e+02 3.961e+02 6.580e+02, threshold=6.867e+02, percent-clipped=2.0 2023-06-20 03:43:21,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=621144.0, ans=0.95 2023-06-20 03:43:34,835 INFO [train.py:996] (1/4) Epoch 4, batch 12050, loss[loss=0.2671, simple_loss=0.3275, pruned_loss=0.1033, over 21672.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3309, pruned_loss=0.09345, over 4270368.93 frames. ], batch size: 230, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:43:35,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=621204.0, ans=0.2 2023-06-20 03:44:07,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-20 03:45:21,127 INFO [train.py:996] (1/4) Epoch 4, batch 12100, loss[loss=0.2832, simple_loss=0.3355, pruned_loss=0.1154, over 21700.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3372, pruned_loss=0.09888, over 4269465.69 frames. ], batch size: 264, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:45:54,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=621564.0, ans=0.125 2023-06-20 03:46:11,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=621624.0, ans=0.125 2023-06-20 03:46:21,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=621684.0, ans=0.0 2023-06-20 03:46:26,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=621684.0, ans=0.0 2023-06-20 03:46:47,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.023e+02 3.734e+02 4.594e+02 9.342e+02, threshold=7.469e+02, percent-clipped=3.0 2023-06-20 03:46:54,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=621744.0, ans=0.0 2023-06-20 03:47:14,421 INFO [train.py:996] (1/4) Epoch 4, batch 12150, loss[loss=0.3019, simple_loss=0.3821, pruned_loss=0.1109, over 21502.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3399, pruned_loss=0.09919, over 4265930.99 frames. 
], batch size: 471, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:47:16,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=621804.0, ans=0.2 2023-06-20 03:47:31,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=621804.0, ans=0.95 2023-06-20 03:47:41,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=621864.0, ans=0.0 2023-06-20 03:48:09,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=621924.0, ans=0.07 2023-06-20 03:48:11,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-20 03:48:25,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-20 03:49:02,719 INFO [train.py:996] (1/4) Epoch 4, batch 12200, loss[loss=0.2652, simple_loss=0.3187, pruned_loss=0.1058, over 21552.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3356, pruned_loss=0.09775, over 4255723.69 frames. ], batch size: 414, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:49:03,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-20 03:49:18,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=622164.0, ans=15.0 2023-06-20 03:50:04,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=622284.0, ans=0.09899494936611666 2023-06-20 03:50:18,570 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.874e+02 3.548e+02 4.515e+02 8.617e+02, threshold=7.096e+02, percent-clipped=2.0 2023-06-20 03:50:27,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622344.0, ans=0.1 2023-06-20 03:50:41,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=622404.0, ans=0.0 2023-06-20 03:50:43,035 INFO [train.py:996] (1/4) Epoch 4, batch 12250, loss[loss=0.1916, simple_loss=0.2768, pruned_loss=0.05319, over 21785.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3275, pruned_loss=0.09421, over 4258362.12 frames. ], batch size: 352, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:50:52,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. 
limit=15.0 2023-06-20 03:51:23,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=622524.0, ans=0.0 2023-06-20 03:51:31,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=622524.0, ans=0.07 2023-06-20 03:51:38,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=622524.0, ans=0.125 2023-06-20 03:52:18,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=622644.0, ans=0.125 2023-06-20 03:52:21,931 INFO [train.py:996] (1/4) Epoch 4, batch 12300, loss[loss=0.3254, simple_loss=0.4011, pruned_loss=0.1249, over 21675.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3174, pruned_loss=0.08702, over 4253284.06 frames. ], batch size: 414, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:52:35,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=622704.0, ans=0.125 2023-06-20 03:52:59,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=622764.0, ans=0.125 2023-06-20 03:53:33,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-20 03:53:45,499 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.451e+02 2.752e+02 3.675e+02 6.755e+02, threshold=5.504e+02, percent-clipped=0.0 2023-06-20 03:53:54,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-20 03:54:09,724 INFO [train.py:996] (1/4) Epoch 4, batch 12350, loss[loss=0.2657, simple_loss=0.3328, pruned_loss=0.09929, over 21811.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3226, pruned_loss=0.08806, over 4258303.56 frames. ], batch size: 298, lr: 8.08e-03, grad_scale: 16.0 2023-06-20 03:54:26,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=623004.0, ans=0.125 2023-06-20 03:54:42,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 03:54:50,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=623124.0, ans=0.2 2023-06-20 03:54:56,102 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:55:17,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=623184.0, ans=0.0 2023-06-20 03:55:32,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-20 03:55:44,952 INFO [train.py:996] (1/4) Epoch 4, batch 12400, loss[loss=0.27, simple_loss=0.327, pruned_loss=0.1065, over 21965.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3249, pruned_loss=0.09247, over 4272154.10 frames. 
], batch size: 316, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:55:53,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=623304.0, ans=0.125 2023-06-20 03:55:57,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=623304.0, ans=0.125 2023-06-20 03:56:17,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-20 03:56:20,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=623364.0, ans=0.2 2023-06-20 03:56:29,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-20 03:56:35,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=623424.0, ans=0.125 2023-06-20 03:56:43,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-20 03:57:03,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-20 03:57:06,675 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.090e+02 3.958e+02 4.907e+02 8.874e+02, threshold=7.916e+02, percent-clipped=17.0 2023-06-20 03:57:08,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=623544.0, ans=0.1 2023-06-20 03:57:23,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=623544.0, ans=0.2 2023-06-20 03:57:33,445 INFO [train.py:996] (1/4) Epoch 4, batch 12450, loss[loss=0.266, simple_loss=0.3294, pruned_loss=0.1013, over 20895.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3295, pruned_loss=0.09677, over 4277099.84 frames. ], batch size: 608, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 03:57:34,389 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=22.5 2023-06-20 03:57:37,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=623604.0, ans=0.125 2023-06-20 03:57:41,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=623604.0, ans=0.125 2023-06-20 03:57:57,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=623664.0, ans=0.0 2023-06-20 03:59:04,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-20 03:59:13,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=623844.0, ans=0.0 2023-06-20 03:59:19,783 INFO [train.py:996] (1/4) Epoch 4, batch 12500, loss[loss=0.2975, simple_loss=0.3831, pruned_loss=0.106, over 21315.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3393, pruned_loss=0.1002, over 4268550.30 frames. 
], batch size: 176, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 04:00:00,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-20 04:00:53,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.989e+02 3.343e+02 3.829e+02 7.985e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-20 04:00:54,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=624144.0, ans=0.2 2023-06-20 04:01:02,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. limit=10.0 2023-06-20 04:01:10,290 INFO [train.py:996] (1/4) Epoch 4, batch 12550, loss[loss=0.2728, simple_loss=0.3487, pruned_loss=0.09843, over 21687.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3444, pruned_loss=0.1022, over 4269251.45 frames. ], batch size: 351, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 04:02:10,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624324.0, ans=0.125 2023-06-20 04:02:24,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=624384.0, ans=0.0 2023-06-20 04:02:31,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=624384.0, ans=0.0 2023-06-20 04:02:34,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624384.0, ans=0.1 2023-06-20 04:02:46,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=624444.0, ans=0.125 2023-06-20 04:03:04,247 INFO [train.py:996] (1/4) Epoch 4, batch 12600, loss[loss=0.1915, simple_loss=0.257, pruned_loss=0.06302, over 21858.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3415, pruned_loss=0.09895, over 4276153.56 frames. ], batch size: 98, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:03:18,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-20 04:03:41,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-20 04:04:02,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=624684.0, ans=0.125 2023-06-20 04:04:21,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.838e+02 3.340e+02 3.938e+02 7.249e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 04:04:28,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=624744.0, ans=0.05 2023-06-20 04:04:38,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=624744.0, ans=0.0 2023-06-20 04:04:40,903 INFO [train.py:996] (1/4) Epoch 4, batch 12650, loss[loss=0.1802, simple_loss=0.2489, pruned_loss=0.05573, over 21404.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3342, pruned_loss=0.09439, over 4277871.34 frames. 
], batch size: 160, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:05:26,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=624924.0, ans=0.0 2023-06-20 04:06:07,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=625044.0, ans=0.0 2023-06-20 04:06:07,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=625044.0, ans=0.05 2023-06-20 04:06:18,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.06 vs. limit=6.0 2023-06-20 04:06:25,530 INFO [train.py:996] (1/4) Epoch 4, batch 12700, loss[loss=0.276, simple_loss=0.377, pruned_loss=0.08751, over 19769.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3368, pruned_loss=0.09769, over 4281145.27 frames. ], batch size: 702, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:07:08,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-20 04:07:15,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=625224.0, ans=0.125 2023-06-20 04:07:28,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=625284.0, ans=0.0 2023-06-20 04:07:53,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.213e+02 3.839e+02 4.784e+02 8.311e+02, threshold=7.678e+02, percent-clipped=6.0 2023-06-20 04:07:57,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625344.0, ans=0.1 2023-06-20 04:08:07,936 INFO [train.py:996] (1/4) Epoch 4, batch 12750, loss[loss=0.2785, simple_loss=0.3456, pruned_loss=0.1057, over 20082.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3377, pruned_loss=0.09808, over 4279783.84 frames. ], batch size: 703, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:09:04,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=625584.0, ans=0.2 2023-06-20 04:09:38,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=12.0 2023-06-20 04:09:39,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=625644.0, ans=0.0 2023-06-20 04:09:48,345 INFO [train.py:996] (1/4) Epoch 4, batch 12800, loss[loss=0.2476, simple_loss=0.3179, pruned_loss=0.08866, over 21750.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3372, pruned_loss=0.09935, over 4288534.12 frames. 
], batch size: 247, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:10:25,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=625764.0, ans=0.125 2023-06-20 04:11:16,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.809e+02 3.205e+02 4.130e+02 6.634e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-20 04:11:26,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=625944.0, ans=0.125 2023-06-20 04:11:26,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-20 04:11:37,267 INFO [train.py:996] (1/4) Epoch 4, batch 12850, loss[loss=0.2492, simple_loss=0.3334, pruned_loss=0.08246, over 21701.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3405, pruned_loss=0.1011, over 4284664.12 frames. ], batch size: 351, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:11:40,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=15.0 2023-06-20 04:13:15,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=626244.0, ans=0.2 2023-06-20 04:13:27,094 INFO [train.py:996] (1/4) Epoch 4, batch 12900, loss[loss=0.2555, simple_loss=0.3318, pruned_loss=0.08958, over 20012.00 frames. ], tot_loss[loss=0.265, simple_loss=0.337, pruned_loss=0.09647, over 4282715.34 frames. ], batch size: 703, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:13:32,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-20 04:13:48,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=626364.0, ans=0.125 2023-06-20 04:13:52,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=626364.0, ans=0.125 2023-06-20 04:13:58,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=626424.0, ans=0.0 2023-06-20 04:14:19,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=626424.0, ans=0.125 2023-06-20 04:14:20,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=626424.0, ans=0.2 2023-06-20 04:14:50,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.539e+02 2.943e+02 3.440e+02 5.830e+02, threshold=5.886e+02, percent-clipped=0.0 2023-06-20 04:15:05,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=626544.0, ans=0.0 2023-06-20 04:15:12,137 INFO [train.py:996] (1/4) Epoch 4, batch 12950, loss[loss=0.2588, simple_loss=0.3274, pruned_loss=0.09508, over 21779.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3371, pruned_loss=0.09587, over 4274377.17 frames. ], batch size: 282, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:15:53,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=15.0 2023-06-20 04:16:13,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=626784.0, ans=0.125 2023-06-20 04:16:49,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=626904.0, ans=0.125 2023-06-20 04:16:50,321 INFO [train.py:996] (1/4) Epoch 4, batch 13000, loss[loss=0.2171, simple_loss=0.2984, pruned_loss=0.06794, over 21695.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3367, pruned_loss=0.09565, over 4274594.63 frames. ], batch size: 351, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:17:37,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=627024.0, ans=0.0 2023-06-20 04:18:13,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.732e+02 3.360e+02 4.009e+02 6.698e+02, threshold=6.719e+02, percent-clipped=5.0 2023-06-20 04:18:33,504 INFO [train.py:996] (1/4) Epoch 4, batch 13050, loss[loss=0.2629, simple_loss=0.324, pruned_loss=0.1009, over 21622.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3324, pruned_loss=0.09397, over 4281766.15 frames. ], batch size: 548, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:19:22,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=627324.0, ans=0.125 2023-06-20 04:19:36,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=627324.0, ans=0.125 2023-06-20 04:20:17,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=627504.0, ans=0.0 2023-06-20 04:20:18,867 INFO [train.py:996] (1/4) Epoch 4, batch 13100, loss[loss=0.2592, simple_loss=0.336, pruned_loss=0.09125, over 21302.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.334, pruned_loss=0.0942, over 4276767.12 frames. ], batch size: 548, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:20:29,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=627504.0, ans=0.125 2023-06-20 04:20:47,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=627564.0, ans=0.125 2023-06-20 04:21:19,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=627624.0, ans=0.0 2023-06-20 04:21:21,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=627624.0, ans=0.125 2023-06-20 04:21:48,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.050e+02 3.631e+02 4.148e+02 7.105e+02, threshold=7.262e+02, percent-clipped=1.0 2023-06-20 04:21:55,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=627744.0, ans=0.125 2023-06-20 04:22:03,645 INFO [train.py:996] (1/4) Epoch 4, batch 13150, loss[loss=0.2133, simple_loss=0.2823, pruned_loss=0.07215, over 21418.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3362, pruned_loss=0.09769, over 4276329.84 frames. 
], batch size: 194, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:22:24,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=627804.0, ans=0.0 2023-06-20 04:22:27,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=627804.0, ans=0.2 2023-06-20 04:22:36,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=627864.0, ans=0.1 2023-06-20 04:22:50,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=627864.0, ans=0.125 2023-06-20 04:23:23,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=627984.0, ans=0.125 2023-06-20 04:24:02,000 INFO [train.py:996] (1/4) Epoch 4, batch 13200, loss[loss=0.2888, simple_loss=0.3531, pruned_loss=0.1123, over 21794.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3346, pruned_loss=0.09803, over 4280156.93 frames. ], batch size: 441, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:24:05,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628104.0, ans=0.1 2023-06-20 04:25:14,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=628284.0, ans=0.1 2023-06-20 04:25:27,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 2.669e+02 3.031e+02 3.643e+02 6.014e+02, threshold=6.063e+02, percent-clipped=0.0 2023-06-20 04:25:39,128 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-20 04:25:40,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=628344.0, ans=0.5 2023-06-20 04:25:48,471 INFO [train.py:996] (1/4) Epoch 4, batch 13250, loss[loss=0.2489, simple_loss=0.3192, pruned_loss=0.08928, over 21521.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3358, pruned_loss=0.1005, over 4282983.22 frames. ], batch size: 131, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:25:48,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=628404.0, ans=0.125 2023-06-20 04:25:50,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=628404.0, ans=0.0 2023-06-20 04:25:54,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-20 04:26:45,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=628524.0, ans=0.0 2023-06-20 04:27:04,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=628584.0, ans=0.04949747468305833 2023-06-20 04:27:12,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. 
limit=15.0 2023-06-20 04:27:17,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=628644.0, ans=10.0 2023-06-20 04:27:40,887 INFO [train.py:996] (1/4) Epoch 4, batch 13300, loss[loss=0.2894, simple_loss=0.3677, pruned_loss=0.1055, over 21294.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3378, pruned_loss=0.0994, over 4280548.08 frames. ], batch size: 548, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:27:51,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=628704.0, ans=0.125 2023-06-20 04:28:09,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-06-20 04:28:43,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-20 04:28:52,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=628884.0, ans=0.0 2023-06-20 04:29:02,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.810e+02 3.267e+02 3.707e+02 6.798e+02, threshold=6.534e+02, percent-clipped=1.0 2023-06-20 04:29:21,138 INFO [train.py:996] (1/4) Epoch 4, batch 13350, loss[loss=0.3995, simple_loss=0.4379, pruned_loss=0.1806, over 21390.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3429, pruned_loss=0.1026, over 4282502.19 frames. ], batch size: 507, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:29:56,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=629064.0, ans=0.0 2023-06-20 04:30:30,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=629184.0, ans=0.0 2023-06-20 04:31:05,722 INFO [train.py:996] (1/4) Epoch 4, batch 13400, loss[loss=0.2747, simple_loss=0.3416, pruned_loss=0.1039, over 21794.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.344, pruned_loss=0.1043, over 4282655.75 frames. ], batch size: 351, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:31:31,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-20 04:32:19,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=629484.0, ans=0.1 2023-06-20 04:32:38,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.043e+02 3.589e+02 4.349e+02 7.690e+02, threshold=7.178e+02, percent-clipped=6.0 2023-06-20 04:32:44,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=629544.0, ans=0.1 2023-06-20 04:32:47,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-20 04:32:51,535 INFO [train.py:996] (1/4) Epoch 4, batch 13450, loss[loss=0.2415, simple_loss=0.2941, pruned_loss=0.09447, over 21392.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.347, pruned_loss=0.1069, over 4281002.20 frames. 
], batch size: 194, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:33:07,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=629604.0, ans=0.0 2023-06-20 04:33:31,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=629664.0, ans=0.125 2023-06-20 04:34:46,845 INFO [train.py:996] (1/4) Epoch 4, batch 13500, loss[loss=0.1862, simple_loss=0.2465, pruned_loss=0.063, over 21526.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3347, pruned_loss=0.102, over 4268957.92 frames. ], batch size: 195, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:35:23,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-06-20 04:35:36,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=630024.0, ans=0.1 2023-06-20 04:35:46,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=630024.0, ans=0.2 2023-06-20 04:36:17,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=630144.0, ans=0.0 2023-06-20 04:36:21,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.218e+02 3.754e+02 4.451e+02 7.704e+02, threshold=7.508e+02, percent-clipped=1.0 2023-06-20 04:36:34,722 INFO [train.py:996] (1/4) Epoch 4, batch 13550, loss[loss=0.2821, simple_loss=0.3819, pruned_loss=0.0911, over 21777.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3391, pruned_loss=0.1016, over 4276788.30 frames. ], batch size: 282, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:37:13,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=630264.0, ans=0.0 2023-06-20 04:37:30,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=630324.0, ans=0.125 2023-06-20 04:37:39,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=630384.0, ans=0.0 2023-06-20 04:37:39,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 04:37:44,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630384.0, ans=0.1 2023-06-20 04:38:12,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-20 04:38:18,734 INFO [train.py:996] (1/4) Epoch 4, batch 13600, loss[loss=0.3098, simple_loss=0.3584, pruned_loss=0.1306, over 21632.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3413, pruned_loss=0.1026, over 4275043.31 frames. ], batch size: 471, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 04:38:47,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-20 04:38:52,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.10 vs. 
limit=15.0 2023-06-20 04:38:53,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=630564.0, ans=0.0 2023-06-20 04:39:50,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.831e+02 3.255e+02 3.648e+02 6.704e+02, threshold=6.511e+02, percent-clipped=0.0 2023-06-20 04:40:00,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-20 04:40:01,046 INFO [train.py:996] (1/4) Epoch 4, batch 13650, loss[loss=0.2265, simple_loss=0.287, pruned_loss=0.08298, over 21623.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3356, pruned_loss=0.09869, over 4277883.83 frames. ], batch size: 332, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:40:24,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=630804.0, ans=0.125 2023-06-20 04:40:55,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=630924.0, ans=0.2 2023-06-20 04:41:45,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-20 04:41:45,819 INFO [train.py:996] (1/4) Epoch 4, batch 13700, loss[loss=0.2179, simple_loss=0.27, pruned_loss=0.08286, over 21125.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3298, pruned_loss=0.09802, over 4281362.48 frames. ], batch size: 143, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:42:34,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.28 vs. limit=15.0 2023-06-20 04:42:51,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-20 04:43:24,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-20 04:43:24,874 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.502e+02 4.364e+02 8.587e+02, threshold=7.005e+02, percent-clipped=5.0 2023-06-20 04:43:33,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=631344.0, ans=0.125 2023-06-20 04:43:42,170 INFO [train.py:996] (1/4) Epoch 4, batch 13750, loss[loss=0.3179, simple_loss=0.3844, pruned_loss=0.1257, over 21493.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.327, pruned_loss=0.09635, over 4270340.34 frames. 
], batch size: 471, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:43:45,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631404.0, ans=0.125 2023-06-20 04:43:53,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=631404.0, ans=0.0 2023-06-20 04:44:12,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=631464.0, ans=0.0 2023-06-20 04:44:32,582 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:44:37,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=631524.0, ans=0.125 2023-06-20 04:44:42,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=631524.0, ans=0.035 2023-06-20 04:44:46,430 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:45:12,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=631584.0, ans=0.025 2023-06-20 04:45:28,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=631644.0, ans=0.125 2023-06-20 04:45:32,621 INFO [train.py:996] (1/4) Epoch 4, batch 13800, loss[loss=0.2666, simple_loss=0.3545, pruned_loss=0.08934, over 21691.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3293, pruned_loss=0.09434, over 4273042.70 frames. ], batch size: 247, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:46:04,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=631764.0, ans=0.0 2023-06-20 04:46:36,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=631824.0, ans=0.0 2023-06-20 04:47:05,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=631944.0, ans=0.125 2023-06-20 04:47:06,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.073e+02 3.642e+02 4.560e+02 8.359e+02, threshold=7.284e+02, percent-clipped=3.0 2023-06-20 04:47:22,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-20 04:47:23,528 INFO [train.py:996] (1/4) Epoch 4, batch 13850, loss[loss=0.2666, simple_loss=0.3679, pruned_loss=0.08266, over 20811.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3383, pruned_loss=0.09663, over 4273800.46 frames. ], batch size: 608, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:47:36,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.41 vs. limit=15.0 2023-06-20 04:47:50,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 04:49:09,028 INFO [train.py:996] (1/4) Epoch 4, batch 13900, loss[loss=0.2982, simple_loss=0.3572, pruned_loss=0.1196, over 21776.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3449, pruned_loss=0.1021, over 4279207.08 frames. 
], batch size: 441, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:49:16,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-20 04:50:40,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.556e+02 4.265e+02 5.269e+02 7.474e+02, threshold=8.529e+02, percent-clipped=1.0 2023-06-20 04:50:52,490 INFO [train.py:996] (1/4) Epoch 4, batch 13950, loss[loss=0.2751, simple_loss=0.343, pruned_loss=0.1036, over 21726.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3441, pruned_loss=0.1035, over 4282940.23 frames. ], batch size: 389, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:51:06,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=632604.0, ans=0.025 2023-06-20 04:51:12,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=632664.0, ans=0.125 2023-06-20 04:51:14,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=632664.0, ans=0.0 2023-06-20 04:51:33,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-20 04:51:48,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=632724.0, ans=0.125 2023-06-20 04:52:20,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-06-20 04:52:35,141 INFO [train.py:996] (1/4) Epoch 4, batch 14000, loss[loss=0.1774, simple_loss=0.244, pruned_loss=0.05541, over 21257.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3418, pruned_loss=0.1012, over 4282589.71 frames. ], batch size: 159, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:52:37,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=632904.0, ans=0.0 2023-06-20 04:53:39,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 04:53:40,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=633084.0, ans=0.2 2023-06-20 04:54:01,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=633144.0, ans=0.0 2023-06-20 04:54:05,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.775e+02 3.339e+02 4.026e+02 8.444e+02, threshold=6.679e+02, percent-clipped=0.0 2023-06-20 04:54:17,313 INFO [train.py:996] (1/4) Epoch 4, batch 14050, loss[loss=0.2134, simple_loss=0.2675, pruned_loss=0.0797, over 21204.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3339, pruned_loss=0.09629, over 4274836.75 frames. 
], batch size: 176, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:54:35,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=633204.0, ans=0.125 2023-06-20 04:54:37,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=633264.0, ans=0.125 2023-06-20 04:55:51,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=633444.0, ans=0.0 2023-06-20 04:55:52,506 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.562e-03 2023-06-20 04:56:00,934 INFO [train.py:996] (1/4) Epoch 4, batch 14100, loss[loss=0.3166, simple_loss=0.3534, pruned_loss=0.1399, over 21375.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3274, pruned_loss=0.09649, over 4269336.81 frames. ], batch size: 508, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:56:03,561 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=12.0 2023-06-20 04:56:30,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=633564.0, ans=0.125 2023-06-20 04:56:34,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=633564.0, ans=0.125 2023-06-20 04:56:35,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=633564.0, ans=0.125 2023-06-20 04:57:12,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=633684.0, ans=0.0 2023-06-20 04:57:12,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-20 04:57:31,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=633744.0, ans=0.125 2023-06-20 04:57:34,361 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.697e+02 3.241e+02 4.086e+02 6.665e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 04:57:43,875 INFO [train.py:996] (1/4) Epoch 4, batch 14150, loss[loss=0.2786, simple_loss=0.3541, pruned_loss=0.1015, over 21834.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3313, pruned_loss=0.09778, over 4264262.41 frames. ], batch size: 118, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 04:57:50,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=633804.0, ans=0.2 2023-06-20 04:57:59,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=633804.0, ans=0.1 2023-06-20 04:59:23,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=634104.0, ans=0.125 2023-06-20 04:59:24,535 INFO [train.py:996] (1/4) Epoch 4, batch 14200, loss[loss=0.2191, simple_loss=0.2942, pruned_loss=0.07197, over 21361.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.33, pruned_loss=0.09632, over 4259701.72 frames. 
], batch size: 144, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 04:59:32,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=634104.0, ans=22.5 2023-06-20 04:59:48,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=634164.0, ans=0.0 2023-06-20 05:00:04,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-06-20 05:00:12,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=634224.0, ans=15.0 2023-06-20 05:00:36,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=634284.0, ans=0.0 2023-06-20 05:00:38,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=634284.0, ans=0.0 2023-06-20 05:00:52,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.482e+02 2.802e+02 3.379e+02 6.129e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-20 05:01:07,609 INFO [train.py:996] (1/4) Epoch 4, batch 14250, loss[loss=0.2804, simple_loss=0.3232, pruned_loss=0.1188, over 21531.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3246, pruned_loss=0.09594, over 4264017.19 frames. ], batch size: 391, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:01:13,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=634404.0, ans=0.125 2023-06-20 05:01:26,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-20 05:01:57,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=634524.0, ans=0.0 2023-06-20 05:02:07,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=634584.0, ans=0.125 2023-06-20 05:02:16,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=634584.0, ans=0.125 2023-06-20 05:02:32,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=634644.0, ans=0.0 2023-06-20 05:02:43,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=634644.0, ans=0.125 2023-06-20 05:02:46,950 INFO [train.py:996] (1/4) Epoch 4, batch 14300, loss[loss=0.291, simple_loss=0.3788, pruned_loss=0.1017, over 21667.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3285, pruned_loss=0.09647, over 4251904.21 frames. 
], batch size: 247, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:03:06,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=634704.0, ans=0.0 2023-06-20 05:03:37,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=634824.0, ans=0.125 2023-06-20 05:03:59,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=634884.0, ans=0.125 2023-06-20 05:04:11,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=634884.0, ans=0.125 2023-06-20 05:04:21,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.821e+02 3.422e+02 4.230e+02 9.010e+02, threshold=6.844e+02, percent-clipped=9.0 2023-06-20 05:04:27,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=634944.0, ans=0.07 2023-06-20 05:04:28,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-20 05:04:31,898 INFO [train.py:996] (1/4) Epoch 4, batch 14350, loss[loss=0.2908, simple_loss=0.3476, pruned_loss=0.117, over 21735.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3341, pruned_loss=0.09724, over 4260637.88 frames. ], batch size: 441, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:04:48,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=635004.0, ans=0.015 2023-06-20 05:05:09,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-20 05:06:19,427 INFO [train.py:996] (1/4) Epoch 4, batch 14400, loss[loss=0.2798, simple_loss=0.3304, pruned_loss=0.1146, over 21533.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3317, pruned_loss=0.09837, over 4269975.47 frames. ], batch size: 548, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:06:19,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=635304.0, ans=0.1 2023-06-20 05:06:55,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=635364.0, ans=0.125 2023-06-20 05:07:41,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=635544.0, ans=0.125 2023-06-20 05:07:42,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.821e+02 3.350e+02 4.136e+02 6.839e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-20 05:07:57,239 INFO [train.py:996] (1/4) Epoch 4, batch 14450, loss[loss=0.2609, simple_loss=0.3246, pruned_loss=0.09865, over 21845.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3258, pruned_loss=0.09833, over 4275724.40 frames. 
], batch size: 107, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:08:27,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=635664.0, ans=0.125 2023-06-20 05:08:40,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=635724.0, ans=0.0 2023-06-20 05:08:44,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-20 05:09:12,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=635784.0, ans=0.125 2023-06-20 05:09:13,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=635784.0, ans=0.1 2023-06-20 05:09:39,317 INFO [train.py:996] (1/4) Epoch 4, batch 14500, loss[loss=0.2828, simple_loss=0.3492, pruned_loss=0.1082, over 21633.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3223, pruned_loss=0.09757, over 4273069.28 frames. ], batch size: 414, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:10:00,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-20 05:10:28,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-20 05:10:50,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=636084.0, ans=0.0 2023-06-20 05:11:13,693 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.861e+02 3.336e+02 4.612e+02 7.217e+02, threshold=6.672e+02, percent-clipped=2.0 2023-06-20 05:11:24,706 INFO [train.py:996] (1/4) Epoch 4, batch 14550, loss[loss=0.3055, simple_loss=0.3691, pruned_loss=0.121, over 21560.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3266, pruned_loss=0.09831, over 4265865.87 frames. ], batch size: 389, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:13:16,327 INFO [train.py:996] (1/4) Epoch 4, batch 14600, loss[loss=0.2845, simple_loss=0.3557, pruned_loss=0.1067, over 21475.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3368, pruned_loss=0.1039, over 4269285.44 frames. ], batch size: 211, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:13:19,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-06-20 05:13:27,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=636504.0, ans=0.125 2023-06-20 05:13:41,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=636564.0, ans=0.05 2023-06-20 05:14:25,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=636684.0, ans=0.125 2023-06-20 05:14:43,551 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.039e+02 3.637e+02 4.481e+02 9.662e+02, threshold=7.275e+02, percent-clipped=5.0 2023-06-20 05:14:51,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=636804.0, ans=0.1 2023-06-20 05:14:53,019 INFO [train.py:996] (1/4) Epoch 4, batch 14650, loss[loss=0.2245, simple_loss=0.2824, pruned_loss=0.0833, over 15962.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3403, pruned_loss=0.1039, over 4268884.60 frames. ], batch size: 60, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:15:02,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=636804.0, ans=0.2 2023-06-20 05:15:36,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=636924.0, ans=0.0 2023-06-20 05:16:25,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=637044.0, ans=0.125 2023-06-20 05:16:28,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=637044.0, ans=0.05 2023-06-20 05:16:40,840 INFO [train.py:996] (1/4) Epoch 4, batch 14700, loss[loss=0.1967, simple_loss=0.273, pruned_loss=0.06017, over 21737.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3339, pruned_loss=0.09684, over 4259473.15 frames. ], batch size: 124, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:16:49,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=637104.0, ans=0.0 2023-06-20 05:16:53,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=637104.0, ans=0.0 2023-06-20 05:16:53,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=637104.0, ans=0.1 2023-06-20 05:16:56,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=637164.0, ans=0.0 2023-06-20 05:17:48,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=637284.0, ans=0.0 2023-06-20 05:17:51,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=637284.0, ans=0.125 2023-06-20 05:18:12,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.351e+02 2.788e+02 3.519e+02 6.135e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-20 05:18:22,366 INFO [train.py:996] (1/4) Epoch 4, batch 14750, loss[loss=0.2906, simple_loss=0.3505, pruned_loss=0.1153, over 21592.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3381, pruned_loss=0.09942, over 4265912.62 frames. 
], batch size: 230, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:18:26,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=637404.0, ans=0.2 2023-06-20 05:18:48,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=637464.0, ans=0.125 2023-06-20 05:18:50,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=637464.0, ans=0.2 2023-06-20 05:19:17,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=637524.0, ans=0.2 2023-06-20 05:19:47,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=637644.0, ans=0.125 2023-06-20 05:20:03,211 INFO [train.py:996] (1/4) Epoch 4, batch 14800, loss[loss=0.2368, simple_loss=0.2997, pruned_loss=0.087, over 21297.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3501, pruned_loss=0.1059, over 4264010.02 frames. ], batch size: 131, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 05:21:15,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=637884.0, ans=0.125 2023-06-20 05:21:42,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.164e+02 3.889e+02 4.731e+02 8.129e+02, threshold=7.778e+02, percent-clipped=15.0 2023-06-20 05:22:00,574 INFO [train.py:996] (1/4) Epoch 4, batch 14850, loss[loss=0.2454, simple_loss=0.3089, pruned_loss=0.09098, over 21448.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3432, pruned_loss=0.1038, over 4263429.04 frames. ], batch size: 211, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:22:06,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638004.0, ans=0.1 2023-06-20 05:22:54,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=638124.0, ans=0.0 2023-06-20 05:22:58,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.91 vs. limit=15.0 2023-06-20 05:23:46,622 INFO [train.py:996] (1/4) Epoch 4, batch 14900, loss[loss=0.2862, simple_loss=0.3433, pruned_loss=0.1145, over 21834.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3441, pruned_loss=0.1051, over 4262684.91 frames. ], batch size: 282, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:24:28,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=638424.0, ans=0.125 2023-06-20 05:24:32,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=638424.0, ans=0.0 2023-06-20 05:25:25,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.971e+02 3.790e+02 5.715e+02 1.373e+03, threshold=7.580e+02, percent-clipped=7.0 2023-06-20 05:25:32,239 INFO [train.py:996] (1/4) Epoch 4, batch 14950, loss[loss=0.2784, simple_loss=0.3641, pruned_loss=0.09638, over 19910.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3448, pruned_loss=0.104, over 4258927.95 frames. 
], batch size: 702, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:25:34,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=638604.0, ans=0.125 2023-06-20 05:25:38,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. limit=10.0 2023-06-20 05:26:13,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=638724.0, ans=0.125 2023-06-20 05:27:04,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=638844.0, ans=0.2 2023-06-20 05:27:16,850 INFO [train.py:996] (1/4) Epoch 4, batch 15000, loss[loss=0.2801, simple_loss=0.3854, pruned_loss=0.08742, over 20720.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.348, pruned_loss=0.1064, over 4259355.85 frames. ], batch size: 607, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:27:16,851 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 05:27:33,899 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2743, simple_loss=0.3665, pruned_loss=0.09108, over 1796401.00 frames. 2023-06-20 05:27:33,900 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 05:27:51,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=638904.0, ans=0.0 2023-06-20 05:27:55,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=638964.0, ans=0.0 2023-06-20 05:27:56,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=638964.0, ans=0.125 2023-06-20 05:28:48,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=639084.0, ans=0.0 2023-06-20 05:29:11,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639144.0, ans=0.0 2023-06-20 05:29:12,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.360e+02 3.927e+02 4.837e+02 8.029e+02, threshold=7.853e+02, percent-clipped=2.0 2023-06-20 05:29:24,460 INFO [train.py:996] (1/4) Epoch 4, batch 15050, loss[loss=0.2912, simple_loss=0.3891, pruned_loss=0.09666, over 21195.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3495, pruned_loss=0.107, over 4264108.81 frames. ], batch size: 548, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:29:32,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=639204.0, ans=0.0 2023-06-20 05:29:43,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-20 05:30:24,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=639324.0, ans=0.125 2023-06-20 05:31:03,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639444.0, ans=0.0 2023-06-20 05:31:08,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-06-20 05:31:09,497 INFO [train.py:996] (1/4) Epoch 4, batch 15100, loss[loss=0.3341, simple_loss=0.3944, pruned_loss=0.1369, over 21841.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3517, pruned_loss=0.1066, over 4268250.36 frames. ], batch size: 118, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:31:32,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=639564.0, ans=0.0 2023-06-20 05:32:01,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.16 vs. limit=22.5 2023-06-20 05:32:04,547 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:32:17,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=639684.0, ans=0.125 2023-06-20 05:32:17,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=639684.0, ans=0.125 2023-06-20 05:32:27,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639684.0, ans=0.1 2023-06-20 05:32:39,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=639744.0, ans=0.125 2023-06-20 05:32:45,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.063e+02 3.378e+02 3.992e+02 7.623e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-20 05:32:49,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=639744.0, ans=0.1 2023-06-20 05:32:52,574 INFO [train.py:996] (1/4) Epoch 4, batch 15150, loss[loss=0.227, simple_loss=0.2815, pruned_loss=0.0862, over 21405.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3473, pruned_loss=0.1066, over 4266085.47 frames. ], batch size: 194, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:33:08,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=639804.0, ans=0.125 2023-06-20 05:33:17,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-20 05:33:43,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=639924.0, ans=0.02 2023-06-20 05:33:49,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=639924.0, ans=0.1 2023-06-20 05:34:41,827 INFO [train.py:996] (1/4) Epoch 4, batch 15200, loss[loss=0.2129, simple_loss=0.2603, pruned_loss=0.08275, over 21246.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.339, pruned_loss=0.1026, over 4269020.52 frames. ], batch size: 144, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:35:15,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=15.0 2023-06-20 05:35:17,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=12.0 2023-06-20 05:35:35,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=640224.0, ans=0.125 2023-06-20 05:35:43,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=640284.0, ans=0.0 2023-06-20 05:35:48,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=640284.0, ans=0.125 2023-06-20 05:36:14,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.035e+02 3.960e+02 4.645e+02 7.650e+02, threshold=7.920e+02, percent-clipped=3.0 2023-06-20 05:36:25,891 INFO [train.py:996] (1/4) Epoch 4, batch 15250, loss[loss=0.2694, simple_loss=0.3253, pruned_loss=0.1067, over 21146.00 frames. ], tot_loss[loss=0.267, simple_loss=0.333, pruned_loss=0.1005, over 4271141.78 frames. ], batch size: 143, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:36:28,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=640404.0, ans=0.035 2023-06-20 05:37:41,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=640584.0, ans=0.0 2023-06-20 05:37:54,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.37 vs. limit=10.0 2023-06-20 05:38:17,119 INFO [train.py:996] (1/4) Epoch 4, batch 15300, loss[loss=0.2278, simple_loss=0.3172, pruned_loss=0.06917, over 20761.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3352, pruned_loss=0.1035, over 4277822.80 frames. ], batch size: 609, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:38:22,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=640704.0, ans=0.125 2023-06-20 05:38:43,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=12.0 2023-06-20 05:39:10,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=640824.0, ans=0.1 2023-06-20 05:39:47,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=640944.0, ans=0.125 2023-06-20 05:39:54,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 2.881e+02 3.296e+02 3.984e+02 9.139e+02, threshold=6.591e+02, percent-clipped=2.0 2023-06-20 05:40:01,916 INFO [train.py:996] (1/4) Epoch 4, batch 15350, loss[loss=0.2604, simple_loss=0.3555, pruned_loss=0.08267, over 21764.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3378, pruned_loss=0.1044, over 4271468.96 frames. 
], batch size: 247, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:40:32,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=641064.0, ans=0.0 2023-06-20 05:40:38,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=641124.0, ans=0.0 2023-06-20 05:41:00,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=641184.0, ans=0.0 2023-06-20 05:41:32,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=641244.0, ans=0.125 2023-06-20 05:41:36,536 INFO [train.py:996] (1/4) Epoch 4, batch 15400, loss[loss=0.3077, simple_loss=0.3942, pruned_loss=0.1106, over 19962.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3384, pruned_loss=0.103, over 4268227.59 frames. ], batch size: 703, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:41:50,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=641304.0, ans=0.1 2023-06-20 05:42:12,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=641364.0, ans=0.04949747468305833 2023-06-20 05:42:37,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=641484.0, ans=0.1 2023-06-20 05:42:46,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=641484.0, ans=0.125 2023-06-20 05:42:55,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-20 05:43:07,580 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.755e+02 3.304e+02 3.947e+02 7.271e+02, threshold=6.607e+02, percent-clipped=2.0 2023-06-20 05:43:19,990 INFO [train.py:996] (1/4) Epoch 4, batch 15450, loss[loss=0.2841, simple_loss=0.36, pruned_loss=0.1041, over 21554.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3374, pruned_loss=0.1025, over 4249825.13 frames. ], batch size: 471, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:43:22,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-20 05:43:45,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=641664.0, ans=0.07 2023-06-20 05:44:18,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=641784.0, ans=0.2 2023-06-20 05:44:30,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=641784.0, ans=0.0 2023-06-20 05:45:10,523 INFO [train.py:996] (1/4) Epoch 4, batch 15500, loss[loss=0.3213, simple_loss=0.3748, pruned_loss=0.1339, over 21515.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3392, pruned_loss=0.1012, over 4246680.55 frames. ], batch size: 194, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:45:18,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
limit=22.5 2023-06-20 05:45:28,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=641904.0, ans=0.125 2023-06-20 05:45:28,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=641904.0, ans=0.1 2023-06-20 05:45:28,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-20 05:45:34,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=641964.0, ans=0.125 2023-06-20 05:45:40,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-20 05:45:52,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-20 05:46:42,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=642144.0, ans=0.125 2023-06-20 05:46:56,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.827e+02 3.458e+02 4.345e+02 6.798e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-20 05:47:00,854 INFO [train.py:996] (1/4) Epoch 4, batch 15550, loss[loss=0.2372, simple_loss=0.307, pruned_loss=0.08371, over 21400.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3385, pruned_loss=0.09826, over 4245123.32 frames. ], batch size: 131, lr: 7.96e-03, grad_scale: 16.0 2023-06-20 05:47:20,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=642264.0, ans=0.125 2023-06-20 05:47:25,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=642264.0, ans=0.125 2023-06-20 05:47:27,396 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:47:28,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=642264.0, ans=0.125 2023-06-20 05:47:29,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=642264.0, ans=0.125 2023-06-20 05:47:51,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.92 vs. limit=22.5 2023-06-20 05:47:52,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=642384.0, ans=0.04949747468305833 2023-06-20 05:48:22,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=642444.0, ans=0.1 2023-06-20 05:48:22,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642444.0, ans=0.1 2023-06-20 05:48:33,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=642444.0, ans=0.1 2023-06-20 05:48:37,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.22 vs. 
limit=22.5 2023-06-20 05:48:39,090 INFO [train.py:996] (1/4) Epoch 4, batch 15600, loss[loss=0.3129, simple_loss=0.4408, pruned_loss=0.09255, over 19762.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3329, pruned_loss=0.09722, over 4243927.54 frames. ], batch size: 702, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 05:49:43,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=642684.0, ans=0.0 2023-06-20 05:50:03,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=642744.0, ans=0.125 2023-06-20 05:50:17,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.694e+02 3.221e+02 3.969e+02 6.566e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 05:50:21,190 INFO [train.py:996] (1/4) Epoch 4, batch 15650, loss[loss=0.2383, simple_loss=0.3002, pruned_loss=0.08824, over 21768.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3315, pruned_loss=0.0959, over 4244981.48 frames. ], batch size: 112, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:50:25,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=642804.0, ans=0.0 2023-06-20 05:50:34,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=642804.0, ans=0.2 2023-06-20 05:50:36,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=642864.0, ans=0.2 2023-06-20 05:51:40,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=642984.0, ans=0.09899494936611666 2023-06-20 05:52:06,088 INFO [train.py:996] (1/4) Epoch 4, batch 15700, loss[loss=0.2445, simple_loss=0.3277, pruned_loss=0.08067, over 21736.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3273, pruned_loss=0.09512, over 4248072.78 frames. ], batch size: 282, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:52:23,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-20 05:52:26,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.10 vs. 
limit=12.0 2023-06-20 05:52:43,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=643224.0, ans=0.0 2023-06-20 05:52:59,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=643224.0, ans=0.0 2023-06-20 05:53:02,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=643224.0, ans=0.125 2023-06-20 05:53:04,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=643224.0, ans=15.0 2023-06-20 05:53:35,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=643344.0, ans=0.05 2023-06-20 05:53:46,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.685e+02 3.204e+02 3.710e+02 6.179e+02, threshold=6.407e+02, percent-clipped=0.0 2023-06-20 05:53:49,709 INFO [train.py:996] (1/4) Epoch 4, batch 15750, loss[loss=0.2507, simple_loss=0.2952, pruned_loss=0.1031, over 20233.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3224, pruned_loss=0.09524, over 4247654.13 frames. ], batch size: 702, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:54:44,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643524.0, ans=0.1 2023-06-20 05:55:31,420 INFO [train.py:996] (1/4) Epoch 4, batch 15800, loss[loss=0.2897, simple_loss=0.3368, pruned_loss=0.1213, over 21827.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3191, pruned_loss=0.09627, over 4257585.53 frames. ], batch size: 372, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:56:31,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-20 05:56:32,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=643884.0, ans=0.2 2023-06-20 05:57:01,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=643944.0, ans=0.0 2023-06-20 05:57:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=643944.0, ans=0.125 2023-06-20 05:57:11,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.033e+02 3.511e+02 4.104e+02 6.063e+02, threshold=7.023e+02, percent-clipped=0.0 2023-06-20 05:57:14,441 INFO [train.py:996] (1/4) Epoch 4, batch 15850, loss[loss=0.2371, simple_loss=0.2958, pruned_loss=0.0892, over 21337.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3203, pruned_loss=0.09838, over 4266697.21 frames. ], batch size: 194, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:58:11,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=644184.0, ans=0.125 2023-06-20 05:58:48,829 INFO [train.py:996] (1/4) Epoch 4, batch 15900, loss[loss=0.2852, simple_loss=0.3194, pruned_loss=0.1255, over 21313.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3187, pruned_loss=0.09873, over 4269190.75 frames. 
], batch size: 507, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 05:58:50,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=644304.0, ans=0.125 2023-06-20 05:58:52,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=644304.0, ans=0.125 2023-06-20 05:58:55,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=644304.0, ans=0.0 2023-06-20 05:59:58,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=644484.0, ans=0.125 2023-06-20 06:00:18,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=644544.0, ans=0.125 2023-06-20 06:00:28,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.691e+02 3.018e+02 3.866e+02 6.282e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-20 06:00:32,129 INFO [train.py:996] (1/4) Epoch 4, batch 15950, loss[loss=0.2027, simple_loss=0.2751, pruned_loss=0.06513, over 21338.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3182, pruned_loss=0.09577, over 4261193.68 frames. ], batch size: 194, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 06:00:42,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-20 06:00:56,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.63 vs. limit=22.5 2023-06-20 06:01:26,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644784.0, ans=0.1 2023-06-20 06:02:06,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=644844.0, ans=0.0 2023-06-20 06:02:14,372 INFO [train.py:996] (1/4) Epoch 4, batch 16000, loss[loss=0.2368, simple_loss=0.3276, pruned_loss=0.07296, over 21850.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3212, pruned_loss=0.09348, over 4271376.36 frames. ], batch size: 282, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:02:40,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=644964.0, ans=0.125 2023-06-20 06:02:53,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=645024.0, ans=0.1 2023-06-20 06:03:38,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=645144.0, ans=0.125 2023-06-20 06:03:53,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.704e+02 3.132e+02 3.952e+02 7.192e+02, threshold=6.264e+02, percent-clipped=3.0 2023-06-20 06:03:57,352 INFO [train.py:996] (1/4) Epoch 4, batch 16050, loss[loss=0.2154, simple_loss=0.3047, pruned_loss=0.06307, over 21724.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.091, over 4279308.08 frames. 
], batch size: 298, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:04:00,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=645204.0, ans=0.0 2023-06-20 06:04:36,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=645324.0, ans=0.1 2023-06-20 06:05:17,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-20 06:05:40,812 INFO [train.py:996] (1/4) Epoch 4, batch 16100, loss[loss=0.2865, simple_loss=0.342, pruned_loss=0.1155, over 21760.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3283, pruned_loss=0.09289, over 4287717.43 frames. ], batch size: 441, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:05:49,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=645504.0, ans=0.0 2023-06-20 06:06:09,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=645564.0, ans=0.125 2023-06-20 06:06:38,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=645684.0, ans=0.125 2023-06-20 06:07:20,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.023e+02 3.501e+02 4.351e+02 8.172e+02, threshold=7.003e+02, percent-clipped=2.0 2023-06-20 06:07:23,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-20 06:07:23,551 INFO [train.py:996] (1/4) Epoch 4, batch 16150, loss[loss=0.2673, simple_loss=0.3258, pruned_loss=0.1044, over 21645.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3301, pruned_loss=0.09589, over 4288934.09 frames. ], batch size: 195, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:07:25,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=645804.0, ans=0.2 2023-06-20 06:07:30,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=645804.0, ans=0.0 2023-06-20 06:07:43,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=645864.0, ans=0.125 2023-06-20 06:09:06,385 INFO [train.py:996] (1/4) Epoch 4, batch 16200, loss[loss=0.3425, simple_loss=0.3973, pruned_loss=0.1438, over 21437.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3352, pruned_loss=0.09863, over 4294344.47 frames. 
], batch size: 471, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:09:26,813 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:09:34,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=646164.0, ans=0.0 2023-06-20 06:09:43,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=646224.0, ans=0.125 2023-06-20 06:10:39,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=646344.0, ans=0.125 2023-06-20 06:10:40,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.687e+02 3.085e+02 4.076e+02 6.886e+02, threshold=6.170e+02, percent-clipped=1.0 2023-06-20 06:10:44,041 INFO [train.py:996] (1/4) Epoch 4, batch 16250, loss[loss=0.1758, simple_loss=0.2441, pruned_loss=0.05377, over 21003.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3341, pruned_loss=0.09817, over 4284905.03 frames. ], batch size: 143, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:11:18,522 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:12:17,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=646644.0, ans=0.125 2023-06-20 06:12:25,355 INFO [train.py:996] (1/4) Epoch 4, batch 16300, loss[loss=0.2295, simple_loss=0.3132, pruned_loss=0.07283, over 21719.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3263, pruned_loss=0.09275, over 4283635.27 frames. ], batch size: 332, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:12:25,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=646704.0, ans=0.0 2023-06-20 06:12:38,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=646704.0, ans=0.125 2023-06-20 06:12:40,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=646764.0, ans=0.125 2023-06-20 06:13:35,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=646884.0, ans=0.125 2023-06-20 06:13:51,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-06-20 06:14:05,361 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.535e+02 3.018e+02 3.417e+02 5.954e+02, threshold=6.036e+02, percent-clipped=0.0 2023-06-20 06:14:08,711 INFO [train.py:996] (1/4) Epoch 4, batch 16350, loss[loss=0.3336, simple_loss=0.3859, pruned_loss=0.1406, over 21411.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3259, pruned_loss=0.09405, over 4283529.17 frames. 
], batch size: 471, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:14:57,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=647124.0, ans=0.125 2023-06-20 06:15:23,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=647184.0, ans=0.07 2023-06-20 06:15:52,734 INFO [train.py:996] (1/4) Epoch 4, batch 16400, loss[loss=0.3449, simple_loss=0.3827, pruned_loss=0.1535, over 21707.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3302, pruned_loss=0.09655, over 4289228.83 frames. ], batch size: 507, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:16:38,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=647424.0, ans=0.125 2023-06-20 06:16:38,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=647424.0, ans=0.1 2023-06-20 06:16:57,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=647484.0, ans=0.125 2023-06-20 06:17:31,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.888e+02 3.304e+02 3.936e+02 7.106e+02, threshold=6.607e+02, percent-clipped=3.0 2023-06-20 06:17:35,176 INFO [train.py:996] (1/4) Epoch 4, batch 16450, loss[loss=0.2791, simple_loss=0.334, pruned_loss=0.1121, over 21916.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3317, pruned_loss=0.09794, over 4295646.20 frames. ], batch size: 316, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:18:04,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=647664.0, ans=0.125 2023-06-20 06:18:41,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=647784.0, ans=0.125 2023-06-20 06:19:19,905 INFO [train.py:996] (1/4) Epoch 4, batch 16500, loss[loss=0.285, simple_loss=0.358, pruned_loss=0.106, over 21687.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3287, pruned_loss=0.09782, over 4290032.28 frames. 
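Note on the [scaling.py:962] Whitening lines: each compares a statistic of a module's activation covariance (the "metric") against a limit; the larger the metric relative to the limit, the less "white" the activations are, and only then is a corrective penalty applied. The sketch below shows one way such a metric can be defined, equal to 1.0 when the within-group covariance is isotropic and growing with the spread of its eigenvalues. It illustrates the idea only; it is not claimed to be the exact formula in scaling.py.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels). Returns a value >= 1.0 that equals 1.0
    when each channel group's covariance is proportional to the identity."""
    n, c = x.shape
    assert c % num_groups == 0
    xg = x.reshape(n, num_groups, c // num_groups).transpose(0, 1)  # (groups, frames, chans)
    xg = xg - xg.mean(dim=1, keepdim=True)
    cov = xg.transpose(1, 2) @ xg / n                               # per-group covariance
    eigs = torch.linalg.eigvalsh(cov)
    # Mean of squared eigenvalues over squared mean: 1.0 iff all eigenvalues are equal.
    metric = (eigs.pow(2).mean(dim=1) / eigs.mean(dim=1).pow(2)).mean()
    return metric.item()

x = torch.randn(2000, 256)                      # roughly white activations
print(whitening_metric(x), "vs. limit=15.0")    # stays well below the limit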
], batch size: 414, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:19:20,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=647904.0, ans=0.0 2023-06-20 06:19:32,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=647904.0, ans=0.125 2023-06-20 06:19:37,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=647964.0, ans=0.125 2023-06-20 06:19:48,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=647964.0, ans=0.1 2023-06-20 06:19:59,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=647964.0, ans=0.0 2023-06-20 06:20:14,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=648024.0, ans=8.0 2023-06-20 06:20:28,800 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.665e-03 2023-06-20 06:20:51,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=648144.0, ans=0.125 2023-06-20 06:20:54,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-20 06:21:03,942 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.948e+02 3.472e+02 4.236e+02 9.691e+02, threshold=6.943e+02, percent-clipped=9.0 2023-06-20 06:21:05,575 INFO [train.py:996] (1/4) Epoch 4, batch 16550, loss[loss=0.287, simple_loss=0.3748, pruned_loss=0.09962, over 19825.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3254, pruned_loss=0.09383, over 4274832.98 frames. ], batch size: 702, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:21:19,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=648204.0, ans=0.2 2023-06-20 06:21:31,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=10.0 2023-06-20 06:21:42,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=648264.0, ans=0.125 2023-06-20 06:22:15,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-20 06:22:59,758 INFO [train.py:996] (1/4) Epoch 4, batch 16600, loss[loss=0.3837, simple_loss=0.4602, pruned_loss=0.1536, over 21691.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3356, pruned_loss=0.0981, over 4273105.26 frames. ], batch size: 441, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:23:34,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
limit=15.0 2023-06-20 06:24:06,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=648684.0, ans=0.0 2023-06-20 06:24:48,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.275e+02 4.120e+02 5.174e+02 8.172e+02, threshold=8.240e+02, percent-clipped=2.0 2023-06-20 06:24:49,934 INFO [train.py:996] (1/4) Epoch 4, batch 16650, loss[loss=0.3254, simple_loss=0.378, pruned_loss=0.1364, over 19889.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3482, pruned_loss=0.103, over 4266287.19 frames. ], batch size: 702, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:25:14,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=648864.0, ans=0.125 2023-06-20 06:25:19,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648864.0, ans=0.1 2023-06-20 06:26:16,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.46 vs. limit=6.0 2023-06-20 06:26:24,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=649044.0, ans=10.0 2023-06-20 06:26:33,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-06-20 06:26:41,397 INFO [train.py:996] (1/4) Epoch 4, batch 16700, loss[loss=0.2532, simple_loss=0.3142, pruned_loss=0.0961, over 21668.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3506, pruned_loss=0.1042, over 4265460.74 frames. ], batch size: 263, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:26:41,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=649104.0, ans=0.0 2023-06-20 06:26:57,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=649164.0, ans=0.0 2023-06-20 06:28:23,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=649344.0, ans=0.125 2023-06-20 06:28:25,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=649344.0, ans=0.1 2023-06-20 06:28:26,846 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.053e+02 3.657e+02 4.357e+02 8.504e+02, threshold=7.314e+02, percent-clipped=1.0 2023-06-20 06:28:28,504 INFO [train.py:996] (1/4) Epoch 4, batch 16750, loss[loss=0.379, simple_loss=0.4321, pruned_loss=0.163, over 21358.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.351, pruned_loss=0.1064, over 4263452.51 frames. 
], batch size: 507, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:29:21,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=649524.0, ans=0.025 2023-06-20 06:29:27,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=649524.0, ans=0.125 2023-06-20 06:29:28,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=649524.0, ans=0.0 2023-06-20 06:29:54,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649584.0, ans=0.1 2023-06-20 06:29:54,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=649584.0, ans=0.125 2023-06-20 06:30:08,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=649644.0, ans=0.125 2023-06-20 06:30:09,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-20 06:30:11,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=649704.0, ans=0.2 2023-06-20 06:30:12,940 INFO [train.py:996] (1/4) Epoch 4, batch 16800, loss[loss=0.2682, simple_loss=0.3256, pruned_loss=0.1054, over 22046.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3551, pruned_loss=0.1067, over 4250228.33 frames. ], batch size: 119, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:30:21,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=649704.0, ans=0.0 2023-06-20 06:30:46,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=649764.0, ans=0.125 2023-06-20 06:30:51,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649764.0, ans=0.1 2023-06-20 06:31:24,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649884.0, ans=0.1 2023-06-20 06:31:31,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=649884.0, ans=0.125 2023-06-20 06:31:58,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.215e+02 3.899e+02 5.075e+02 9.613e+02, threshold=7.798e+02, percent-clipped=9.0 2023-06-20 06:32:00,526 INFO [train.py:996] (1/4) Epoch 4, batch 16850, loss[loss=0.2319, simple_loss=0.2978, pruned_loss=0.08301, over 21767.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3503, pruned_loss=0.1061, over 4263247.47 frames. ], batch size: 247, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:32:50,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=650124.0, ans=0.95 2023-06-20 06:33:02,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-20 06:33:37,196 INFO [train.py:996] (1/4) Epoch 4, batch 16900, loss[loss=0.2674, simple_loss=0.3202, pruned_loss=0.1073, over 20107.00 frames. 
], tot_loss[loss=0.2756, simple_loss=0.3438, pruned_loss=0.1037, over 4270088.05 frames. ], batch size: 703, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:34:11,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-20 06:34:46,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-20 06:35:14,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=650544.0, ans=0.2 2023-06-20 06:35:17,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.495e+02 2.882e+02 3.347e+02 4.730e+02, threshold=5.764e+02, percent-clipped=0.0 2023-06-20 06:35:18,581 INFO [train.py:996] (1/4) Epoch 4, batch 16950, loss[loss=0.2377, simple_loss=0.3039, pruned_loss=0.08571, over 21472.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3353, pruned_loss=0.1007, over 4277162.41 frames. ], batch size: 194, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:35:40,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-20 06:36:24,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=650724.0, ans=0.125 2023-06-20 06:36:47,637 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:36:48,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-20 06:37:01,047 INFO [train.py:996] (1/4) Epoch 4, batch 17000, loss[loss=0.2594, simple_loss=0.3127, pruned_loss=0.103, over 21703.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.333, pruned_loss=0.1016, over 4289615.54 frames. ], batch size: 230, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:37:45,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=650964.0, ans=0.95 2023-06-20 06:38:20,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=651084.0, ans=0.0 2023-06-20 06:38:57,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.930e+02 3.368e+02 4.146e+02 7.848e+02, threshold=6.737e+02, percent-clipped=5.0 2023-06-20 06:38:57,197 INFO [train.py:996] (1/4) Epoch 4, batch 17050, loss[loss=0.2806, simple_loss=0.3526, pruned_loss=0.1043, over 21664.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3397, pruned_loss=0.1042, over 4285509.29 frames. 
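Note on the train.py loss lines: the reported loss is consistent with a fixed weighted sum of the two logged components, 0.5 * simple_loss + pruned_loss, for example 0.5 * 0.3397 + 0.1042 = 0.2741 for the tot_loss entry just above, and 0.5 * 0.3526 + 0.1043 = 0.2806 for the per-batch entry. A small check of that relation, assuming the 0.5 weight stays fixed at this stage of training (the recipe may weight the terms differently early on):

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5, pruned_loss_scale: float = 1.0) -> float:
    # Weighted combination consistent with the values logged above.
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

assert abs(combined_loss(0.3397, 0.1042) - 0.2741) < 1e-3   # tot_loss at batch 17050
assert abs(combined_loss(0.3526, 0.1043) - 0.2806) < 1e-3   # per-batch loss at batch 17050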
], batch size: 263, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:39:01,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=651204.0, ans=0.0 2023-06-20 06:39:40,981 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:39:41,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=651324.0, ans=0.2 2023-06-20 06:39:55,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=651384.0, ans=0.125 2023-06-20 06:40:31,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=651504.0, ans=0.0 2023-06-20 06:40:37,625 INFO [train.py:996] (1/4) Epoch 4, batch 17100, loss[loss=0.2682, simple_loss=0.3277, pruned_loss=0.1044, over 21311.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3392, pruned_loss=0.1045, over 4292745.23 frames. ], batch size: 159, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:41:34,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=651684.0, ans=0.125 2023-06-20 06:42:19,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.907e+02 3.321e+02 3.692e+02 6.035e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 06:42:19,623 INFO [train.py:996] (1/4) Epoch 4, batch 17150, loss[loss=0.2309, simple_loss=0.3045, pruned_loss=0.07867, over 21694.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3355, pruned_loss=0.1045, over 4299343.22 frames. ], batch size: 230, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:42:23,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-20 06:42:46,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=651864.0, ans=0.0 2023-06-20 06:42:47,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-20 06:42:52,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-20 06:43:17,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651984.0, ans=0.1 2023-06-20 06:43:40,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=652044.0, ans=0.04949747468305833 2023-06-20 06:44:07,963 INFO [train.py:996] (1/4) Epoch 4, batch 17200, loss[loss=0.2895, simple_loss=0.3482, pruned_loss=0.1154, over 21603.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3363, pruned_loss=0.1044, over 4294401.15 frames. 
], batch size: 230, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:44:08,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=652104.0, ans=0.125 2023-06-20 06:44:46,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=652224.0, ans=0.1 2023-06-20 06:44:51,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=652224.0, ans=0.125 2023-06-20 06:44:55,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=652224.0, ans=0.125 2023-06-20 06:44:59,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-20 06:45:57,617 INFO [train.py:996] (1/4) Epoch 4, batch 17250, loss[loss=0.2962, simple_loss=0.355, pruned_loss=0.1187, over 21371.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3404, pruned_loss=0.1065, over 4286796.27 frames. ], batch size: 143, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:45:59,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.975e+02 3.249e+02 4.201e+02 6.802e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-20 06:46:23,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=652464.0, ans=10.0 2023-06-20 06:46:24,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=652464.0, ans=0.0 2023-06-20 06:46:42,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.44 vs. limit=12.0 2023-06-20 06:47:36,634 INFO [train.py:996] (1/4) Epoch 4, batch 17300, loss[loss=0.3079, simple_loss=0.4033, pruned_loss=0.1062, over 19895.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3486, pruned_loss=0.1093, over 4279911.12 frames. ], batch size: 703, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:47:51,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=652764.0, ans=0.125 2023-06-20 06:47:57,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=8.0 2023-06-20 06:48:10,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=652824.0, ans=0.125 2023-06-20 06:49:06,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=652944.0, ans=0.0 2023-06-20 06:49:17,041 INFO [train.py:996] (1/4) Epoch 4, batch 17350, loss[loss=0.2329, simple_loss=0.3157, pruned_loss=0.07509, over 21623.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3496, pruned_loss=0.1092, over 4279838.64 frames. 
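Note on the grad_scale field in the loss lines: it moves between 16.0 and 32.0 across nearby batches (32.0 just above, 16.0 just below), which is characteristic of dynamic fp16 loss scaling that grows the scale after a run of overflow-free steps and halves it when an overflow is detected. A generic sketch of that mechanism with torch.cuda.amp, given as an illustration rather than this recipe's actual training loop:

import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                   backoff_factor=0.5, growth_interval=2000)

def train_step(model, optimizer, features):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(features).mean()        # placeholder loss computation
    scaler.scale(loss).backward()            # backward on the scaled loss
    scaler.step(optimizer)                   # step is skipped if grads overflowed
    scaler.update()                          # grow or back off the scale
    return loss.detach(), scaler.get_scale() # second value is what "grad_scale" reports

# Usage (hypothetical model, optimizer, and feature batch):
#   loss, grad_scale = train_step(model, optimizer, features.cuda())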
], batch size: 263, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:49:18,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.182e+02 3.780e+02 4.285e+02 5.975e+02, threshold=7.560e+02, percent-clipped=0.0 2023-06-20 06:49:36,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=653064.0, ans=0.125 2023-06-20 06:49:36,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=653064.0, ans=0.125 2023-06-20 06:50:17,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.78 vs. limit=15.0 2023-06-20 06:50:19,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-20 06:50:44,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=653244.0, ans=0.125 2023-06-20 06:50:59,146 INFO [train.py:996] (1/4) Epoch 4, batch 17400, loss[loss=0.2103, simple_loss=0.2694, pruned_loss=0.0756, over 21824.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3444, pruned_loss=0.1048, over 4269458.69 frames. ], batch size: 118, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:51:14,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=653364.0, ans=0.95 2023-06-20 06:51:51,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653424.0, ans=0.1 2023-06-20 06:52:42,145 INFO [train.py:996] (1/4) Epoch 4, batch 17450, loss[loss=0.1784, simple_loss=0.2593, pruned_loss=0.04876, over 21289.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3402, pruned_loss=0.1018, over 4264659.18 frames. ], batch size: 176, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:52:43,975 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.974e+02 3.600e+02 4.231e+02 7.262e+02, threshold=7.200e+02, percent-clipped=0.0 2023-06-20 06:53:02,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653604.0, ans=0.1 2023-06-20 06:53:47,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=653724.0, ans=0.1 2023-06-20 06:53:50,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=653784.0, ans=0.125 2023-06-20 06:54:04,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=653844.0, ans=0.125 2023-06-20 06:54:06,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=653844.0, ans=0.125 2023-06-20 06:54:08,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=653844.0, ans=0.125 2023-06-20 06:54:13,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=653844.0, ans=0.125 2023-06-20 06:54:28,261 INFO [train.py:996] (1/4) Epoch 4, batch 17500, loss[loss=0.292, simple_loss=0.3459, pruned_loss=0.119, over 21452.00 frames. 
], tot_loss[loss=0.2666, simple_loss=0.3346, pruned_loss=0.09934, over 4264137.89 frames. ], batch size: 131, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:55:37,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=654084.0, ans=0.2 2023-06-20 06:55:58,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=654144.0, ans=0.125 2023-06-20 06:56:02,395 INFO [train.py:996] (1/4) Epoch 4, batch 17550, loss[loss=0.2206, simple_loss=0.3097, pruned_loss=0.06577, over 21854.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3351, pruned_loss=0.09797, over 4270900.01 frames. ], batch size: 107, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:56:04,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 2.427e+02 2.904e+02 3.611e+02 6.733e+02, threshold=5.808e+02, percent-clipped=0.0 2023-06-20 06:56:26,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=654204.0, ans=0.125 2023-06-20 06:56:27,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=654264.0, ans=0.125 2023-06-20 06:57:15,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=654384.0, ans=0.2 2023-06-20 06:57:43,283 INFO [train.py:996] (1/4) Epoch 4, batch 17600, loss[loss=0.3115, simple_loss=0.3704, pruned_loss=0.1263, over 21560.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3365, pruned_loss=0.09758, over 4268212.12 frames. ], batch size: 389, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 06:57:43,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=654504.0, ans=10.0 2023-06-20 06:58:03,665 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:58:33,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=654564.0, ans=0.0 2023-06-20 06:59:32,081 INFO [train.py:996] (1/4) Epoch 4, batch 17650, loss[loss=0.1845, simple_loss=0.2498, pruned_loss=0.05957, over 21696.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3351, pruned_loss=0.09779, over 4259874.55 frames. ], batch size: 247, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:59:40,725 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.037e+02 3.780e+02 4.490e+02 8.251e+02, threshold=7.559e+02, percent-clipped=12.0 2023-06-20 06:59:56,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=654804.0, ans=0.0 2023-06-20 07:00:50,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-20 07:01:03,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-20 07:01:20,960 INFO [train.py:996] (1/4) Epoch 4, batch 17700, loss[loss=0.2361, simple_loss=0.3345, pruned_loss=0.06884, over 20745.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3265, pruned_loss=0.0932, over 4256677.15 frames. 
], batch size: 607, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:01:57,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=655164.0, ans=0.125 2023-06-20 07:02:06,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=655224.0, ans=0.0 2023-06-20 07:02:19,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=655284.0, ans=0.125 2023-06-20 07:02:28,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=655284.0, ans=0.125 2023-06-20 07:03:06,197 INFO [train.py:996] (1/4) Epoch 4, batch 17750, loss[loss=0.3117, simple_loss=0.3787, pruned_loss=0.1224, over 21457.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3358, pruned_loss=0.0979, over 4259687.43 frames. ], batch size: 211, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:03:09,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.874e+02 3.454e+02 4.122e+02 5.655e+02, threshold=6.909e+02, percent-clipped=0.0 2023-06-20 07:04:50,168 INFO [train.py:996] (1/4) Epoch 4, batch 17800, loss[loss=0.2324, simple_loss=0.3087, pruned_loss=0.07806, over 21626.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3372, pruned_loss=0.09817, over 4266209.42 frames. ], batch size: 230, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:05:14,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=655764.0, ans=0.1 2023-06-20 07:05:26,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-20 07:05:33,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-20 07:05:35,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=655824.0, ans=0.0 2023-06-20 07:05:44,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=655824.0, ans=0.2 2023-06-20 07:06:02,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-20 07:06:05,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=655884.0, ans=0.09899494936611666 2023-06-20 07:06:17,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0 2023-06-20 07:06:33,108 INFO [train.py:996] (1/4) Epoch 4, batch 17850, loss[loss=0.295, simple_loss=0.3569, pruned_loss=0.1166, over 21621.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3407, pruned_loss=0.1001, over 4261038.60 frames. 
], batch size: 263, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:06:36,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.792e+02 3.277e+02 4.116e+02 6.981e+02, threshold=6.554e+02, percent-clipped=1.0 2023-06-20 07:06:55,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=656064.0, ans=0.125 2023-06-20 07:07:11,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=656124.0, ans=0.025 2023-06-20 07:07:11,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-20 07:08:02,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-20 07:08:17,923 INFO [train.py:996] (1/4) Epoch 4, batch 17900, loss[loss=0.3336, simple_loss=0.4211, pruned_loss=0.123, over 21844.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.345, pruned_loss=0.1013, over 4263627.33 frames. ], batch size: 371, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:08:22,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=656304.0, ans=0.125 2023-06-20 07:09:16,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-20 07:09:23,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=656424.0, ans=0.125 2023-06-20 07:09:38,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=656484.0, ans=0.125 2023-06-20 07:09:53,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=656544.0, ans=0.0 2023-06-20 07:10:01,317 INFO [train.py:996] (1/4) Epoch 4, batch 17950, loss[loss=0.182, simple_loss=0.2574, pruned_loss=0.05331, over 21359.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3425, pruned_loss=0.09678, over 4265271.01 frames. ], batch size: 131, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:10:04,341 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.785e+02 3.219e+02 3.835e+02 8.514e+02, threshold=6.438e+02, percent-clipped=3.0 2023-06-20 07:10:42,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.35 vs. limit=15.0 2023-06-20 07:11:23,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=656784.0, ans=10.0 2023-06-20 07:11:44,086 INFO [train.py:996] (1/4) Epoch 4, batch 18000, loss[loss=0.2454, simple_loss=0.3012, pruned_loss=0.09481, over 21421.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3358, pruned_loss=0.09548, over 4256123.52 frames. 
], batch size: 389, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:11:44,086 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 07:11:59,889 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.3588, 3.6309, 2.0049, 1.6623], device='cuda:1') 2023-06-20 07:12:05,671 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2767, simple_loss=0.3741, pruned_loss=0.08966, over 1796401.00 frames. 2023-06-20 07:12:05,671 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 07:12:50,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-20 07:13:09,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657084.0, ans=0.1 2023-06-20 07:13:11,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=657084.0, ans=0.2 2023-06-20 07:13:14,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=657084.0, ans=0.125 2023-06-20 07:13:24,405 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:13:33,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=657144.0, ans=0.125 2023-06-20 07:13:55,468 INFO [train.py:996] (1/4) Epoch 4, batch 18050, loss[loss=0.23, simple_loss=0.2923, pruned_loss=0.08389, over 21658.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3292, pruned_loss=0.09409, over 4265007.12 frames. ], batch size: 247, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:14:03,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.056e+02 3.833e+02 4.791e+02 8.139e+02, threshold=7.666e+02, percent-clipped=8.0 2023-06-20 07:14:04,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=657204.0, ans=0.125 2023-06-20 07:14:44,386 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:14:52,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=657324.0, ans=0.1 2023-06-20 07:15:26,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=657444.0, ans=0.125 2023-06-20 07:15:36,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=12.0 2023-06-20 07:15:41,208 INFO [train.py:996] (1/4) Epoch 4, batch 18100, loss[loss=0.2853, simple_loss=0.3421, pruned_loss=0.1142, over 21900.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3349, pruned_loss=0.09787, over 4255098.45 frames. 
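Note on the validation block above (batch 18000): training pauses, a pass over the held-out cuts produces the "validation:" loss line, the peak GPU memory so far is reported, and a per-module diagnostic (the attention-weight entropy tensor) is printed along the way. A compact sketch of that pattern, periodic validation plus peak-memory reporting, using standard PyTorch calls; the model, dataloader, loss function, and validation interval are placeholders, and the entropy helper is one conventional way to compute such a diagnostic.

import torch

@torch.no_grad()
def run_validation(model, valid_loader, compute_loss, device="cuda:1"):
    model.eval()
    total, frames = 0.0, 0
    for batch in valid_loader:
        loss, num_frames = compute_loss(model, batch)   # placeholder loss function
        total += loss.item() * num_frames
        frames += num_frames
    model.train()
    return total / max(frames, 1)

def maybe_validate(model, valid_loader, compute_loss, batch_idx, valid_interval):
    if batch_idx % valid_interval == 0:
        val_loss = run_validation(model, valid_loader, compute_loss)
        peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"validation: loss={val_loss:.4f}")
        print(f"Maximum memory allocated so far is {peak_mb}MB")

def attention_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: (..., num_keys) with rows summing to 1; mean entropy in nats.
    return -(attn_weights * (attn_weights + 1e-20).log()).sum(dim=-1).mean()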
], batch size: 98, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:16:23,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=657624.0, ans=0.0 2023-06-20 07:16:33,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=657624.0, ans=0.0 2023-06-20 07:17:26,417 INFO [train.py:996] (1/4) Epoch 4, batch 18150, loss[loss=0.2493, simple_loss=0.3034, pruned_loss=0.09758, over 21217.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3347, pruned_loss=0.09708, over 4247014.10 frames. ], batch size: 176, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:17:31,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.784e+02 3.397e+02 4.397e+02 8.554e+02, threshold=6.794e+02, percent-clipped=1.0 2023-06-20 07:17:40,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-20 07:17:41,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=657864.0, ans=0.0 2023-06-20 07:17:49,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-20 07:17:54,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=657864.0, ans=0.125 2023-06-20 07:18:47,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=658044.0, ans=0.0 2023-06-20 07:19:02,640 INFO [train.py:996] (1/4) Epoch 4, batch 18200, loss[loss=0.2411, simple_loss=0.3083, pruned_loss=0.08692, over 21612.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3302, pruned_loss=0.09807, over 4241820.40 frames. ], batch size: 415, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:19:12,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=658104.0, ans=0.125 2023-06-20 07:19:50,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=658224.0, ans=0.125 2023-06-20 07:19:55,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=658284.0, ans=0.0 2023-06-20 07:20:21,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=658344.0, ans=0.05 2023-06-20 07:20:31,243 INFO [train.py:996] (1/4) Epoch 4, batch 18250, loss[loss=0.1931, simple_loss=0.2598, pruned_loss=0.06323, over 21375.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3201, pruned_loss=0.09369, over 4234156.63 frames. 
], batch size: 160, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:20:41,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.684e+02 3.179e+02 3.917e+02 6.277e+02, threshold=6.359e+02, percent-clipped=0.0 2023-06-20 07:21:50,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=658644.0, ans=0.0 2023-06-20 07:21:53,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=658644.0, ans=0.0 2023-06-20 07:21:54,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-20 07:22:02,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=658644.0, ans=10.0 2023-06-20 07:22:03,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=658644.0, ans=0.5 2023-06-20 07:22:08,069 INFO [train.py:996] (1/4) Epoch 4, batch 18300, loss[loss=0.2607, simple_loss=0.3535, pruned_loss=0.08391, over 21381.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3217, pruned_loss=0.0942, over 4242870.17 frames. ], batch size: 194, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:23:49,843 INFO [train.py:996] (1/4) Epoch 4, batch 18350, loss[loss=0.3384, simple_loss=0.4183, pruned_loss=0.1292, over 21651.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3286, pruned_loss=0.0946, over 4241067.89 frames. ], batch size: 441, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 07:24:00,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.835e+02 3.516e+02 4.714e+02 7.993e+02, threshold=7.032e+02, percent-clipped=4.0 2023-06-20 07:24:16,215 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:24:17,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=659004.0, ans=0.125 2023-06-20 07:24:57,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-20 07:25:25,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=22.5 2023-06-20 07:25:38,221 INFO [train.py:996] (1/4) Epoch 4, batch 18400, loss[loss=0.212, simple_loss=0.2761, pruned_loss=0.07395, over 21175.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3232, pruned_loss=0.09298, over 4245635.81 frames. ], batch size: 143, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:25:50,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=659304.0, ans=0.2 2023-06-20 07:26:00,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=659304.0, ans=0.5 2023-06-20 07:26:16,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=659364.0, ans=10.0 2023-06-20 07:26:40,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. 
limit=22.5 2023-06-20 07:27:27,665 INFO [train.py:996] (1/4) Epoch 4, batch 18450, loss[loss=0.2008, simple_loss=0.2921, pruned_loss=0.05479, over 21774.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3196, pruned_loss=0.08807, over 4256104.65 frames. ], batch size: 352, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:27:28,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=659604.0, ans=0.0 2023-06-20 07:27:37,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.607e+02 3.111e+02 4.080e+02 7.142e+02, threshold=6.222e+02, percent-clipped=1.0 2023-06-20 07:28:10,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=659724.0, ans=0.125 2023-06-20 07:28:10,274 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:28:22,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=659724.0, ans=0.125 2023-06-20 07:28:28,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-20 07:28:50,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=659844.0, ans=0.125 2023-06-20 07:29:09,377 INFO [train.py:996] (1/4) Epoch 4, batch 18500, loss[loss=0.2449, simple_loss=0.3217, pruned_loss=0.08409, over 21752.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.315, pruned_loss=0.08724, over 4257367.67 frames. ], batch size: 351, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:29:09,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=659904.0, ans=0.0 2023-06-20 07:29:26,682 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-20 07:30:07,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-20 07:30:27,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=660144.0, ans=0.125 2023-06-20 07:30:51,288 INFO [train.py:996] (1/4) Epoch 4, batch 18550, loss[loss=0.2585, simple_loss=0.3516, pruned_loss=0.08264, over 19979.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3138, pruned_loss=0.08669, over 4243835.37 frames. 
], batch size: 702, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:30:58,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=660204.0, ans=0.125 2023-06-20 07:31:01,298 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.666e+02 3.227e+02 3.857e+02 6.093e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-20 07:31:31,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=660264.0, ans=0.125 2023-06-20 07:31:36,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=660324.0, ans=0.125 2023-06-20 07:31:52,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660384.0, ans=0.1 2023-06-20 07:32:34,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=660444.0, ans=0.2 2023-06-20 07:32:39,728 INFO [train.py:996] (1/4) Epoch 4, batch 18600, loss[loss=0.2614, simple_loss=0.3242, pruned_loss=0.09928, over 21493.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3122, pruned_loss=0.08755, over 4246455.80 frames. ], batch size: 212, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:32:57,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=660504.0, ans=0.125 2023-06-20 07:33:01,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=660564.0, ans=0.125 2023-06-20 07:33:17,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=660624.0, ans=0.0 2023-06-20 07:33:23,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660624.0, ans=0.1 2023-06-20 07:33:28,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=660624.0, ans=0.125 2023-06-20 07:33:38,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=660684.0, ans=0.04949747468305833 2023-06-20 07:34:02,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-20 07:34:10,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-20 07:34:16,723 INFO [train.py:996] (1/4) Epoch 4, batch 18650, loss[loss=0.2685, simple_loss=0.3176, pruned_loss=0.1097, over 21875.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3109, pruned_loss=0.08738, over 4234262.46 frames. 
], batch size: 98, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:34:26,810 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.903e+02 3.307e+02 4.145e+02 6.218e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-20 07:34:28,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=660804.0, ans=0.2 2023-06-20 07:34:41,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=660864.0, ans=0.125 2023-06-20 07:34:44,943 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:35:47,871 INFO [train.py:996] (1/4) Epoch 4, batch 18700, loss[loss=0.2966, simple_loss=0.3467, pruned_loss=0.1233, over 21840.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3091, pruned_loss=0.08847, over 4248948.59 frames. ], batch size: 414, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:36:18,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=661164.0, ans=0.125 2023-06-20 07:37:19,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-20 07:37:30,151 INFO [train.py:996] (1/4) Epoch 4, batch 18750, loss[loss=0.2476, simple_loss=0.2912, pruned_loss=0.102, over 20737.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3123, pruned_loss=0.09225, over 4254557.00 frames. ], batch size: 608, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:37:45,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.659e+02 3.125e+02 3.916e+02 7.035e+02, threshold=6.249e+02, percent-clipped=1.0 2023-06-20 07:37:55,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=661464.0, ans=0.0 2023-06-20 07:38:00,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=661464.0, ans=0.125 2023-06-20 07:38:22,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=661524.0, ans=0.125 2023-06-20 07:38:35,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=661584.0, ans=0.1 2023-06-20 07:39:06,556 INFO [train.py:996] (1/4) Epoch 4, batch 18800, loss[loss=0.1888, simple_loss=0.2668, pruned_loss=0.0554, over 21385.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3173, pruned_loss=0.09314, over 4249822.25 frames. ], batch size: 131, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:39:57,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=661824.0, ans=0.125 2023-06-20 07:40:31,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661944.0, ans=0.1 2023-06-20 07:40:55,488 INFO [train.py:996] (1/4) Epoch 4, batch 18850, loss[loss=0.2032, simple_loss=0.2764, pruned_loss=0.065, over 21626.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3153, pruned_loss=0.08884, over 4246004.08 frames. 
], batch size: 247, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:41:00,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.538e+02 3.009e+02 3.652e+02 6.341e+02, threshold=6.019e+02, percent-clipped=1.0 2023-06-20 07:41:52,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-20 07:42:32,244 INFO [train.py:996] (1/4) Epoch 4, batch 18900, loss[loss=0.2629, simple_loss=0.3143, pruned_loss=0.1057, over 21661.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3121, pruned_loss=0.08918, over 4245381.43 frames. ], batch size: 298, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:43:51,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=662544.0, ans=0.0 2023-06-20 07:44:03,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=662544.0, ans=0.125 2023-06-20 07:44:10,277 INFO [train.py:996] (1/4) Epoch 4, batch 18950, loss[loss=0.272, simple_loss=0.3712, pruned_loss=0.08638, over 21606.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3129, pruned_loss=0.09153, over 4257839.13 frames. ], batch size: 389, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 07:44:25,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.815e+02 3.147e+02 3.726e+02 6.285e+02, threshold=6.294e+02, percent-clipped=0.0 2023-06-20 07:44:50,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=662664.0, ans=0.035 2023-06-20 07:44:50,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662664.0, ans=0.1 2023-06-20 07:44:54,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=662724.0, ans=0.2 2023-06-20 07:45:06,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=662724.0, ans=0.1 2023-06-20 07:45:09,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=662784.0, ans=0.035 2023-06-20 07:45:18,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=662784.0, ans=22.5 2023-06-20 07:45:18,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=22.5 2023-06-20 07:45:33,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=662844.0, ans=0.0 2023-06-20 07:46:03,986 INFO [train.py:996] (1/4) Epoch 4, batch 19000, loss[loss=0.2931, simple_loss=0.3517, pruned_loss=0.1172, over 21278.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3234, pruned_loss=0.09411, over 4259999.16 frames. ], batch size: 143, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:46:09,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=662904.0, ans=0.0 2023-06-20 07:46:40,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.00 vs. 
limit=15.0 2023-06-20 07:46:44,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-20 07:47:02,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-20 07:47:31,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=663144.0, ans=0.125 2023-06-20 07:47:38,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=663144.0, ans=0.125 2023-06-20 07:47:46,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=663204.0, ans=0.035 2023-06-20 07:47:47,392 INFO [train.py:996] (1/4) Epoch 4, batch 19050, loss[loss=0.2557, simple_loss=0.3222, pruned_loss=0.09457, over 21829.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3287, pruned_loss=0.09891, over 4270333.70 frames. ], batch size: 332, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:47:53,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.224e+02 3.773e+02 4.391e+02 1.056e+03, threshold=7.547e+02, percent-clipped=6.0 2023-06-20 07:48:03,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=663264.0, ans=0.05 2023-06-20 07:48:47,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=663384.0, ans=0.125 2023-06-20 07:49:09,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=663444.0, ans=0.125 2023-06-20 07:49:28,667 INFO [train.py:996] (1/4) Epoch 4, batch 19100, loss[loss=0.2577, simple_loss=0.3136, pruned_loss=0.1009, over 21691.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3269, pruned_loss=0.09963, over 4267147.95 frames. ], batch size: 332, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:49:31,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-20 07:49:46,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=663564.0, ans=0.0 2023-06-20 07:49:52,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-20 07:50:00,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=663564.0, ans=0.2 2023-06-20 07:51:06,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663744.0, ans=0.1 2023-06-20 07:51:14,832 INFO [train.py:996] (1/4) Epoch 4, batch 19150, loss[loss=0.2482, simple_loss=0.324, pruned_loss=0.08623, over 21258.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3302, pruned_loss=0.1, over 4266208.91 frames. 
], batch size: 159, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:51:21,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.031e+02 3.382e+02 4.089e+02 6.377e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-20 07:51:23,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=663804.0, ans=0.1 2023-06-20 07:51:35,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=663864.0, ans=0.05 2023-06-20 07:51:48,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=663924.0, ans=0.0 2023-06-20 07:52:39,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-20 07:52:57,816 INFO [train.py:996] (1/4) Epoch 4, batch 19200, loss[loss=0.2552, simple_loss=0.3606, pruned_loss=0.07493, over 21596.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3387, pruned_loss=0.09998, over 4271791.46 frames. ], batch size: 230, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 07:53:08,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=664104.0, ans=0.125 2023-06-20 07:53:25,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-20 07:53:28,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664164.0, ans=0.1 2023-06-20 07:53:30,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=664224.0, ans=0.0 2023-06-20 07:53:31,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=664224.0, ans=0.125 2023-06-20 07:54:21,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-20 07:54:31,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=664344.0, ans=0.04949747468305833 2023-06-20 07:54:37,022 INFO [train.py:996] (1/4) Epoch 4, batch 19250, loss[loss=0.1762, simple_loss=0.277, pruned_loss=0.0377, over 21656.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3379, pruned_loss=0.09389, over 4265265.87 frames. ], batch size: 263, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:54:44,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.530e+02 3.089e+02 4.105e+02 6.871e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-20 07:55:33,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=664524.0, ans=0.2 2023-06-20 07:55:53,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=664584.0, ans=0.0 2023-06-20 07:55:54,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. 
limit=6.0 2023-06-20 07:56:00,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-20 07:56:19,716 INFO [train.py:996] (1/4) Epoch 4, batch 19300, loss[loss=0.2382, simple_loss=0.2971, pruned_loss=0.08964, over 21868.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3346, pruned_loss=0.0924, over 4266317.80 frames. ], batch size: 124, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:56:48,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=664764.0, ans=10.0 2023-06-20 07:56:50,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=664764.0, ans=0.125 2023-06-20 07:58:04,192 INFO [train.py:996] (1/4) Epoch 4, batch 19350, loss[loss=0.202, simple_loss=0.2927, pruned_loss=0.05569, over 21709.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.327, pruned_loss=0.08818, over 4264130.62 frames. ], batch size: 298, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:58:12,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.563e+02 3.117e+02 3.921e+02 9.439e+02, threshold=6.235e+02, percent-clipped=6.0 2023-06-20 07:59:46,534 INFO [train.py:996] (1/4) Epoch 4, batch 19400, loss[loss=0.2231, simple_loss=0.2884, pruned_loss=0.07888, over 21694.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3236, pruned_loss=0.08696, over 4276075.72 frames. ], batch size: 230, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:59:58,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=665304.0, ans=0.0 2023-06-20 08:00:06,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=665364.0, ans=0.2 2023-06-20 08:00:39,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=665424.0, ans=0.1 2023-06-20 08:01:07,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=665544.0, ans=0.2 2023-06-20 08:01:09,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=665544.0, ans=0.1 2023-06-20 08:01:23,613 INFO [train.py:996] (1/4) Epoch 4, batch 19450, loss[loss=0.2347, simple_loss=0.3131, pruned_loss=0.0781, over 20041.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3222, pruned_loss=0.08947, over 4280948.34 frames. ], batch size: 702, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:01:31,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.618e+02 3.250e+02 3.891e+02 5.569e+02, threshold=6.499e+02, percent-clipped=0.0 2023-06-20 08:01:35,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-20 08:01:44,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=665664.0, ans=0.2 2023-06-20 08:02:46,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=665784.0, ans=0.125 2023-06-20 08:03:06,364 INFO [train.py:996] (1/4) Epoch 4, batch 19500, loss[loss=0.2535, simple_loss=0.3222, pruned_loss=0.09244, over 21516.00 frames. 
], tot_loss[loss=0.2506, simple_loss=0.3191, pruned_loss=0.09108, over 4270648.99 frames. ], batch size: 389, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:03:22,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-20 08:04:33,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=666144.0, ans=0.0 2023-06-20 08:04:45,106 INFO [train.py:996] (1/4) Epoch 4, batch 19550, loss[loss=0.2032, simple_loss=0.2937, pruned_loss=0.05636, over 21381.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3147, pruned_loss=0.08921, over 4271973.20 frames. ], batch size: 194, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:04:53,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 2.884e+02 3.306e+02 4.120e+02 1.024e+03, threshold=6.612e+02, percent-clipped=7.0 2023-06-20 08:04:54,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=666204.0, ans=0.95 2023-06-20 08:04:55,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=666204.0, ans=0.0 2023-06-20 08:05:31,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=666324.0, ans=0.125 2023-06-20 08:06:29,090 INFO [train.py:996] (1/4) Epoch 4, batch 19600, loss[loss=0.256, simple_loss=0.3122, pruned_loss=0.09988, over 21834.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3161, pruned_loss=0.08989, over 4273700.27 frames. ], batch size: 298, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:07:23,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=666624.0, ans=0.2 2023-06-20 08:07:30,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=666624.0, ans=10.0 2023-06-20 08:08:08,655 INFO [train.py:996] (1/4) Epoch 4, batch 19650, loss[loss=0.3129, simple_loss=0.3707, pruned_loss=0.1275, over 21708.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3219, pruned_loss=0.09492, over 4279519.44 frames. ], batch size: 351, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:08:16,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666804.0, ans=0.125 2023-06-20 08:08:17,365 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.087e+02 3.531e+02 4.155e+02 7.951e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-20 08:09:04,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=666924.0, ans=0.025 2023-06-20 08:09:05,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=666924.0, ans=0.0 2023-06-20 08:09:14,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666924.0, ans=0.125 2023-06-20 08:09:32,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. 
limit=6.0 2023-06-20 08:09:37,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667044.0, ans=0.1 2023-06-20 08:09:38,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=667044.0, ans=0.1 2023-06-20 08:10:00,463 INFO [train.py:996] (1/4) Epoch 4, batch 19700, loss[loss=0.2235, simple_loss=0.3045, pruned_loss=0.07124, over 21618.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3252, pruned_loss=0.09544, over 4283861.56 frames. ], batch size: 263, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:10:53,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=667224.0, ans=0.125 2023-06-20 08:11:00,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=667284.0, ans=0.125 2023-06-20 08:11:02,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-20 08:11:10,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=667284.0, ans=0.0 2023-06-20 08:11:30,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=667344.0, ans=0.125 2023-06-20 08:11:42,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=667344.0, ans=0.04949747468305833 2023-06-20 08:11:45,273 INFO [train.py:996] (1/4) Epoch 4, batch 19750, loss[loss=0.2411, simple_loss=0.3037, pruned_loss=0.0893, over 21848.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.336, pruned_loss=0.09714, over 4284483.80 frames. ], batch size: 118, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:11:56,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=667404.0, ans=0.125 2023-06-20 08:11:58,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-20 08:12:04,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.193e+02 3.734e+02 5.091e+02 8.572e+02, threshold=7.467e+02, percent-clipped=4.0 2023-06-20 08:12:29,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-20 08:12:46,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=667584.0, ans=0.2 2023-06-20 08:13:33,726 INFO [train.py:996] (1/4) Epoch 4, batch 19800, loss[loss=0.2445, simple_loss=0.3084, pruned_loss=0.09029, over 21776.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3367, pruned_loss=0.09853, over 4292127.70 frames. ], batch size: 282, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:13:34,078 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:14:00,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.88 vs. 
limit=15.0 2023-06-20 08:14:26,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=667824.0, ans=0.0 2023-06-20 08:14:33,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=667884.0, ans=0.125 2023-06-20 08:14:35,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=667884.0, ans=0.2 2023-06-20 08:15:22,727 INFO [train.py:996] (1/4) Epoch 4, batch 19850, loss[loss=0.1969, simple_loss=0.2755, pruned_loss=0.05912, over 21182.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3298, pruned_loss=0.09298, over 4292806.12 frames. ], batch size: 176, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:15:23,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=668004.0, ans=0.0 2023-06-20 08:15:30,727 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.674e+02 3.222e+02 4.278e+02 7.795e+02, threshold=6.444e+02, percent-clipped=1.0 2023-06-20 08:15:46,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=668064.0, ans=0.125 2023-06-20 08:16:02,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=668124.0, ans=0.0 2023-06-20 08:16:04,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-20 08:16:55,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=668244.0, ans=0.0 2023-06-20 08:16:59,861 INFO [train.py:996] (1/4) Epoch 4, batch 19900, loss[loss=0.2396, simple_loss=0.3084, pruned_loss=0.08543, over 21545.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3291, pruned_loss=0.09013, over 4294882.32 frames. ], batch size: 195, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:17:22,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=668364.0, ans=0.2 2023-06-20 08:17:30,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=668364.0, ans=0.125 2023-06-20 08:17:30,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-20 08:17:38,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=668424.0, ans=0.1 2023-06-20 08:18:47,715 INFO [train.py:996] (1/4) Epoch 4, batch 19950, loss[loss=0.2404, simple_loss=0.3042, pruned_loss=0.08827, over 21689.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3248, pruned_loss=0.0905, over 4279005.03 frames. 
], batch size: 333, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:18:56,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.748e+02 3.104e+02 3.905e+02 6.692e+02, threshold=6.208e+02, percent-clipped=1.0 2023-06-20 08:19:39,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=668724.0, ans=0.0 2023-06-20 08:19:53,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=668784.0, ans=0.125 2023-06-20 08:20:26,094 INFO [train.py:996] (1/4) Epoch 4, batch 20000, loss[loss=0.2697, simple_loss=0.3293, pruned_loss=0.1051, over 21256.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.325, pruned_loss=0.0903, over 4275149.74 frames. ], batch size: 143, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:21:43,092 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.07 vs. limit=6.0 2023-06-20 08:21:52,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=669144.0, ans=0.0 2023-06-20 08:21:55,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-20 08:22:13,381 INFO [train.py:996] (1/4) Epoch 4, batch 20050, loss[loss=0.2547, simple_loss=0.319, pruned_loss=0.09522, over 21284.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3265, pruned_loss=0.09339, over 4277582.66 frames. ], batch size: 176, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:22:21,024 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.833e+02 3.313e+02 4.048e+02 6.603e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-20 08:23:40,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 08:23:57,176 INFO [train.py:996] (1/4) Epoch 4, batch 20100, loss[loss=0.2649, simple_loss=0.3685, pruned_loss=0.08064, over 20980.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3293, pruned_loss=0.0964, over 4282438.43 frames. ], batch size: 607, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:24:36,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=669624.0, ans=0.0 2023-06-20 08:25:42,875 INFO [train.py:996] (1/4) Epoch 4, batch 20150, loss[loss=0.3388, simple_loss=0.3911, pruned_loss=0.1433, over 21812.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.34, pruned_loss=0.1006, over 4278871.25 frames. 
], batch size: 441, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:25:47,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=669804.0, ans=0.035 2023-06-20 08:25:53,564 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.395e+02 3.883e+02 5.021e+02 8.143e+02, threshold=7.766e+02, percent-clipped=4.0 2023-06-20 08:27:13,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=670044.0, ans=0.125 2023-06-20 08:27:23,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=670044.0, ans=0.125 2023-06-20 08:27:28,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=670104.0, ans=0.0 2023-06-20 08:27:29,935 INFO [train.py:996] (1/4) Epoch 4, batch 20200, loss[loss=0.2806, simple_loss=0.3539, pruned_loss=0.1036, over 21649.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3456, pruned_loss=0.1041, over 4280440.90 frames. ], batch size: 263, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:29:18,097 INFO [train.py:996] (1/4) Epoch 4, batch 20250, loss[loss=0.2641, simple_loss=0.3475, pruned_loss=0.09036, over 21003.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3457, pruned_loss=0.1015, over 4286667.03 frames. ], batch size: 607, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:29:33,195 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.002e+02 3.510e+02 4.411e+02 6.052e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-20 08:29:53,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.07 vs. limit=15.0 2023-06-20 08:30:12,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=670524.0, ans=0.07 2023-06-20 08:30:48,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=670644.0, ans=0.035 2023-06-20 08:30:48,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=670644.0, ans=0.125 2023-06-20 08:30:56,375 INFO [train.py:996] (1/4) Epoch 4, batch 20300, loss[loss=0.2098, simple_loss=0.2861, pruned_loss=0.06679, over 21876.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3419, pruned_loss=0.09754, over 4279522.08 frames. ], batch size: 98, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:31:18,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=670764.0, ans=0.2 2023-06-20 08:31:20,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=670764.0, ans=0.0 2023-06-20 08:31:33,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.65 vs. 
limit=15.0 2023-06-20 08:31:53,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=670824.0, ans=0.1 2023-06-20 08:32:08,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=670884.0, ans=0.0 2023-06-20 08:32:14,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=670944.0, ans=0.125 2023-06-20 08:32:35,103 INFO [train.py:996] (1/4) Epoch 4, batch 20350, loss[loss=0.3226, simple_loss=0.379, pruned_loss=0.1331, over 21756.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3408, pruned_loss=0.09687, over 4258118.00 frames. ], batch size: 441, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:32:49,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.818e+02 3.132e+02 3.924e+02 7.054e+02, threshold=6.264e+02, percent-clipped=1.0 2023-06-20 08:33:19,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=671124.0, ans=0.0 2023-06-20 08:34:21,808 INFO [train.py:996] (1/4) Epoch 4, batch 20400, loss[loss=0.3164, simple_loss=0.3774, pruned_loss=0.1277, over 21821.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3442, pruned_loss=0.1006, over 4246499.79 frames. ], batch size: 112, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 08:35:15,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671424.0, ans=0.1 2023-06-20 08:35:44,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-20 08:36:04,133 INFO [train.py:996] (1/4) Epoch 4, batch 20450, loss[loss=0.2916, simple_loss=0.344, pruned_loss=0.1196, over 21965.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.346, pruned_loss=0.1035, over 4249947.94 frames. ], batch size: 113, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:36:14,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=671604.0, ans=0.0 2023-06-20 08:36:20,594 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.947e+02 3.456e+02 4.255e+02 7.158e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-20 08:36:25,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=671664.0, ans=0.0 2023-06-20 08:36:32,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=671664.0, ans=0.125 2023-06-20 08:36:40,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=671724.0, ans=0.0 2023-06-20 08:36:40,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. 
limit=15.0 2023-06-20 08:37:16,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=671784.0, ans=0.125 2023-06-20 08:37:40,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671844.0, ans=0.1 2023-06-20 08:37:44,292 INFO [train.py:996] (1/4) Epoch 4, batch 20500, loss[loss=0.2503, simple_loss=0.2979, pruned_loss=0.1014, over 20991.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3403, pruned_loss=0.1036, over 4242098.39 frames. ], batch size: 608, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:38:11,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=671964.0, ans=0.0 2023-06-20 08:38:37,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=672024.0, ans=0.125 2023-06-20 08:38:41,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-20 08:38:42,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=672024.0, ans=0.5 2023-06-20 08:38:46,718 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 08:39:00,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=672144.0, ans=0.04949747468305833 2023-06-20 08:39:17,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=672144.0, ans=0.125 2023-06-20 08:39:28,487 INFO [train.py:996] (1/4) Epoch 4, batch 20550, loss[loss=0.2142, simple_loss=0.2845, pruned_loss=0.07201, over 17446.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3333, pruned_loss=0.1025, over 4250878.55 frames. ], batch size: 64, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:39:44,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=672204.0, ans=0.0 2023-06-20 08:39:45,982 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.765e+02 3.154e+02 3.666e+02 5.388e+02, threshold=6.309e+02, percent-clipped=0.0 2023-06-20 08:39:51,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=672264.0, ans=0.0 2023-06-20 08:40:18,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=672324.0, ans=10.0 2023-06-20 08:40:46,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=672444.0, ans=0.04949747468305833 2023-06-20 08:41:12,471 INFO [train.py:996] (1/4) Epoch 4, batch 20600, loss[loss=0.2769, simple_loss=0.3434, pruned_loss=0.1052, over 21654.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3362, pruned_loss=0.1003, over 4249753.81 frames. 
], batch size: 263, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:42:11,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=672684.0, ans=0.125 2023-06-20 08:42:49,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=672744.0, ans=0.125 2023-06-20 08:42:52,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672744.0, ans=0.1 2023-06-20 08:42:55,703 INFO [train.py:996] (1/4) Epoch 4, batch 20650, loss[loss=0.2292, simple_loss=0.2883, pruned_loss=0.08503, over 21452.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3324, pruned_loss=0.101, over 4249748.39 frames. ], batch size: 195, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:42:57,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672804.0, ans=0.1 2023-06-20 08:42:57,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=672804.0, ans=0.2 2023-06-20 08:43:11,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-20 08:43:12,511 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.784e+02 3.361e+02 3.771e+02 8.301e+02, threshold=6.721e+02, percent-clipped=1.0 2023-06-20 08:43:19,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672864.0, ans=0.0 2023-06-20 08:43:45,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=672924.0, ans=0.1 2023-06-20 08:43:46,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672924.0, ans=0.1 2023-06-20 08:43:54,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.39 vs. limit=22.5 2023-06-20 08:44:20,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=673044.0, ans=0.2 2023-06-20 08:44:45,617 INFO [train.py:996] (1/4) Epoch 4, batch 20700, loss[loss=0.2023, simple_loss=0.2663, pruned_loss=0.06917, over 21732.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3232, pruned_loss=0.09669, over 4249178.54 frames. ], batch size: 124, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:45:03,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=673164.0, ans=0.0 2023-06-20 08:45:46,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=673284.0, ans=0.0 2023-06-20 08:45:53,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=673284.0, ans=0.0 2023-06-20 08:46:28,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=673344.0, ans=0.0 2023-06-20 08:46:31,822 INFO [train.py:996] (1/4) Epoch 4, batch 20750, loss[loss=0.3901, simple_loss=0.4689, pruned_loss=0.1557, over 21540.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3251, pruned_loss=0.0951, over 4253704.37 frames. 
], batch size: 471, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:46:32,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=673404.0, ans=0.125 2023-06-20 08:46:48,691 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.639e+02 3.184e+02 4.013e+02 6.063e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-20 08:47:59,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=673644.0, ans=0.125 2023-06-20 08:48:15,621 INFO [train.py:996] (1/4) Epoch 4, batch 20800, loss[loss=0.3087, simple_loss=0.3441, pruned_loss=0.1367, over 21358.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3274, pruned_loss=0.09587, over 4256289.67 frames. ], batch size: 473, lr: 7.77e-03, grad_scale: 32.0 2023-06-20 08:48:52,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-20 08:49:57,065 INFO [train.py:996] (1/4) Epoch 4, batch 20850, loss[loss=0.2335, simple_loss=0.2998, pruned_loss=0.08355, over 21697.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3205, pruned_loss=0.09387, over 4257988.83 frames. ], batch size: 414, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:49:57,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=674004.0, ans=0.05 2023-06-20 08:50:15,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.828e+02 3.362e+02 3.995e+02 7.673e+02, threshold=6.724e+02, percent-clipped=2.0 2023-06-20 08:50:34,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-20 08:51:08,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=674184.0, ans=0.2 2023-06-20 08:51:10,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=674184.0, ans=0.125 2023-06-20 08:51:39,469 INFO [train.py:996] (1/4) Epoch 4, batch 20900, loss[loss=0.2474, simple_loss=0.308, pruned_loss=0.09345, over 21577.00 frames. ], tot_loss[loss=0.258, simple_loss=0.324, pruned_loss=0.09599, over 4262608.47 frames. ], batch size: 548, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:51:41,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=674304.0, ans=0.025 2023-06-20 08:52:01,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.19 vs. 
limit=15.0 2023-06-20 08:52:19,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674424.0, ans=0.1 2023-06-20 08:52:28,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=674424.0, ans=0.2 2023-06-20 08:52:51,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=674484.0, ans=0.0 2023-06-20 08:52:53,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=674484.0, ans=0.125 2023-06-20 08:53:16,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=674544.0, ans=0.1 2023-06-20 08:53:21,026 INFO [train.py:996] (1/4) Epoch 4, batch 20950, loss[loss=0.2536, simple_loss=0.3185, pruned_loss=0.09435, over 21828.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3191, pruned_loss=0.0922, over 4266860.75 frames. ], batch size: 102, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:53:33,580 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.815e+02 3.248e+02 3.950e+02 6.519e+02, threshold=6.496e+02, percent-clipped=0.0 2023-06-20 08:53:33,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=674604.0, ans=0.125 2023-06-20 08:53:45,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=674664.0, ans=0.0 2023-06-20 08:54:44,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=674844.0, ans=0.0 2023-06-20 08:54:56,995 INFO [train.py:996] (1/4) Epoch 4, batch 21000, loss[loss=0.2506, simple_loss=0.3202, pruned_loss=0.09052, over 21896.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3174, pruned_loss=0.09225, over 4268687.06 frames. ], batch size: 124, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:54:56,996 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 08:55:14,685 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2759, simple_loss=0.3744, pruned_loss=0.08874, over 1796401.00 frames. 2023-06-20 08:55:14,685 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 08:56:51,309 INFO [train.py:996] (1/4) Epoch 4, batch 21050, loss[loss=0.2499, simple_loss=0.3168, pruned_loss=0.09145, over 15826.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3162, pruned_loss=0.09359, over 4267482.11 frames. ], batch size: 61, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:56:56,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=22.5 2023-06-20 08:56:59,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=675204.0, ans=0.0 2023-06-20 08:56:59,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=675204.0, ans=0.2 2023-06-20 08:57:04,073 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.634e+02 3.115e+02 4.220e+02 7.961e+02, threshold=6.229e+02, percent-clipped=3.0 2023-06-20 08:57:04,377 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:57:11,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=675264.0, ans=0.0 2023-06-20 08:57:51,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=675384.0, ans=0.0 2023-06-20 08:57:59,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=675384.0, ans=0.1 2023-06-20 08:58:02,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=675384.0, ans=0.125 2023-06-20 08:58:21,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=675444.0, ans=0.1 2023-06-20 08:58:22,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=675444.0, ans=0.125 2023-06-20 08:58:26,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=675444.0, ans=0.125 2023-06-20 08:58:31,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=675444.0, ans=0.0 2023-06-20 08:58:33,731 INFO [train.py:996] (1/4) Epoch 4, batch 21100, loss[loss=0.2997, simple_loss=0.3278, pruned_loss=0.1358, over 21335.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3131, pruned_loss=0.09269, over 4260792.04 frames. ], batch size: 508, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:58:47,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=675504.0, ans=0.2 2023-06-20 08:59:15,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=675624.0, ans=0.1 2023-06-20 09:00:16,561 INFO [train.py:996] (1/4) Epoch 4, batch 21150, loss[loss=0.2272, simple_loss=0.2839, pruned_loss=0.08525, over 21158.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3105, pruned_loss=0.0939, over 4266170.82 frames. 
], batch size: 176, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 09:00:29,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.738e+02 3.201e+02 4.018e+02 7.456e+02, threshold=6.402e+02, percent-clipped=2.0 2023-06-20 09:01:28,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=675984.0, ans=0.125 2023-06-20 09:01:35,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=675984.0, ans=0.07 2023-06-20 09:01:43,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=676044.0, ans=0.0 2023-06-20 09:01:51,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=676044.0, ans=0.0 2023-06-20 09:01:59,514 INFO [train.py:996] (1/4) Epoch 4, batch 21200, loss[loss=0.2392, simple_loss=0.2854, pruned_loss=0.09646, over 21466.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3066, pruned_loss=0.09362, over 4257204.25 frames. ], batch size: 212, lr: 7.76e-03, grad_scale: 32.0 2023-06-20 09:02:03,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676104.0, ans=0.1 2023-06-20 09:02:26,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=676164.0, ans=0.0 2023-06-20 09:02:31,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=676164.0, ans=0.2 2023-06-20 09:03:03,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=676284.0, ans=0.0 2023-06-20 09:03:17,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=676284.0, ans=0.1 2023-06-20 09:03:44,320 INFO [train.py:996] (1/4) Epoch 4, batch 21250, loss[loss=0.3373, simple_loss=0.3937, pruned_loss=0.1405, over 21638.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3058, pruned_loss=0.09326, over 4243997.60 frames. ], batch size: 415, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:04:02,650 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.827e+02 3.407e+02 4.227e+02 7.586e+02, threshold=6.813e+02, percent-clipped=3.0 2023-06-20 09:04:12,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=676464.0, ans=0.125 2023-06-20 09:04:17,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=676464.0, ans=0.2 2023-06-20 09:04:38,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. 
limit=12.0 2023-06-20 09:04:48,623 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:04:51,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=676584.0, ans=0.0 2023-06-20 09:05:01,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=676584.0, ans=0.0 2023-06-20 09:05:16,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=676644.0, ans=0.0 2023-06-20 09:05:21,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=676644.0, ans=0.125 2023-06-20 09:05:27,285 INFO [train.py:996] (1/4) Epoch 4, batch 21300, loss[loss=0.2668, simple_loss=0.3206, pruned_loss=0.1065, over 21571.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3119, pruned_loss=0.09572, over 4257038.65 frames. ], batch size: 548, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:06:31,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=676884.0, ans=0.025 2023-06-20 09:06:41,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=676884.0, ans=0.125 2023-06-20 09:07:11,221 INFO [train.py:996] (1/4) Epoch 4, batch 21350, loss[loss=0.1653, simple_loss=0.2243, pruned_loss=0.05314, over 16153.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3148, pruned_loss=0.09549, over 4257148.22 frames. ], batch size: 60, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:07:12,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=677004.0, ans=0.0 2023-06-20 09:07:29,382 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.939e+02 3.451e+02 4.084e+02 6.160e+02, threshold=6.901e+02, percent-clipped=0.0 2023-06-20 09:07:35,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=677064.0, ans=0.0 2023-06-20 09:07:56,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=677064.0, ans=0.125 2023-06-20 09:08:00,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677124.0, ans=0.1 2023-06-20 09:08:00,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=677124.0, ans=0.125 2023-06-20 09:08:06,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-20 09:08:28,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=677184.0, ans=0.125 2023-06-20 09:08:54,424 INFO [train.py:996] (1/4) Epoch 4, batch 21400, loss[loss=0.2755, simple_loss=0.3524, pruned_loss=0.09927, over 21289.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3183, pruned_loss=0.09479, over 4265053.07 frames. 
], batch size: 548, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:09:19,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=677364.0, ans=0.1 2023-06-20 09:09:36,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677424.0, ans=0.1 2023-06-20 09:09:41,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=677424.0, ans=0.0 2023-06-20 09:10:06,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-20 09:10:10,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=677484.0, ans=0.0 2023-06-20 09:10:27,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=677544.0, ans=0.125 2023-06-20 09:10:31,803 INFO [train.py:996] (1/4) Epoch 4, batch 21450, loss[loss=0.3108, simple_loss=0.3616, pruned_loss=0.13, over 21847.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3227, pruned_loss=0.09677, over 4271478.73 frames. ], batch size: 118, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:10:49,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.725e+02 3.459e+02 4.369e+02 7.075e+02, threshold=6.919e+02, percent-clipped=1.0 2023-06-20 09:10:57,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-20 09:11:27,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=677724.0, ans=0.125 2023-06-20 09:12:06,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=677844.0, ans=0.1 2023-06-20 09:12:13,304 INFO [train.py:996] (1/4) Epoch 4, batch 21500, loss[loss=0.2451, simple_loss=0.2997, pruned_loss=0.09524, over 21397.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3215, pruned_loss=0.09777, over 4273594.28 frames. ], batch size: 131, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:12:31,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=677904.0, ans=0.125 2023-06-20 09:13:08,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=678024.0, ans=0.0 2023-06-20 09:13:36,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=678084.0, ans=0.125 2023-06-20 09:13:41,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=678144.0, ans=0.0 2023-06-20 09:13:56,017 INFO [train.py:996] (1/4) Epoch 4, batch 21550, loss[loss=0.1896, simple_loss=0.2503, pruned_loss=0.06439, over 21143.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3129, pruned_loss=0.09338, over 4264073.54 frames. 
], batch size: 548, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:14:04,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678204.0, ans=0.1 2023-06-20 09:14:09,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=678204.0, ans=0.125 2023-06-20 09:14:14,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.699e+02 3.223e+02 3.892e+02 7.035e+02, threshold=6.447e+02, percent-clipped=1.0 2023-06-20 09:14:39,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=678264.0, ans=0.125 2023-06-20 09:15:12,453 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:15:42,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=678444.0, ans=0.2 2023-06-20 09:15:45,545 INFO [train.py:996] (1/4) Epoch 4, batch 21600, loss[loss=0.2521, simple_loss=0.3207, pruned_loss=0.09174, over 21469.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3069, pruned_loss=0.09117, over 4263931.91 frames. ], batch size: 389, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:16:44,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=678624.0, ans=0.125 2023-06-20 09:16:57,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-20 09:17:03,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678684.0, ans=0.1 2023-06-20 09:17:12,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=678744.0, ans=0.0 2023-06-20 09:17:30,278 INFO [train.py:996] (1/4) Epoch 4, batch 21650, loss[loss=0.2554, simple_loss=0.3537, pruned_loss=0.07855, over 21810.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.31, pruned_loss=0.08828, over 4265504.36 frames. ], batch size: 371, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:17:53,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.809e+02 3.238e+02 3.645e+02 6.702e+02, threshold=6.475e+02, percent-clipped=1.0 2023-06-20 09:19:05,618 INFO [train.py:996] (1/4) Epoch 4, batch 21700, loss[loss=0.2372, simple_loss=0.2953, pruned_loss=0.08959, over 21433.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3109, pruned_loss=0.08627, over 4261816.72 frames. ], batch size: 194, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:19:07,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=679104.0, ans=0.0 2023-06-20 09:19:49,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-20 09:20:02,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-20 09:20:07,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. 
limit=15.0 2023-06-20 09:20:44,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=679344.0, ans=0.0 2023-06-20 09:20:47,248 INFO [train.py:996] (1/4) Epoch 4, batch 21750, loss[loss=0.2487, simple_loss=0.3044, pruned_loss=0.09651, over 21645.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3067, pruned_loss=0.08644, over 4247356.45 frames. ], batch size: 333, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:20:55,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=679404.0, ans=0.125 2023-06-20 09:20:57,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=679404.0, ans=0.0 2023-06-20 09:21:12,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.575e+02 3.289e+02 4.229e+02 7.703e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 09:21:17,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=679464.0, ans=0.2 2023-06-20 09:22:09,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=679584.0, ans=0.2 2023-06-20 09:22:16,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-20 09:22:31,482 INFO [train.py:996] (1/4) Epoch 4, batch 21800, loss[loss=0.287, simple_loss=0.3503, pruned_loss=0.1119, over 21595.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3056, pruned_loss=0.08755, over 4229638.10 frames. ], batch size: 442, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:22:38,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=679704.0, ans=0.125 2023-06-20 09:23:07,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=679764.0, ans=0.1 2023-06-20 09:23:16,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=679764.0, ans=0.0 2023-06-20 09:24:08,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=679944.0, ans=0.0 2023-06-20 09:24:19,743 INFO [train.py:996] (1/4) Epoch 4, batch 21850, loss[loss=0.2629, simple_loss=0.3308, pruned_loss=0.09752, over 21846.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3131, pruned_loss=0.08929, over 4233405.30 frames. ], batch size: 351, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:24:39,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.758e+02 3.560e+02 4.592e+02 6.859e+02, threshold=7.120e+02, percent-clipped=3.0 2023-06-20 09:24:47,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=680064.0, ans=0.2 2023-06-20 09:24:57,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=680064.0, ans=0.125 2023-06-20 09:26:00,375 INFO [train.py:996] (1/4) Epoch 4, batch 21900, loss[loss=0.2285, simple_loss=0.2922, pruned_loss=0.08244, over 21242.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3139, pruned_loss=0.09061, over 4245132.90 frames. 
], batch size: 176, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:26:26,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=680364.0, ans=0.125 2023-06-20 09:26:33,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=680364.0, ans=0.2 2023-06-20 09:26:36,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=680364.0, ans=0.125 2023-06-20 09:26:45,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=680424.0, ans=0.0 2023-06-20 09:26:45,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=680424.0, ans=0.125 2023-06-20 09:27:42,091 INFO [train.py:996] (1/4) Epoch 4, batch 21950, loss[loss=0.1968, simple_loss=0.2575, pruned_loss=0.06808, over 21234.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3104, pruned_loss=0.09005, over 4250151.08 frames. ], batch size: 176, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:28:06,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.677e+02 3.027e+02 3.757e+02 5.142e+02, threshold=6.054e+02, percent-clipped=0.0 2023-06-20 09:28:21,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=680664.0, ans=0.1 2023-06-20 09:28:21,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=680664.0, ans=0.1 2023-06-20 09:29:17,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=680904.0, ans=0.0 2023-06-20 09:29:24,061 INFO [train.py:996] (1/4) Epoch 4, batch 22000, loss[loss=0.2083, simple_loss=0.2691, pruned_loss=0.07372, over 21246.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3034, pruned_loss=0.08686, over 4241373.07 frames. ], batch size: 144, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:29:31,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=680904.0, ans=0.125 2023-06-20 09:29:40,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-20 09:29:40,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.85 vs. limit=5.0 2023-06-20 09:30:01,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-20 09:31:13,113 INFO [train.py:996] (1/4) Epoch 4, batch 22050, loss[loss=0.3013, simple_loss=0.3669, pruned_loss=0.1178, over 21327.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3042, pruned_loss=0.08711, over 4219701.68 frames. 
], batch size: 159, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:31:16,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=681204.0, ans=0.1 2023-06-20 09:31:20,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=681204.0, ans=0.2 2023-06-20 09:31:23,743 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:31:32,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=681204.0, ans=0.2 2023-06-20 09:31:33,502 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.650e+02 3.249e+02 4.022e+02 7.710e+02, threshold=6.498e+02, percent-clipped=6.0 2023-06-20 09:32:00,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=681324.0, ans=0.0 2023-06-20 09:32:38,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-20 09:32:57,307 INFO [train.py:996] (1/4) Epoch 4, batch 22100, loss[loss=0.2425, simple_loss=0.3132, pruned_loss=0.08592, over 21631.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3191, pruned_loss=0.09306, over 4227539.89 frames. ], batch size: 263, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:33:01,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=681504.0, ans=0.125 2023-06-20 09:33:09,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=681504.0, ans=0.2 2023-06-20 09:34:38,829 INFO [train.py:996] (1/4) Epoch 4, batch 22150, loss[loss=0.2742, simple_loss=0.3405, pruned_loss=0.1039, over 21901.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3235, pruned_loss=0.09635, over 4242160.25 frames. ], batch size: 351, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:34:56,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=681804.0, ans=0.1 2023-06-20 09:34:57,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.000e+02 3.535e+02 4.245e+02 7.467e+02, threshold=7.071e+02, percent-clipped=4.0 2023-06-20 09:35:10,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=681864.0, ans=0.09899494936611666 2023-06-20 09:36:21,212 INFO [train.py:996] (1/4) Epoch 4, batch 22200, loss[loss=0.2845, simple_loss=0.3683, pruned_loss=0.1003, over 21792.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3251, pruned_loss=0.09785, over 4257613.13 frames. ], batch size: 414, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:36:56,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=682164.0, ans=15.0 2023-06-20 09:37:53,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=682344.0, ans=0.0 2023-06-20 09:38:04,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. 
limit=22.5 2023-06-20 09:38:08,606 INFO [train.py:996] (1/4) Epoch 4, batch 22250, loss[loss=0.289, simple_loss=0.4102, pruned_loss=0.08397, over 19785.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3333, pruned_loss=0.09944, over 4259601.20 frames. ], batch size: 702, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:38:18,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=682404.0, ans=0.0 2023-06-20 09:38:23,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.823e+02 3.649e+02 4.510e+02 8.047e+02, threshold=7.298e+02, percent-clipped=1.0 2023-06-20 09:38:46,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=682524.0, ans=0.125 2023-06-20 09:39:12,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=682584.0, ans=0.0 2023-06-20 09:39:14,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682584.0, ans=0.1 2023-06-20 09:39:20,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=682584.0, ans=0.125 2023-06-20 09:39:30,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=682644.0, ans=0.125 2023-06-20 09:39:49,748 INFO [train.py:996] (1/4) Epoch 4, batch 22300, loss[loss=0.2663, simple_loss=0.3275, pruned_loss=0.1025, over 21877.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3368, pruned_loss=0.1025, over 4258137.55 frames. ], batch size: 332, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:39:51,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=682704.0, ans=0.0 2023-06-20 09:39:54,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=682704.0, ans=0.1 2023-06-20 09:40:29,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=682824.0, ans=0.5 2023-06-20 09:40:38,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=682824.0, ans=0.0 2023-06-20 09:41:31,556 INFO [train.py:996] (1/4) Epoch 4, batch 22350, loss[loss=0.2485, simple_loss=0.3048, pruned_loss=0.09607, over 21377.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3351, pruned_loss=0.1036, over 4269700.71 frames. 
], batch size: 176, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:41:33,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683004.0, ans=0.1 2023-06-20 09:41:46,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.048e+02 3.517e+02 4.689e+02 8.292e+02, threshold=7.034e+02, percent-clipped=3.0 2023-06-20 09:41:47,068 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:42:14,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=683124.0, ans=0.125 2023-06-20 09:42:38,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=683184.0, ans=0.125 2023-06-20 09:43:13,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=683244.0, ans=0.125 2023-06-20 09:43:16,272 INFO [train.py:996] (1/4) Epoch 4, batch 22400, loss[loss=0.2219, simple_loss=0.2871, pruned_loss=0.07838, over 21618.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3306, pruned_loss=0.09964, over 4276647.29 frames. ], batch size: 263, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:43:48,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 09:43:54,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=683424.0, ans=0.125 2023-06-20 09:43:58,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=683424.0, ans=0.0 2023-06-20 09:44:16,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=683424.0, ans=0.07 2023-06-20 09:44:26,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=683484.0, ans=0.125 2023-06-20 09:44:44,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=683544.0, ans=0.2 2023-06-20 09:44:58,367 INFO [train.py:996] (1/4) Epoch 4, batch 22450, loss[loss=0.2213, simple_loss=0.2766, pruned_loss=0.08302, over 21539.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3248, pruned_loss=0.09854, over 4266834.58 frames. ], batch size: 263, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:45:06,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683604.0, ans=0.1 2023-06-20 09:45:14,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.32 vs. 
limit=12.0 2023-06-20 09:45:15,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=683604.0, ans=0.5 2023-06-20 09:45:18,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.635e+02 3.000e+02 3.519e+02 5.856e+02, threshold=6.001e+02, percent-clipped=0.0 2023-06-20 09:45:22,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=683664.0, ans=0.125 2023-06-20 09:45:49,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=683724.0, ans=0.125 2023-06-20 09:46:09,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683784.0, ans=0.1 2023-06-20 09:46:48,145 INFO [train.py:996] (1/4) Epoch 4, batch 22500, loss[loss=0.3344, simple_loss=0.381, pruned_loss=0.1439, over 21326.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3194, pruned_loss=0.09793, over 4272234.67 frames. ], batch size: 471, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:46:52,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.01 vs. limit=15.0 2023-06-20 09:46:54,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683904.0, ans=0.1 2023-06-20 09:48:30,843 INFO [train.py:996] (1/4) Epoch 4, batch 22550, loss[loss=0.2673, simple_loss=0.3326, pruned_loss=0.101, over 21669.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3255, pruned_loss=0.09898, over 4275238.41 frames. ], batch size: 263, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:48:45,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.905e+02 3.340e+02 4.222e+02 9.344e+02, threshold=6.680e+02, percent-clipped=7.0 2023-06-20 09:48:46,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-20 09:48:52,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=684264.0, ans=0.2 2023-06-20 09:49:21,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684324.0, ans=0.1 2023-06-20 09:49:36,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=684384.0, ans=0.0 2023-06-20 09:49:46,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=684384.0, ans=0.2 2023-06-20 09:50:00,046 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:50:14,893 INFO [train.py:996] (1/4) Epoch 4, batch 22600, loss[loss=0.2259, simple_loss=0.2886, pruned_loss=0.08156, over 21480.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3253, pruned_loss=0.09836, over 4278008.40 frames. 
], batch size: 211, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:50:23,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=684504.0, ans=0.0 2023-06-20 09:50:37,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=684564.0, ans=0.125 2023-06-20 09:50:37,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=684564.0, ans=0.1 2023-06-20 09:50:48,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=684564.0, ans=0.125 2023-06-20 09:51:56,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=12.0 2023-06-20 09:51:57,233 INFO [train.py:996] (1/4) Epoch 4, batch 22650, loss[loss=0.2967, simple_loss=0.4014, pruned_loss=0.09602, over 20804.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3233, pruned_loss=0.09764, over 4268914.47 frames. ], batch size: 607, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:52:12,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.311e+02 3.814e+02 4.832e+02 8.626e+02, threshold=7.628e+02, percent-clipped=4.0 2023-06-20 09:52:36,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=684864.0, ans=0.0 2023-06-20 09:52:37,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=684864.0, ans=0.0 2023-06-20 09:52:40,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=684924.0, ans=0.125 2023-06-20 09:52:42,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=684924.0, ans=0.125 2023-06-20 09:52:49,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=684924.0, ans=0.125 2023-06-20 09:53:23,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2023-06-20 09:53:40,927 INFO [train.py:996] (1/4) Epoch 4, batch 22700, loss[loss=0.2503, simple_loss=0.3117, pruned_loss=0.09439, over 21842.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3166, pruned_loss=0.09661, over 4267594.88 frames. ], batch size: 118, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:54:13,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-20 09:54:26,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-20 09:54:27,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=685224.0, ans=0.0 2023-06-20 09:54:31,333 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:54:46,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=685284.0, ans=0.1 2023-06-20 09:54:51,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=685284.0, ans=0.125 2023-06-20 09:55:23,776 INFO [train.py:996] (1/4) Epoch 4, batch 22750, loss[loss=0.3408, simple_loss=0.3898, pruned_loss=0.1459, over 21592.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3187, pruned_loss=0.09931, over 4268677.16 frames. ], batch size: 389, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:55:43,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.761e+02 3.069e+02 3.774e+02 7.547e+02, threshold=6.137e+02, percent-clipped=0.0 2023-06-20 09:55:43,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=685464.0, ans=0.2 2023-06-20 09:55:50,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 09:55:53,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=685464.0, ans=0.125 2023-06-20 09:56:10,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=685524.0, ans=0.5 2023-06-20 09:56:36,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=685584.0, ans=0.125 2023-06-20 09:57:05,727 INFO [train.py:996] (1/4) Epoch 4, batch 22800, loss[loss=0.2655, simple_loss=0.3335, pruned_loss=0.09869, over 21842.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3233, pruned_loss=0.1012, over 4267209.96 frames. ], batch size: 124, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:05,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-20 09:58:15,023 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:58:47,340 INFO [train.py:996] (1/4) Epoch 4, batch 22850, loss[loss=0.2893, simple_loss=0.319, pruned_loss=0.1298, over 21354.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3206, pruned_loss=0.1009, over 4267887.45 frames. ], batch size: 508, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:59:07,739 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.174e+02 3.905e+02 4.697e+02 7.560e+02, threshold=7.810e+02, percent-clipped=8.0 2023-06-20 09:59:10,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=686064.0, ans=0.125 2023-06-20 09:59:33,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. 
limit=15.0 2023-06-20 09:59:45,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-20 09:59:48,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=686124.0, ans=0.0 2023-06-20 09:59:55,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=686184.0, ans=0.0 2023-06-20 10:00:17,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-20 10:00:18,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=686244.0, ans=0.035 2023-06-20 10:00:18,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=686244.0, ans=0.125 2023-06-20 10:00:23,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=686244.0, ans=0.0 2023-06-20 10:00:31,665 INFO [train.py:996] (1/4) Epoch 4, batch 22900, loss[loss=0.3425, simple_loss=0.4264, pruned_loss=0.1293, over 21485.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3235, pruned_loss=0.1, over 4258882.11 frames. ], batch size: 507, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:01:00,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=686364.0, ans=0.1 2023-06-20 10:01:06,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=686364.0, ans=0.125 2023-06-20 10:01:09,998 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:01:10,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=686364.0, ans=0.125 2023-06-20 10:01:23,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=686424.0, ans=0.0 2023-06-20 10:01:54,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=686484.0, ans=0.125 2023-06-20 10:02:00,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=686544.0, ans=0.125 2023-06-20 10:02:22,294 INFO [train.py:996] (1/4) Epoch 4, batch 22950, loss[loss=0.2857, simple_loss=0.4088, pruned_loss=0.08126, over 21587.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3378, pruned_loss=0.09864, over 4261572.23 frames. ], batch size: 389, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:02:41,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.062e+02 3.428e+02 4.438e+02 8.217e+02, threshold=6.855e+02, percent-clipped=1.0 2023-06-20 10:04:09,116 INFO [train.py:996] (1/4) Epoch 4, batch 23000, loss[loss=0.2571, simple_loss=0.3113, pruned_loss=0.1015, over 21781.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3372, pruned_loss=0.09638, over 4263883.98 frames. 
], batch size: 247, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:04:29,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=686964.0, ans=0.125 2023-06-20 10:04:40,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=686964.0, ans=0.05 2023-06-20 10:05:52,759 INFO [train.py:996] (1/4) Epoch 4, batch 23050, loss[loss=0.3121, simple_loss=0.3641, pruned_loss=0.1301, over 21456.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3389, pruned_loss=0.09964, over 4271537.36 frames. ], batch size: 211, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:06:03,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=687204.0, ans=0.125 2023-06-20 10:06:12,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.843e+02 3.308e+02 4.395e+02 9.677e+02, threshold=6.616e+02, percent-clipped=9.0 2023-06-20 10:06:58,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-20 10:07:05,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-20 10:07:31,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=687444.0, ans=0.0 2023-06-20 10:07:35,953 INFO [train.py:996] (1/4) Epoch 4, batch 23100, loss[loss=0.2184, simple_loss=0.2723, pruned_loss=0.08229, over 21599.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3341, pruned_loss=0.09971, over 4273189.83 frames. ], batch size: 231, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:07:37,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=687504.0, ans=0.125 2023-06-20 10:07:44,758 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:08:30,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-20 10:08:34,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-20 10:09:06,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=687744.0, ans=0.125 2023-06-20 10:09:06,373 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:09:17,207 INFO [train.py:996] (1/4) Epoch 4, batch 23150, loss[loss=0.2557, simple_loss=0.3163, pruned_loss=0.09758, over 21846.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3274, pruned_loss=0.09848, over 4282340.14 frames. 
], batch size: 371, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:09:29,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=687804.0, ans=0.0 2023-06-20 10:09:36,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.820e+02 3.256e+02 3.906e+02 5.764e+02, threshold=6.513e+02, percent-clipped=0.0 2023-06-20 10:09:36,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=687864.0, ans=0.0 2023-06-20 10:10:53,452 INFO [train.py:996] (1/4) Epoch 4, batch 23200, loss[loss=0.3241, simple_loss=0.362, pruned_loss=0.1431, over 21808.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3268, pruned_loss=0.1003, over 4284278.00 frames. ], batch size: 508, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:11:42,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=688224.0, ans=0.125 2023-06-20 10:12:13,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=688344.0, ans=0.125 2023-06-20 10:12:36,032 INFO [train.py:996] (1/4) Epoch 4, batch 23250, loss[loss=0.255, simple_loss=0.3252, pruned_loss=0.09236, over 21881.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3261, pruned_loss=0.1005, over 4289713.57 frames. ], batch size: 118, lr: 7.69e-03, grad_scale: 16.0 2023-06-20 10:12:44,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=688404.0, ans=0.0 2023-06-20 10:12:57,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.915e+02 3.512e+02 4.464e+02 9.491e+02, threshold=7.024e+02, percent-clipped=1.0 2023-06-20 10:13:37,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=688584.0, ans=0.95 2023-06-20 10:13:41,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=688584.0, ans=0.0 2023-06-20 10:14:18,781 INFO [train.py:996] (1/4) Epoch 4, batch 23300, loss[loss=0.2813, simple_loss=0.3917, pruned_loss=0.08545, over 21855.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3344, pruned_loss=0.1028, over 4290429.65 frames. ], batch size: 316, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:14:29,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=688704.0, ans=0.0 2023-06-20 10:15:31,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=688884.0, ans=0.0 2023-06-20 10:16:03,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=15.0 2023-06-20 10:16:05,778 INFO [train.py:996] (1/4) Epoch 4, batch 23350, loss[loss=0.2885, simple_loss=0.3581, pruned_loss=0.1094, over 19873.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3384, pruned_loss=0.1016, over 4278512.94 frames. 
], batch size: 702, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:16:28,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.739e+02 3.338e+02 4.222e+02 6.703e+02, threshold=6.676e+02, percent-clipped=0.0 2023-06-20 10:16:49,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=689124.0, ans=0.125 2023-06-20 10:16:50,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=689124.0, ans=10.0 2023-06-20 10:16:53,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=689124.0, ans=0.125 2023-06-20 10:17:38,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=689244.0, ans=0.125 2023-06-20 10:17:42,535 INFO [train.py:996] (1/4) Epoch 4, batch 23400, loss[loss=0.23, simple_loss=0.2962, pruned_loss=0.08187, over 21777.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3283, pruned_loss=0.09559, over 4280811.31 frames. ], batch size: 247, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:18:42,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=689424.0, ans=0.2 2023-06-20 10:18:43,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=689424.0, ans=0.125 2023-06-20 10:19:30,570 INFO [train.py:996] (1/4) Epoch 4, batch 23450, loss[loss=0.302, simple_loss=0.3544, pruned_loss=0.1248, over 21440.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3296, pruned_loss=0.09861, over 4282735.51 frames. ], batch size: 471, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:19:48,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.746e+02 3.178e+02 4.019e+02 6.793e+02, threshold=6.356e+02, percent-clipped=1.0 2023-06-20 10:20:14,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=689724.0, ans=0.125 2023-06-20 10:20:42,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-20 10:20:48,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=689844.0, ans=0.0 2023-06-20 10:21:08,118 INFO [train.py:996] (1/4) Epoch 4, batch 23500, loss[loss=0.2411, simple_loss=0.3095, pruned_loss=0.08631, over 21859.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3312, pruned_loss=0.1013, over 4282088.17 frames. ], batch size: 124, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:21:23,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=689964.0, ans=0.2 2023-06-20 10:21:40,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.82 vs. 
limit=10.0 2023-06-20 10:22:09,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=690084.0, ans=0.125 2023-06-20 10:22:46,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=690144.0, ans=0.125 2023-06-20 10:22:49,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=690144.0, ans=0.2 2023-06-20 10:22:52,375 INFO [train.py:996] (1/4) Epoch 4, batch 23550, loss[loss=0.219, simple_loss=0.2757, pruned_loss=0.08121, over 21588.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3251, pruned_loss=0.1005, over 4272305.89 frames. ], batch size: 263, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:23:03,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-20 10:23:10,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.904e+02 3.219e+02 3.862e+02 7.198e+02, threshold=6.438e+02, percent-clipped=1.0 2023-06-20 10:24:02,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=690384.0, ans=0.2 2023-06-20 10:24:26,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-20 10:24:30,326 INFO [train.py:996] (1/4) Epoch 4, batch 23600, loss[loss=0.297, simple_loss=0.3574, pruned_loss=0.1183, over 21737.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.327, pruned_loss=0.1007, over 4272814.22 frames. ], batch size: 441, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:24:31,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=12.0 2023-06-20 10:24:32,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=690504.0, ans=0.2 2023-06-20 10:24:55,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=690564.0, ans=0.1 2023-06-20 10:26:10,330 INFO [train.py:996] (1/4) Epoch 4, batch 23650, loss[loss=0.2346, simple_loss=0.3194, pruned_loss=0.07489, over 20681.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3261, pruned_loss=0.09791, over 4279606.80 frames. ], batch size: 607, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:26:10,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=690804.0, ans=0.1 2023-06-20 10:26:38,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 3.037e+02 3.480e+02 4.305e+02 8.157e+02, threshold=6.960e+02, percent-clipped=7.0 2023-06-20 10:26:50,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=690864.0, ans=0.0 2023-06-20 10:27:53,235 INFO [train.py:996] (1/4) Epoch 4, batch 23700, loss[loss=0.2961, simple_loss=0.3563, pruned_loss=0.1179, over 21908.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3306, pruned_loss=0.0984, over 4282099.96 frames. 
], batch size: 372, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:27:55,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=691104.0, ans=0.0 2023-06-20 10:28:28,349 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:28:31,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=691164.0, ans=0.125 2023-06-20 10:29:18,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=691284.0, ans=0.125 2023-06-20 10:29:21,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=691344.0, ans=0.0 2023-06-20 10:29:28,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=691344.0, ans=0.125 2023-06-20 10:29:48,028 INFO [train.py:996] (1/4) Epoch 4, batch 23750, loss[loss=0.2146, simple_loss=0.3339, pruned_loss=0.04766, over 20821.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3339, pruned_loss=0.09973, over 4283426.35 frames. ], batch size: 607, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:29:49,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=691404.0, ans=0.04949747468305833 2023-06-20 10:30:01,654 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:30:11,540 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.867e+02 3.217e+02 4.313e+02 7.122e+02, threshold=6.434e+02, percent-clipped=1.0 2023-06-20 10:30:59,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691584.0, ans=0.1 2023-06-20 10:31:16,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=691644.0, ans=0.125 2023-06-20 10:31:37,962 INFO [train.py:996] (1/4) Epoch 4, batch 23800, loss[loss=0.2777, simple_loss=0.3615, pruned_loss=0.09691, over 21584.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3319, pruned_loss=0.09697, over 4281160.71 frames. ], batch size: 263, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:32:16,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=691824.0, ans=0.0 2023-06-20 10:33:00,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.81 vs. limit=22.5 2023-06-20 10:33:09,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 10:33:10,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=691944.0, ans=0.0 2023-06-20 10:33:23,366 INFO [train.py:996] (1/4) Epoch 4, batch 23850, loss[loss=0.2931, simple_loss=0.3588, pruned_loss=0.1137, over 21678.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3421, pruned_loss=0.09985, over 4284526.07 frames. 
], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:33:28,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=692004.0, ans=0.2 2023-06-20 10:33:48,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.013e+02 3.868e+02 5.325e+02 1.077e+03, threshold=7.737e+02, percent-clipped=14.0 2023-06-20 10:34:05,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=692124.0, ans=0.0 2023-06-20 10:35:12,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=692304.0, ans=0.125 2023-06-20 10:35:13,521 INFO [train.py:996] (1/4) Epoch 4, batch 23900, loss[loss=0.2475, simple_loss=0.3226, pruned_loss=0.08619, over 21363.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3487, pruned_loss=0.1025, over 4276291.97 frames. ], batch size: 131, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:52,305 INFO [train.py:996] (1/4) Epoch 4, batch 23950, loss[loss=0.318, simple_loss=0.3679, pruned_loss=0.134, over 21602.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3415, pruned_loss=0.1016, over 4270411.36 frames. ], batch size: 391, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:37:07,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=692664.0, ans=0.0 2023-06-20 10:37:10,664 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.929e+02 3.410e+02 4.340e+02 7.845e+02, threshold=6.819e+02, percent-clipped=1.0 2023-06-20 10:37:15,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=692664.0, ans=0.05 2023-06-20 10:37:25,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=692724.0, ans=0.0 2023-06-20 10:37:44,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-20 10:38:23,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=692844.0, ans=0.2 2023-06-20 10:38:36,278 INFO [train.py:996] (1/4) Epoch 4, batch 24000, loss[loss=0.2893, simple_loss=0.3572, pruned_loss=0.1107, over 21396.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3427, pruned_loss=0.1048, over 4275443.22 frames. ], batch size: 176, lr: 7.66e-03, grad_scale: 32.0 2023-06-20 10:38:36,278 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 10:38:50,996 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7150, 3.1617, 3.2587, 2.9906], device='cuda:1') 2023-06-20 10:38:53,759 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2722, simple_loss=0.3716, pruned_loss=0.08645, over 1796401.00 frames. 2023-06-20 10:38:53,760 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 10:38:54,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=22.5 2023-06-20 10:38:56,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.04 vs. 
limit=12.0 2023-06-20 10:39:33,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=692964.0, ans=0.125 2023-06-20 10:39:49,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=693024.0, ans=0.1 2023-06-20 10:39:53,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-20 10:40:21,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=693144.0, ans=0.04949747468305833 2023-06-20 10:40:25,058 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:40:32,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693144.0, ans=0.125 2023-06-20 10:40:37,763 INFO [train.py:996] (1/4) Epoch 4, batch 24050, loss[loss=0.2717, simple_loss=0.3546, pruned_loss=0.09434, over 21684.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3428, pruned_loss=0.1041, over 4278658.21 frames. ], batch size: 441, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:41:03,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.885e+02 3.459e+02 4.129e+02 6.625e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-20 10:41:16,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693264.0, ans=0.1 2023-06-20 10:42:01,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=693384.0, ans=0.125 2023-06-20 10:42:16,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=693444.0, ans=0.2 2023-06-20 10:42:21,495 INFO [train.py:996] (1/4) Epoch 4, batch 24100, loss[loss=0.2595, simple_loss=0.3186, pruned_loss=0.1003, over 20122.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.343, pruned_loss=0.1022, over 4272041.68 frames. ], batch size: 702, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:42:22,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=693504.0, ans=0.125 2023-06-20 10:42:25,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-20 10:42:48,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=693564.0, ans=0.125 2023-06-20 10:42:48,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. 
limit=10.0 2023-06-20 10:42:53,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=693564.0, ans=0.125 2023-06-20 10:43:35,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=693684.0, ans=0.125 2023-06-20 10:43:40,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=693684.0, ans=0.125 2023-06-20 10:43:49,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=693744.0, ans=0.05 2023-06-20 10:44:03,689 INFO [train.py:996] (1/4) Epoch 4, batch 24150, loss[loss=0.3077, simple_loss=0.3498, pruned_loss=0.1328, over 21716.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3424, pruned_loss=0.1043, over 4279423.80 frames. ], batch size: 473, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:44:29,076 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:44:38,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.099e+02 3.647e+02 4.955e+02 8.844e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-20 10:44:50,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=693924.0, ans=0.0 2023-06-20 10:44:55,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=693924.0, ans=0.2 2023-06-20 10:45:06,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=693924.0, ans=0.125 2023-06-20 10:45:11,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 10:45:23,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=693984.0, ans=0.125 2023-06-20 10:45:35,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=694044.0, ans=0.125 2023-06-20 10:45:45,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=694104.0, ans=0.125 2023-06-20 10:45:52,568 INFO [train.py:996] (1/4) Epoch 4, batch 24200, loss[loss=0.2471, simple_loss=0.3336, pruned_loss=0.08026, over 21631.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3448, pruned_loss=0.1065, over 4284227.29 frames. ], batch size: 263, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:46:31,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=694164.0, ans=0.0 2023-06-20 10:47:29,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=694344.0, ans=0.1 2023-06-20 10:47:46,455 INFO [train.py:996] (1/4) Epoch 4, batch 24250, loss[loss=0.2392, simple_loss=0.3404, pruned_loss=0.06904, over 21640.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3404, pruned_loss=0.0979, over 4279322.64 frames. 
], batch size: 389, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:48:11,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.800e+02 3.363e+02 4.220e+02 7.304e+02, threshold=6.726e+02, percent-clipped=1.0 2023-06-20 10:49:24,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=694644.0, ans=0.0 2023-06-20 10:49:28,806 INFO [train.py:996] (1/4) Epoch 4, batch 24300, loss[loss=0.1666, simple_loss=0.2444, pruned_loss=0.04436, over 21500.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3325, pruned_loss=0.09151, over 4275866.99 frames. ], batch size: 212, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:49:29,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:49:30,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:49:39,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:50:01,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-20 10:50:28,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=694884.0, ans=0.125 2023-06-20 10:51:12,285 INFO [train.py:996] (1/4) Epoch 4, batch 24350, loss[loss=0.3549, simple_loss=0.3958, pruned_loss=0.157, over 21604.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3286, pruned_loss=0.09195, over 4283905.52 frames. ], batch size: 507, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:51:25,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695004.0, ans=0.1 2023-06-20 10:51:37,143 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.964e+02 3.769e+02 4.911e+02 1.046e+03, threshold=7.538e+02, percent-clipped=11.0 2023-06-20 10:51:44,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=695064.0, ans=0.0 2023-06-20 10:51:48,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-20 10:52:13,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. limit=10.0 2023-06-20 10:52:34,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=695244.0, ans=0.0 2023-06-20 10:52:42,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=695244.0, ans=0.2 2023-06-20 10:52:56,596 INFO [train.py:996] (1/4) Epoch 4, batch 24400, loss[loss=0.2795, simple_loss=0.3539, pruned_loss=0.1025, over 21858.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3336, pruned_loss=0.09565, over 4278144.90 frames. 
], batch size: 372, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:53:49,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=695424.0, ans=0.0 2023-06-20 10:54:03,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=695484.0, ans=0.125 2023-06-20 10:54:39,017 INFO [train.py:996] (1/4) Epoch 4, batch 24450, loss[loss=0.2222, simple_loss=0.3038, pruned_loss=0.0703, over 21544.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.334, pruned_loss=0.09668, over 4275225.96 frames. ], batch size: 230, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:54:52,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=695604.0, ans=0.125 2023-06-20 10:54:59,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.828e+02 3.266e+02 3.769e+02 5.234e+02, threshold=6.531e+02, percent-clipped=0.0 2023-06-20 10:55:04,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=695664.0, ans=0.125 2023-06-20 10:56:10,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=695844.0, ans=0.0 2023-06-20 10:56:11,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-20 10:56:21,818 INFO [train.py:996] (1/4) Epoch 4, batch 24500, loss[loss=0.2804, simple_loss=0.3403, pruned_loss=0.1102, over 21856.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3345, pruned_loss=0.09662, over 4279887.52 frames. ], batch size: 414, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:56:22,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=695904.0, ans=0.0 2023-06-20 10:56:35,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695904.0, ans=0.1 2023-06-20 10:57:31,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=696084.0, ans=0.125 2023-06-20 10:57:50,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-20 10:57:57,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696144.0, ans=0.0 2023-06-20 10:58:02,717 INFO [train.py:996] (1/4) Epoch 4, batch 24550, loss[loss=0.3522, simple_loss=0.4088, pruned_loss=0.1478, over 21452.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.338, pruned_loss=0.09937, over 4284650.83 frames. 
], batch size: 471, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:58:04,617 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:58:21,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.855e+02 3.314e+02 4.015e+02 6.051e+02, threshold=6.629e+02, percent-clipped=0.0 2023-06-20 10:58:24,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=696264.0, ans=0.0 2023-06-20 10:59:16,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696384.0, ans=0.1 2023-06-20 10:59:37,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=696444.0, ans=0.125 2023-06-20 10:59:40,418 INFO [train.py:996] (1/4) Epoch 4, batch 24600, loss[loss=0.2774, simple_loss=0.3258, pruned_loss=0.1145, over 21822.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3343, pruned_loss=0.1002, over 4286387.28 frames. ], batch size: 352, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:59:48,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696504.0, ans=0.1 2023-06-20 10:59:49,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=696504.0, ans=0.125 2023-06-20 11:00:02,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696564.0, ans=0.1 2023-06-20 11:00:24,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=696624.0, ans=0.125 2023-06-20 11:00:52,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-06-20 11:00:55,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=696684.0, ans=0.125 2023-06-20 11:00:57,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=696684.0, ans=0.1 2023-06-20 11:01:18,235 INFO [train.py:996] (1/4) Epoch 4, batch 24650, loss[loss=0.2639, simple_loss=0.3152, pruned_loss=0.1064, over 21583.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3272, pruned_loss=0.09857, over 4282312.73 frames. ], batch size: 415, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 11:01:38,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=696864.0, ans=0.125 2023-06-20 11:01:39,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.151e+02 3.751e+02 4.902e+02 9.106e+02, threshold=7.501e+02, percent-clipped=6.0 2023-06-20 11:02:13,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=696924.0, ans=0.2 2023-06-20 11:03:01,116 INFO [train.py:996] (1/4) Epoch 4, batch 24700, loss[loss=0.2246, simple_loss=0.2946, pruned_loss=0.07724, over 20711.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3256, pruned_loss=0.09698, over 4277602.44 frames. 
], batch size: 607, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:04:17,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=697284.0, ans=0.125 2023-06-20 11:04:28,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-20 11:04:38,953 INFO [train.py:996] (1/4) Epoch 4, batch 24750, loss[loss=0.2359, simple_loss=0.2939, pruned_loss=0.08893, over 21744.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.319, pruned_loss=0.09392, over 4269746.14 frames. ], batch size: 112, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:05:00,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.625e+02 3.082e+02 3.571e+02 6.291e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-20 11:05:00,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=697464.0, ans=0.125 2023-06-20 11:05:10,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=697464.0, ans=0.1 2023-06-20 11:05:23,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=697524.0, ans=0.02 2023-06-20 11:05:56,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-20 11:06:06,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-20 11:06:22,080 INFO [train.py:996] (1/4) Epoch 4, batch 24800, loss[loss=0.2698, simple_loss=0.3252, pruned_loss=0.1072, over 21868.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3142, pruned_loss=0.09384, over 4262508.28 frames. ], batch size: 371, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:06:24,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-20 11:06:25,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=697704.0, ans=0.125 2023-06-20 11:06:32,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=697704.0, ans=0.2 2023-06-20 11:07:36,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:56,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=697944.0, ans=0.125 2023-06-20 11:08:01,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=697944.0, ans=0.125 2023-06-20 11:08:01,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=697944.0, ans=0.0 2023-06-20 11:08:05,493 INFO [train.py:996] (1/4) Epoch 4, batch 24850, loss[loss=0.2692, simple_loss=0.3326, pruned_loss=0.1029, over 21910.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3158, pruned_loss=0.09597, over 4271060.75 frames. 
], batch size: 316, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:08:14,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=698004.0, ans=0.0 2023-06-20 11:08:27,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.980e+02 3.585e+02 4.162e+02 8.983e+02, threshold=7.171e+02, percent-clipped=3.0 2023-06-20 11:08:51,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=698124.0, ans=0.0 2023-06-20 11:09:01,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:01,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:21,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=698184.0, ans=0.125 2023-06-20 11:09:31,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=698244.0, ans=0.125 2023-06-20 11:09:40,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=698244.0, ans=0.07 2023-06-20 11:09:43,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=698244.0, ans=0.04949747468305833 2023-06-20 11:09:46,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=698244.0, ans=0.125 2023-06-20 11:09:49,563 INFO [train.py:996] (1/4) Epoch 4, batch 24900, loss[loss=0.2308, simple_loss=0.3155, pruned_loss=0.0731, over 21849.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3178, pruned_loss=0.09682, over 4269805.60 frames. ], batch size: 316, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:11:13,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-20 11:11:17,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=698544.0, ans=0.0 2023-06-20 11:11:29,321 INFO [train.py:996] (1/4) Epoch 4, batch 24950, loss[loss=0.3216, simple_loss=0.38, pruned_loss=0.1316, over 21586.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3277, pruned_loss=0.1021, over 4273855.37 frames. ], batch size: 389, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:11:38,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-20 11:11:54,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698604.0, ans=0.1 2023-06-20 11:12:12,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.311e+02 3.992e+02 4.985e+02 7.150e+02, threshold=7.983e+02, percent-clipped=0.0 2023-06-20 11:12:33,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=698724.0, ans=0.125 2023-06-20 11:13:20,088 INFO [train.py:996] (1/4) Epoch 4, batch 25000, loss[loss=0.2766, simple_loss=0.3433, pruned_loss=0.105, over 21755.00 frames. 
], tot_loss[loss=0.2727, simple_loss=0.3362, pruned_loss=0.1046, over 4274710.84 frames. ], batch size: 282, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:13:22,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=698904.0, ans=0.125 2023-06-20 11:13:23,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=698904.0, ans=0.125 2023-06-20 11:13:48,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=698964.0, ans=0.0 2023-06-20 11:14:08,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-20 11:14:13,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=699024.0, ans=0.2 2023-06-20 11:14:24,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=699084.0, ans=0.0 2023-06-20 11:14:42,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=699144.0, ans=0.125 2023-06-20 11:15:02,925 INFO [train.py:996] (1/4) Epoch 4, batch 25050, loss[loss=0.2395, simple_loss=0.2983, pruned_loss=0.0904, over 21807.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3274, pruned_loss=0.1021, over 4277092.58 frames. ], batch size: 107, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:15:18,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.95 vs. limit=15.0 2023-06-20 11:15:40,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.775e+02 3.166e+02 3.769e+02 6.146e+02, threshold=6.333e+02, percent-clipped=0.0 2023-06-20 11:15:47,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.06 vs. limit=22.5 2023-06-20 11:15:56,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=699324.0, ans=0.2 2023-06-20 11:16:05,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-20 11:16:30,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=22.5 2023-06-20 11:16:36,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=699444.0, ans=0.125 2023-06-20 11:16:47,487 INFO [train.py:996] (1/4) Epoch 4, batch 25100, loss[loss=0.2243, simple_loss=0.3091, pruned_loss=0.06972, over 21408.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3206, pruned_loss=0.1002, over 4272005.44 frames. 
], batch size: 211, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:17:22,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=699564.0, ans=0.2 2023-06-20 11:17:41,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=699624.0, ans=0.1 2023-06-20 11:18:12,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-06-20 11:18:29,719 INFO [train.py:996] (1/4) Epoch 4, batch 25150, loss[loss=0.2392, simple_loss=0.3213, pruned_loss=0.07849, over 21867.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3237, pruned_loss=0.09784, over 4263901.73 frames. ], batch size: 333, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:18:31,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=699804.0, ans=0.125 2023-06-20 11:18:41,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-20 11:19:00,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.678e+02 3.105e+02 3.619e+02 6.270e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 11:19:07,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=699864.0, ans=0.125 2023-06-20 11:19:45,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=699984.0, ans=0.125 2023-06-20 11:19:49,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-20 11:19:55,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=700044.0, ans=0.125 2023-06-20 11:19:56,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=700044.0, ans=0.025 2023-06-20 11:20:06,409 INFO [train.py:996] (1/4) Epoch 4, batch 25200, loss[loss=0.2233, simple_loss=0.2777, pruned_loss=0.0845, over 20372.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.322, pruned_loss=0.09454, over 4261496.69 frames. 
], batch size: 703, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:20:38,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=700164.0, ans=0.125 2023-06-20 11:20:57,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=700224.0, ans=0.05 2023-06-20 11:20:57,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=700224.0, ans=10.0 2023-06-20 11:21:13,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=700284.0, ans=15.0 2023-06-20 11:21:16,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=700284.0, ans=0.125 2023-06-20 11:21:34,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-20 11:21:43,472 INFO [train.py:996] (1/4) Epoch 4, batch 25250, loss[loss=0.2207, simple_loss=0.2826, pruned_loss=0.07941, over 21523.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3205, pruned_loss=0.09214, over 4262890.38 frames. ], batch size: 230, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:22:20,661 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.744e+02 3.112e+02 3.810e+02 6.947e+02, threshold=6.224e+02, percent-clipped=3.0 2023-06-20 11:22:49,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=700524.0, ans=0.0 2023-06-20 11:22:50,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-20 11:23:03,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=700584.0, ans=0.125 2023-06-20 11:23:32,828 INFO [train.py:996] (1/4) Epoch 4, batch 25300, loss[loss=0.2395, simple_loss=0.3169, pruned_loss=0.08104, over 21711.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3197, pruned_loss=0.0928, over 4264305.61 frames. ], batch size: 298, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:24:03,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=700764.0, ans=0.5 2023-06-20 11:24:12,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=700764.0, ans=0.2 2023-06-20 11:24:12,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. 
limit=22.5 2023-06-20 11:24:17,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=700764.0, ans=0.2 2023-06-20 11:24:26,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=700824.0, ans=15.0 2023-06-20 11:24:39,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=700884.0, ans=0.125 2023-06-20 11:25:26,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=701004.0, ans=0.1 2023-06-20 11:25:27,710 INFO [train.py:996] (1/4) Epoch 4, batch 25350, loss[loss=0.1988, simple_loss=0.2697, pruned_loss=0.06393, over 21516.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3194, pruned_loss=0.0916, over 4265125.99 frames. ], batch size: 195, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:25:33,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-20 11:25:55,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.721e+02 3.100e+02 3.889e+02 7.002e+02, threshold=6.200e+02, percent-clipped=1.0 2023-06-20 11:26:13,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=701124.0, ans=0.0 2023-06-20 11:26:22,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.04 vs. limit=15.0 2023-06-20 11:26:30,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=701184.0, ans=0.0 2023-06-20 11:26:32,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-20 11:27:04,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=701304.0, ans=0.0 2023-06-20 11:27:05,566 INFO [train.py:996] (1/4) Epoch 4, batch 25400, loss[loss=0.282, simple_loss=0.3354, pruned_loss=0.1143, over 21568.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3149, pruned_loss=0.09046, over 4267618.52 frames. ], batch size: 441, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:27:39,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=701364.0, ans=0.125 2023-06-20 11:28:04,995 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:28:40,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-20 11:28:42,942 INFO [train.py:996] (1/4) Epoch 4, batch 25450, loss[loss=0.2302, simple_loss=0.3194, pruned_loss=0.07055, over 21659.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3168, pruned_loss=0.09197, over 4273565.11 frames. 
], batch size: 247, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:29:17,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.926e+02 3.539e+02 4.386e+02 7.693e+02, threshold=7.077e+02, percent-clipped=6.0 2023-06-20 11:29:26,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=701724.0, ans=0.125 2023-06-20 11:29:46,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=701784.0, ans=0.1 2023-06-20 11:30:27,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=701844.0, ans=0.125 2023-06-20 11:30:33,803 INFO [train.py:996] (1/4) Epoch 4, batch 25500, loss[loss=0.2582, simple_loss=0.3476, pruned_loss=0.08441, over 21732.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3182, pruned_loss=0.0901, over 4270572.14 frames. ], batch size: 298, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:31:20,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-20 11:31:22,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=702024.0, ans=0.2 2023-06-20 11:31:44,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=702144.0, ans=0.0 2023-06-20 11:31:51,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=702144.0, ans=0.0 2023-06-20 11:32:16,665 INFO [train.py:996] (1/4) Epoch 4, batch 25550, loss[loss=0.2688, simple_loss=0.3637, pruned_loss=0.08689, over 21879.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3228, pruned_loss=0.08961, over 4254041.79 frames. ], batch size: 371, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:32:17,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=702204.0, ans=0.0 2023-06-20 11:32:33,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=702204.0, ans=0.125 2023-06-20 11:32:45,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.927e+02 3.531e+02 4.668e+02 7.861e+02, threshold=7.061e+02, percent-clipped=1.0 2023-06-20 11:33:20,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=12.0 2023-06-20 11:33:50,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-20 11:34:05,301 INFO [train.py:996] (1/4) Epoch 4, batch 25600, loss[loss=0.3146, simple_loss=0.3767, pruned_loss=0.1263, over 21253.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3265, pruned_loss=0.09016, over 4258289.31 frames. ], batch size: 143, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:34:22,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.77 vs. 
limit=22.5 2023-06-20 11:34:25,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=702564.0, ans=0.2 2023-06-20 11:34:31,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=702564.0, ans=0.1 2023-06-20 11:35:47,372 INFO [train.py:996] (1/4) Epoch 4, batch 25650, loss[loss=0.3169, simple_loss=0.4401, pruned_loss=0.09687, over 19887.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3288, pruned_loss=0.09384, over 4252310.64 frames. ], batch size: 702, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:36:10,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.974e+02 3.726e+02 4.769e+02 9.123e+02, threshold=7.452e+02, percent-clipped=4.0 2023-06-20 11:36:35,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=702984.0, ans=0.0 2023-06-20 11:36:48,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-20 11:37:31,120 INFO [train.py:996] (1/4) Epoch 4, batch 25700, loss[loss=0.3138, simple_loss=0.3545, pruned_loss=0.1365, over 21727.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3269, pruned_loss=0.0958, over 4259023.11 frames. ], batch size: 441, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:37:33,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-20 11:38:01,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=703164.0, ans=0.04949747468305833 2023-06-20 11:38:09,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=703224.0, ans=0.2 2023-06-20 11:38:35,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=703284.0, ans=0.0 2023-06-20 11:38:49,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=703344.0, ans=0.125 2023-06-20 11:39:12,199 INFO [train.py:996] (1/4) Epoch 4, batch 25750, loss[loss=0.3324, simple_loss=0.4046, pruned_loss=0.1301, over 21683.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3312, pruned_loss=0.09863, over 4265995.61 frames. ], batch size: 351, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:39:35,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.258e+02 3.853e+02 4.721e+02 7.384e+02, threshold=7.705e+02, percent-clipped=0.0 2023-06-20 11:39:48,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=15.0 2023-06-20 11:40:17,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=703584.0, ans=6.0 2023-06-20 11:40:25,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=703584.0, ans=0.125 2023-06-20 11:40:30,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=703584.0, ans=0.125 2023-06-20 11:40:42,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=703644.0, ans=0.0 2023-06-20 11:40:57,297 INFO [train.py:996] (1/4) Epoch 4, batch 25800, loss[loss=0.364, simple_loss=0.4137, pruned_loss=0.1571, over 21546.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3417, pruned_loss=0.1031, over 4267119.06 frames. ], batch size: 414, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:42:39,643 INFO [train.py:996] (1/4) Epoch 4, batch 25850, loss[loss=0.3096, simple_loss=0.3671, pruned_loss=0.126, over 21635.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3435, pruned_loss=0.102, over 4271133.29 frames. ], batch size: 471, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:43:23,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.030e+02 3.694e+02 4.405e+02 6.989e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-20 11:43:42,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=704124.0, ans=0.0 2023-06-20 11:44:02,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=704184.0, ans=0.1 2023-06-20 11:44:16,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-20 11:44:28,574 INFO [train.py:996] (1/4) Epoch 4, batch 25900, loss[loss=0.3527, simple_loss=0.4282, pruned_loss=0.1386, over 21886.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3454, pruned_loss=0.1025, over 4280174.34 frames. ], batch size: 372, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:44:59,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=704364.0, ans=0.125 2023-06-20 11:45:12,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=8.0 2023-06-20 11:45:25,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-20 11:45:46,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-20 11:45:53,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-20 11:46:02,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=704544.0, ans=0.125 2023-06-20 11:46:17,925 INFO [train.py:996] (1/4) Epoch 4, batch 25950, loss[loss=0.2744, simple_loss=0.3464, pruned_loss=0.1012, over 21466.00 frames. 
], tot_loss[loss=0.281, simple_loss=0.3513, pruned_loss=0.1053, over 4277335.15 frames. ], batch size: 211, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:46:39,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-06-20 11:46:52,818 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.258e+02 4.006e+02 4.613e+02 7.769e+02, threshold=8.011e+02, percent-clipped=1.0 2023-06-20 11:48:11,729 INFO [train.py:996] (1/4) Epoch 4, batch 26000, loss[loss=0.2887, simple_loss=0.3544, pruned_loss=0.1115, over 21348.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3521, pruned_loss=0.1042, over 4270489.87 frames. ], batch size: 159, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:48:16,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=704904.0, ans=0.125 2023-06-20 11:49:20,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=705084.0, ans=0.125 2023-06-20 11:49:53,023 INFO [train.py:996] (1/4) Epoch 4, batch 26050, loss[loss=0.2623, simple_loss=0.3259, pruned_loss=0.09939, over 21801.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.352, pruned_loss=0.1049, over 4274235.47 frames. ], batch size: 112, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:49:59,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=705204.0, ans=0.125 2023-06-20 11:49:59,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=705204.0, ans=0.2 2023-06-20 11:50:01,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=705204.0, ans=0.125 2023-06-20 11:50:07,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=705264.0, ans=0.0 2023-06-20 11:50:13,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-20 11:50:13,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.40 vs. limit=22.5 2023-06-20 11:50:17,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.846e+02 3.300e+02 3.976e+02 7.984e+02, threshold=6.600e+02, percent-clipped=0.0 2023-06-20 11:50:32,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=705324.0, ans=0.125 2023-06-20 11:50:40,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=705384.0, ans=0.2 2023-06-20 11:50:45,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=705384.0, ans=0.125 2023-06-20 11:51:26,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=705444.0, ans=0.0 2023-06-20 11:51:35,532 INFO [train.py:996] (1/4) Epoch 4, batch 26100, loss[loss=0.2467, simple_loss=0.3026, pruned_loss=0.0954, over 21627.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3466, pruned_loss=0.1045, over 4281324.25 frames. 
], batch size: 548, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:51:57,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=705564.0, ans=0.125 2023-06-20 11:52:06,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=705564.0, ans=0.125 2023-06-20 11:52:52,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-20 11:53:17,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-20 11:53:19,670 INFO [train.py:996] (1/4) Epoch 4, batch 26150, loss[loss=0.2747, simple_loss=0.3367, pruned_loss=0.1063, over 21805.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3431, pruned_loss=0.1044, over 4279774.36 frames. ], batch size: 247, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:53:27,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-20 11:53:45,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.006e+02 3.444e+02 4.248e+02 6.303e+02, threshold=6.888e+02, percent-clipped=0.0 2023-06-20 11:53:52,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=705924.0, ans=0.0 2023-06-20 11:53:55,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=705924.0, ans=0.125 2023-06-20 11:54:59,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=706044.0, ans=0.125 2023-06-20 11:55:05,064 INFO [train.py:996] (1/4) Epoch 4, batch 26200, loss[loss=0.2653, simple_loss=0.3692, pruned_loss=0.08066, over 21765.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3432, pruned_loss=0.1025, over 4277936.49 frames. ], batch size: 351, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:55:21,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=706164.0, ans=0.125 2023-06-20 11:55:48,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-20 11:56:02,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-20 11:56:32,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=706344.0, ans=0.125 2023-06-20 11:56:45,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=706404.0, ans=0.125 2023-06-20 11:56:46,811 INFO [train.py:996] (1/4) Epoch 4, batch 26250, loss[loss=0.2721, simple_loss=0.3383, pruned_loss=0.103, over 21900.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3487, pruned_loss=0.1018, over 4284317.14 frames. 
], batch size: 351, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:57:12,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.833e+02 3.243e+02 4.065e+02 7.438e+02, threshold=6.486e+02, percent-clipped=1.0 2023-06-20 11:57:39,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=706524.0, ans=0.125 2023-06-20 11:58:28,944 INFO [train.py:996] (1/4) Epoch 4, batch 26300, loss[loss=0.2589, simple_loss=0.3176, pruned_loss=0.1001, over 21924.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3451, pruned_loss=0.1028, over 4292210.96 frames. ], batch size: 351, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:58:48,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.19 vs. limit=6.0 2023-06-20 11:59:03,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=706764.0, ans=0.2 2023-06-20 11:59:55,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.81 vs. limit=10.0 2023-06-20 12:00:14,446 INFO [train.py:996] (1/4) Epoch 4, batch 26350, loss[loss=0.2782, simple_loss=0.34, pruned_loss=0.1082, over 20676.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3433, pruned_loss=0.1038, over 4294528.82 frames. ], batch size: 607, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:00:27,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=707004.0, ans=0.1 2023-06-20 12:00:50,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.098e+02 3.455e+02 4.050e+02 6.767e+02, threshold=6.909e+02, percent-clipped=5.0 2023-06-20 12:00:58,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=707124.0, ans=0.125 2023-06-20 12:01:00,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=707124.0, ans=0.0 2023-06-20 12:01:02,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=707124.0, ans=0.125 2023-06-20 12:01:27,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=707184.0, ans=0.125 2023-06-20 12:01:57,022 INFO [train.py:996] (1/4) Epoch 4, batch 26400, loss[loss=0.2292, simple_loss=0.2811, pruned_loss=0.08866, over 21603.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3368, pruned_loss=0.1032, over 4291220.20 frames. 
], batch size: 231, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:02:13,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=707304.0, ans=0.125 2023-06-20 12:02:20,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=707364.0, ans=0.125 2023-06-20 12:02:38,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=707364.0, ans=0.125 2023-06-20 12:03:22,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=707484.0, ans=0.0 2023-06-20 12:03:49,514 INFO [train.py:996] (1/4) Epoch 4, batch 26450, loss[loss=0.2642, simple_loss=0.3286, pruned_loss=0.09988, over 21269.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3354, pruned_loss=0.102, over 4279340.41 frames. ], batch size: 176, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:04:21,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.976e+02 3.665e+02 4.778e+02 9.045e+02, threshold=7.330e+02, percent-clipped=3.0 2023-06-20 12:04:59,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=707784.0, ans=0.0 2023-06-20 12:05:35,238 INFO [train.py:996] (1/4) Epoch 4, batch 26500, loss[loss=0.2441, simple_loss=0.3201, pruned_loss=0.08404, over 21833.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3372, pruned_loss=0.1009, over 4272951.71 frames. ], batch size: 316, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:05:35,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=707904.0, ans=0.2 2023-06-20 12:05:56,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=707964.0, ans=0.0 2023-06-20 12:06:04,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=707964.0, ans=0.125 2023-06-20 12:07:07,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=708144.0, ans=0.125 2023-06-20 12:07:31,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=708204.0, ans=0.0 2023-06-20 12:07:32,507 INFO [train.py:996] (1/4) Epoch 4, batch 26550, loss[loss=0.2323, simple_loss=0.3442, pruned_loss=0.06027, over 20728.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3338, pruned_loss=0.09715, over 4265049.71 frames. 
], batch size: 608, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:07:32,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=708204.0, ans=0.125 2023-06-20 12:07:47,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=708264.0, ans=0.2 2023-06-20 12:08:00,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.832e+02 3.306e+02 3.943e+02 6.835e+02, threshold=6.613e+02, percent-clipped=0.0 2023-06-20 12:08:45,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=708384.0, ans=10.0 2023-06-20 12:09:08,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=708444.0, ans=0.04949747468305833 2023-06-20 12:09:16,644 INFO [train.py:996] (1/4) Epoch 4, batch 26600, loss[loss=0.2172, simple_loss=0.2822, pruned_loss=0.07607, over 21499.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3346, pruned_loss=0.09469, over 4254106.34 frames. ], batch size: 230, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:09:27,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=708504.0, ans=0.125 2023-06-20 12:09:36,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-20 12:09:37,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708564.0, ans=0.1 2023-06-20 12:10:41,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=708744.0, ans=0.125 2023-06-20 12:10:54,693 INFO [train.py:996] (1/4) Epoch 4, batch 26650, loss[loss=0.1861, simple_loss=0.2748, pruned_loss=0.04868, over 21644.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3281, pruned_loss=0.09378, over 4253868.57 frames. ], batch size: 391, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:11:11,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708864.0, ans=0.1 2023-06-20 12:11:27,249 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.914e+02 3.392e+02 4.054e+02 7.182e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 12:11:34,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-20 12:11:40,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=708924.0, ans=0.2 2023-06-20 12:12:00,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708984.0, ans=0.1 2023-06-20 12:12:26,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-20 12:12:29,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=709044.0, ans=0.2 2023-06-20 12:12:32,344 INFO [train.py:996] (1/4) Epoch 4, batch 26700, loss[loss=0.2645, simple_loss=0.3235, pruned_loss=0.1028, over 21528.00 frames. 
], tot_loss[loss=0.2497, simple_loss=0.32, pruned_loss=0.08968, over 4259379.76 frames. ], batch size: 131, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:12:44,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=12.0 2023-06-20 12:13:11,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=709224.0, ans=0.0 2023-06-20 12:13:16,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=709224.0, ans=0.09899494936611666 2023-06-20 12:13:33,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=709284.0, ans=0.2 2023-06-20 12:14:11,928 INFO [train.py:996] (1/4) Epoch 4, batch 26750, loss[loss=0.235, simple_loss=0.3128, pruned_loss=0.07858, over 21849.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3188, pruned_loss=0.08795, over 4271744.63 frames. ], batch size: 247, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:14:50,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.700e+02 3.315e+02 4.013e+02 5.519e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-20 12:14:53,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=709464.0, ans=0.0 2023-06-20 12:15:04,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=709524.0, ans=0.0 2023-06-20 12:15:56,459 INFO [train.py:996] (1/4) Epoch 4, batch 26800, loss[loss=0.337, simple_loss=0.4009, pruned_loss=0.1365, over 21859.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3273, pruned_loss=0.09291, over 4272580.37 frames. ], batch size: 118, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:16:05,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=709704.0, ans=0.125 2023-06-20 12:17:18,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=709884.0, ans=0.125 2023-06-20 12:17:29,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=709944.0, ans=0.1 2023-06-20 12:17:32,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=709944.0, ans=0.0 2023-06-20 12:17:43,540 INFO [train.py:996] (1/4) Epoch 4, batch 26850, loss[loss=0.2343, simple_loss=0.2936, pruned_loss=0.0875, over 21377.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3299, pruned_loss=0.09612, over 4268087.08 frames. 
], batch size: 131, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:17:43,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=710004.0, ans=0.125 2023-06-20 12:17:58,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=710064.0, ans=0.0 2023-06-20 12:18:21,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 2.953e+02 3.297e+02 3.985e+02 6.841e+02, threshold=6.593e+02, percent-clipped=1.0 2023-06-20 12:18:36,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=710124.0, ans=0.0 2023-06-20 12:18:45,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=710124.0, ans=0.09899494936611666 2023-06-20 12:18:55,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=710184.0, ans=0.0 2023-06-20 12:18:56,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=710184.0, ans=0.0 2023-06-20 12:19:20,894 INFO [train.py:996] (1/4) Epoch 4, batch 26900, loss[loss=0.2172, simple_loss=0.2718, pruned_loss=0.08133, over 21410.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3212, pruned_loss=0.09601, over 4265817.78 frames. ], batch size: 212, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:19:29,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=710304.0, ans=0.125 2023-06-20 12:19:31,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=710304.0, ans=15.0 2023-06-20 12:19:44,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-20 12:21:00,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=710604.0, ans=0.0 2023-06-20 12:21:01,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-20 12:21:01,660 INFO [train.py:996] (1/4) Epoch 4, batch 26950, loss[loss=0.2573, simple_loss=0.3502, pruned_loss=0.08222, over 21731.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3223, pruned_loss=0.09605, over 4261814.86 frames. 
], batch size: 332, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:21:20,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=710604.0, ans=0.125 2023-06-20 12:21:36,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=710664.0, ans=0.0 2023-06-20 12:21:38,727 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.998e+02 3.332e+02 4.075e+02 6.086e+02, threshold=6.663e+02, percent-clipped=0.0 2023-06-20 12:21:42,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=710724.0, ans=0.2 2023-06-20 12:21:54,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=710724.0, ans=0.0 2023-06-20 12:22:18,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=710784.0, ans=0.0 2023-06-20 12:22:45,190 INFO [train.py:996] (1/4) Epoch 4, batch 27000, loss[loss=0.2202, simple_loss=0.2943, pruned_loss=0.0731, over 21177.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3208, pruned_loss=0.0932, over 4258959.68 frames. ], batch size: 176, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:22:45,190 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 12:23:07,055 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2473, simple_loss=0.3466, pruned_loss=0.07399, over 1796401.00 frames. 2023-06-20 12:23:07,055 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 12:23:16,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=710904.0, ans=0.125 2023-06-20 12:23:26,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=710904.0, ans=0.05 2023-06-20 12:23:36,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=710964.0, ans=0.0 2023-06-20 12:23:55,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=711024.0, ans=0.125 2023-06-20 12:24:27,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=711144.0, ans=0.125 2023-06-20 12:24:50,422 INFO [train.py:996] (1/4) Epoch 4, batch 27050, loss[loss=0.2275, simple_loss=0.3253, pruned_loss=0.06485, over 21707.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3222, pruned_loss=0.08981, over 4265148.34 frames. ], batch size: 298, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:25:23,362 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.528e+02 2.843e+02 3.432e+02 6.081e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-20 12:25:28,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.14 vs. 
limit=15.0 2023-06-20 12:26:00,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711384.0, ans=0.1 2023-06-20 12:26:12,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=711444.0, ans=0.1 2023-06-20 12:26:28,419 INFO [train.py:996] (1/4) Epoch 4, batch 27100, loss[loss=0.2343, simple_loss=0.3073, pruned_loss=0.0807, over 21491.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3242, pruned_loss=0.09139, over 4276036.10 frames. ], batch size: 211, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:26:38,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-20 12:26:49,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=711564.0, ans=0.125 2023-06-20 12:27:17,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=711624.0, ans=0.125 2023-06-20 12:27:26,764 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:27:38,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=711684.0, ans=0.0 2023-06-20 12:28:07,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=711744.0, ans=0.2 2023-06-20 12:28:19,228 INFO [train.py:996] (1/4) Epoch 4, batch 27150, loss[loss=0.267, simple_loss=0.3578, pruned_loss=0.0881, over 21794.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.335, pruned_loss=0.09462, over 4271186.29 frames. ], batch size: 282, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:28:23,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-20 12:28:39,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=711864.0, ans=0.5 2023-06-20 12:28:43,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-06-20 12:28:47,386 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.195e+02 3.755e+02 4.562e+02 7.359e+02, threshold=7.509e+02, percent-clipped=7.0 2023-06-20 12:29:57,464 INFO [train.py:996] (1/4) Epoch 4, batch 27200, loss[loss=0.2867, simple_loss=0.3528, pruned_loss=0.1103, over 21612.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3449, pruned_loss=0.09881, over 4273881.45 frames. ], batch size: 230, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:30:01,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=712104.0, ans=0.125 2023-06-20 12:31:42,695 INFO [train.py:996] (1/4) Epoch 4, batch 27250, loss[loss=0.2779, simple_loss=0.3396, pruned_loss=0.1081, over 21791.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3499, pruned_loss=0.1041, over 4276489.15 frames. 
], batch size: 247, lr: 7.56e-03, grad_scale: 16.0 2023-06-20 12:32:18,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.455e+02 4.039e+02 4.883e+02 8.665e+02, threshold=8.078e+02, percent-clipped=1.0 2023-06-20 12:32:54,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=712584.0, ans=0.125 2023-06-20 12:32:59,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=712584.0, ans=0.1 2023-06-20 12:33:33,309 INFO [train.py:996] (1/4) Epoch 4, batch 27300, loss[loss=0.3048, simple_loss=0.3848, pruned_loss=0.1125, over 21360.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3516, pruned_loss=0.1048, over 4283514.94 frames. ], batch size: 549, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:33:37,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=712704.0, ans=0.0 2023-06-20 12:35:08,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=712944.0, ans=0.125 2023-06-20 12:35:15,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=713004.0, ans=0.025 2023-06-20 12:35:16,736 INFO [train.py:996] (1/4) Epoch 4, batch 27350, loss[loss=0.275, simple_loss=0.3402, pruned_loss=0.1049, over 21454.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3541, pruned_loss=0.1059, over 4274798.77 frames. ], batch size: 211, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:36:00,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.787e+02 3.114e+02 3.820e+02 5.936e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 12:36:08,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=713124.0, ans=0.2 2023-06-20 12:36:48,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-20 12:36:58,512 INFO [train.py:996] (1/4) Epoch 4, batch 27400, loss[loss=0.2276, simple_loss=0.2782, pruned_loss=0.08849, over 21443.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3491, pruned_loss=0.1051, over 4282109.35 frames. ], batch size: 212, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:37:02,237 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:38:41,420 INFO [train.py:996] (1/4) Epoch 4, batch 27450, loss[loss=0.2887, simple_loss=0.3592, pruned_loss=0.1091, over 21286.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3416, pruned_loss=0.1022, over 4278716.60 frames. 
], batch size: 176, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:39:26,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.637e+02 2.964e+02 3.334e+02 5.036e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-20 12:39:38,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=713724.0, ans=0.0 2023-06-20 12:39:43,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=713724.0, ans=0.025 2023-06-20 12:39:43,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-20 12:39:47,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=713784.0, ans=0.125 2023-06-20 12:40:23,674 INFO [train.py:996] (1/4) Epoch 4, batch 27500, loss[loss=0.2468, simple_loss=0.3131, pruned_loss=0.09028, over 21892.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3394, pruned_loss=0.1022, over 4283115.83 frames. ], batch size: 371, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:40:56,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=713964.0, ans=15.0 2023-06-20 12:41:40,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2023-06-20 12:41:41,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=714084.0, ans=0.125 2023-06-20 12:42:02,872 INFO [train.py:996] (1/4) Epoch 4, batch 27550, loss[loss=0.2527, simple_loss=0.3097, pruned_loss=0.09787, over 21565.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3337, pruned_loss=0.09906, over 4290096.45 frames. ], batch size: 414, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:42:24,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-20 12:42:33,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.57 vs. limit=10.0 2023-06-20 12:42:46,485 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:42:49,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.202e+02 3.879e+02 5.081e+02 9.458e+02, threshold=7.759e+02, percent-clipped=14.0 2023-06-20 12:43:28,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=714444.0, ans=0.125 2023-06-20 12:43:39,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=714444.0, ans=0.125 2023-06-20 12:43:50,122 INFO [train.py:996] (1/4) Epoch 4, batch 27600, loss[loss=0.273, simple_loss=0.3208, pruned_loss=0.1126, over 22016.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3269, pruned_loss=0.09786, over 4289970.93 frames. 
], batch size: 103, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:44:51,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=714684.0, ans=0.125 2023-06-20 12:45:03,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-20 12:45:26,202 INFO [train.py:996] (1/4) Epoch 4, batch 27650, loss[loss=0.2708, simple_loss=0.3161, pruned_loss=0.1128, over 21613.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3213, pruned_loss=0.09706, over 4281789.89 frames. ], batch size: 508, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:45:26,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=714804.0, ans=0.125 2023-06-20 12:45:45,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=714864.0, ans=0.125 2023-06-20 12:45:59,229 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:46:05,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.761e+02 3.105e+02 3.536e+02 5.675e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 12:46:25,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=714984.0, ans=0.0 2023-06-20 12:47:02,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=715104.0, ans=0.0 2023-06-20 12:47:03,665 INFO [train.py:996] (1/4) Epoch 4, batch 27700, loss[loss=0.2506, simple_loss=0.3276, pruned_loss=0.08677, over 21662.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3207, pruned_loss=0.09488, over 4289759.52 frames. ], batch size: 389, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:47:25,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=715104.0, ans=0.0 2023-06-20 12:47:37,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=715164.0, ans=0.125 2023-06-20 12:48:01,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=715224.0, ans=0.125 2023-06-20 12:48:51,165 INFO [train.py:996] (1/4) Epoch 4, batch 27750, loss[loss=0.2508, simple_loss=0.3248, pruned_loss=0.08837, over 21268.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3229, pruned_loss=0.09315, over 4283954.00 frames. ], batch size: 176, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:48:56,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=715404.0, ans=0.0 2023-06-20 12:49:33,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.026e+02 3.500e+02 4.221e+02 6.656e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-20 12:49:59,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-20 12:50:08,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=715584.0, ans=0.125 2023-06-20 12:50:23,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=715644.0, ans=0.2 2023-06-20 12:50:28,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-20 12:50:29,327 INFO [train.py:996] (1/4) Epoch 4, batch 27800, loss[loss=0.2555, simple_loss=0.3133, pruned_loss=0.09881, over 21629.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3223, pruned_loss=0.09396, over 4282423.30 frames. ], batch size: 195, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:51:45,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=715884.0, ans=0.125 2023-06-20 12:52:11,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=715944.0, ans=0.1 2023-06-20 12:52:11,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=715944.0, ans=0.0 2023-06-20 12:52:17,415 INFO [train.py:996] (1/4) Epoch 4, batch 27850, loss[loss=0.2697, simple_loss=0.3328, pruned_loss=0.1032, over 21102.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3237, pruned_loss=0.09635, over 4288643.48 frames. ], batch size: 607, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:52:38,522 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:52:38,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716004.0, ans=0.1 2023-06-20 12:52:40,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=716004.0, ans=0.2 2023-06-20 12:53:00,721 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.961e+02 3.530e+02 4.217e+02 1.068e+03, threshold=7.060e+02, percent-clipped=1.0 2023-06-20 12:53:06,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=716124.0, ans=0.025 2023-06-20 12:54:13,187 INFO [train.py:996] (1/4) Epoch 4, batch 27900, loss[loss=0.2915, simple_loss=0.3759, pruned_loss=0.1035, over 21612.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3351, pruned_loss=0.0985, over 4292551.39 frames. ], batch size: 441, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:54:56,704 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:55:56,702 INFO [train.py:996] (1/4) Epoch 4, batch 27950, loss[loss=0.2292, simple_loss=0.3061, pruned_loss=0.07609, over 21295.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3331, pruned_loss=0.09432, over 4293950.46 frames. 
], batch size: 176, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 12:56:32,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=716664.0, ans=0.0 2023-06-20 12:56:33,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.859e+02 3.547e+02 4.362e+02 7.820e+02, threshold=7.095e+02, percent-clipped=2.0 2023-06-20 12:57:21,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=716844.0, ans=0.0 2023-06-20 12:57:23,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=716844.0, ans=0.125 2023-06-20 12:57:34,405 INFO [train.py:996] (1/4) Epoch 4, batch 28000, loss[loss=0.2192, simple_loss=0.2836, pruned_loss=0.07741, over 21668.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3297, pruned_loss=0.0912, over 4295899.67 frames. ], batch size: 230, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:57:36,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=716904.0, ans=0.125 2023-06-20 12:58:16,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-20 12:58:49,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=717084.0, ans=0.125 2023-06-20 12:58:52,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=717084.0, ans=0.125 2023-06-20 12:59:06,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=717144.0, ans=0.125 2023-06-20 12:59:16,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=717204.0, ans=0.125 2023-06-20 12:59:17,598 INFO [train.py:996] (1/4) Epoch 4, batch 28050, loss[loss=0.2218, simple_loss=0.2902, pruned_loss=0.07671, over 21762.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.328, pruned_loss=0.09254, over 4294957.13 frames. ], batch size: 298, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:59:33,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=717204.0, ans=0.09899494936611666 2023-06-20 12:59:54,017 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.872e+02 3.334e+02 4.118e+02 8.421e+02, threshold=6.667e+02, percent-clipped=2.0 2023-06-20 12:59:59,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-06-20 13:00:34,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=717384.0, ans=0.125 2023-06-20 13:00:37,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=717384.0, ans=0.125 2023-06-20 13:00:48,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=717444.0, ans=0.125 2023-06-20 13:01:00,656 INFO [train.py:996] (1/4) Epoch 4, batch 28100, loss[loss=0.2435, simple_loss=0.3004, pruned_loss=0.09326, over 21603.00 frames. 
], tot_loss[loss=0.2548, simple_loss=0.3247, pruned_loss=0.09247, over 4292254.21 frames. ], batch size: 415, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 13:01:27,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-20 13:01:36,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=717624.0, ans=0.125 2023-06-20 13:02:41,104 INFO [train.py:996] (1/4) Epoch 4, batch 28150, loss[loss=0.2793, simple_loss=0.3258, pruned_loss=0.1164, over 22002.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3194, pruned_loss=0.09283, over 4286083.92 frames. ], batch size: 103, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:02:43,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=717804.0, ans=0.0 2023-06-20 13:02:56,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=717804.0, ans=0.035 2023-06-20 13:03:00,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=717864.0, ans=0.125 2023-06-20 13:03:02,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=717864.0, ans=0.125 2023-06-20 13:03:17,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=717864.0, ans=0.125 2023-06-20 13:03:23,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.192e+02 4.033e+02 4.957e+02 1.192e+03, threshold=8.065e+02, percent-clipped=8.0 2023-06-20 13:03:35,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=717924.0, ans=0.1 2023-06-20 13:03:58,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=717984.0, ans=0.125 2023-06-20 13:04:00,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=717984.0, ans=0.2 2023-06-20 13:04:08,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-20 13:04:09,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=718044.0, ans=0.0 2023-06-20 13:04:09,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718044.0, ans=0.1 2023-06-20 13:04:27,741 INFO [train.py:996] (1/4) Epoch 4, batch 28200, loss[loss=0.277, simple_loss=0.3384, pruned_loss=0.1078, over 20670.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.318, pruned_loss=0.09464, over 4281957.65 frames. ], batch size: 607, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:04:36,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=718104.0, ans=0.125 2023-06-20 13:06:10,225 INFO [train.py:996] (1/4) Epoch 4, batch 28250, loss[loss=0.2739, simple_loss=0.3233, pruned_loss=0.1122, over 21654.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.32, pruned_loss=0.09695, over 4278688.99 frames. 
], batch size: 298, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:06:22,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=718404.0, ans=0.0 2023-06-20 13:06:38,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.85 vs. limit=6.0 2023-06-20 13:06:53,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.131e+02 3.683e+02 4.386e+02 7.452e+02, threshold=7.367e+02, percent-clipped=0.0 2023-06-20 13:07:17,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=718584.0, ans=0.035 2023-06-20 13:07:19,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=718584.0, ans=0.125 2023-06-20 13:07:32,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718644.0, ans=0.1 2023-06-20 13:07:54,528 INFO [train.py:996] (1/4) Epoch 4, batch 28300, loss[loss=0.2176, simple_loss=0.2999, pruned_loss=0.06768, over 21351.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3189, pruned_loss=0.09486, over 4270155.52 frames. ], batch size: 211, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:07:56,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=718704.0, ans=0.1 2023-06-20 13:08:22,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=718764.0, ans=0.2 2023-06-20 13:08:41,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=718824.0, ans=0.125 2023-06-20 13:08:41,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-20 13:08:56,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=718824.0, ans=0.125 2023-06-20 13:09:34,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=718944.0, ans=15.0 2023-06-20 13:09:43,122 INFO [train.py:996] (1/4) Epoch 4, batch 28350, loss[loss=0.2251, simple_loss=0.2964, pruned_loss=0.07694, over 21795.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.315, pruned_loss=0.08822, over 4266662.22 frames. ], batch size: 351, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:10:00,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=719004.0, ans=0.2 2023-06-20 13:10:03,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=719064.0, ans=0.125 2023-06-20 13:10:26,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.689e+02 3.125e+02 3.923e+02 6.563e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-20 13:10:40,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719184.0, ans=0.1 2023-06-20 13:11:22,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. 
limit=12.0 2023-06-20 13:11:30,282 INFO [train.py:996] (1/4) Epoch 4, batch 28400, loss[loss=0.237, simple_loss=0.2977, pruned_loss=0.08816, over 16174.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3116, pruned_loss=0.08858, over 4267780.11 frames. ], batch size: 64, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:11:32,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=719304.0, ans=0.125 2023-06-20 13:11:33,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-20 13:11:49,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=719304.0, ans=0.125 2023-06-20 13:11:53,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.90 vs. limit=15.0 2023-06-20 13:12:38,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=719484.0, ans=0.0 2023-06-20 13:12:45,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=719544.0, ans=0.125 2023-06-20 13:13:07,848 INFO [train.py:996] (1/4) Epoch 4, batch 28450, loss[loss=0.2656, simple_loss=0.3301, pruned_loss=0.1005, over 21707.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3183, pruned_loss=0.09284, over 4252863.57 frames. ], batch size: 389, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:13:20,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=719604.0, ans=0.0 2023-06-20 13:13:39,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=719664.0, ans=0.2 2023-06-20 13:13:50,962 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.987e+02 3.410e+02 4.090e+02 6.526e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-20 13:14:13,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=719784.0, ans=0.09899494936611666 2023-06-20 13:14:29,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-20 13:14:30,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719844.0, ans=0.1 2023-06-20 13:14:50,124 INFO [train.py:996] (1/4) Epoch 4, batch 28500, loss[loss=0.2636, simple_loss=0.3286, pruned_loss=0.09931, over 20852.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3207, pruned_loss=0.09555, over 4264417.03 frames. ], batch size: 607, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:14:50,590 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:15:12,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=719964.0, ans=0.2 2023-06-20 13:16:07,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=720084.0, ans=10.0 2023-06-20 13:16:41,750 INFO [train.py:996] (1/4) Epoch 4, batch 28550, loss[loss=0.3501, simple_loss=0.4216, pruned_loss=0.1393, over 21504.00 frames. 
], tot_loss[loss=0.2632, simple_loss=0.3293, pruned_loss=0.09852, over 4267701.88 frames. ], batch size: 471, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:17:05,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=720264.0, ans=0.2 2023-06-20 13:17:15,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=720264.0, ans=0.0 2023-06-20 13:17:15,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=720264.0, ans=0.0 2023-06-20 13:17:20,269 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.197e+02 3.692e+02 4.478e+02 6.914e+02, threshold=7.384e+02, percent-clipped=1.0 2023-06-20 13:18:25,238 INFO [train.py:996] (1/4) Epoch 4, batch 28600, loss[loss=0.2416, simple_loss=0.309, pruned_loss=0.08706, over 20705.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3359, pruned_loss=0.1011, over 4272383.82 frames. ], batch size: 607, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:18:54,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=720564.0, ans=0.0 2023-06-20 13:19:33,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=720684.0, ans=0.0 2023-06-20 13:19:42,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=720684.0, ans=0.125 2023-06-20 13:19:54,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=720744.0, ans=0.125 2023-06-20 13:20:07,375 INFO [train.py:996] (1/4) Epoch 4, batch 28650, loss[loss=0.2365, simple_loss=0.2967, pruned_loss=0.08819, over 21845.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3309, pruned_loss=0.1011, over 4276874.41 frames. ], batch size: 98, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:20:46,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.893e+02 3.240e+02 3.665e+02 6.143e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-20 13:21:41,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=721044.0, ans=0.0 2023-06-20 13:21:41,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=721044.0, ans=0.125 2023-06-20 13:21:51,585 INFO [train.py:996] (1/4) Epoch 4, batch 28700, loss[loss=0.2814, simple_loss=0.3467, pruned_loss=0.1081, over 21624.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3296, pruned_loss=0.1023, over 4278906.54 frames. 
], batch size: 389, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:22:42,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=721224.0, ans=0.0 2023-06-20 13:22:46,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=721224.0, ans=0.0 2023-06-20 13:22:48,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721224.0, ans=0.1 2023-06-20 13:22:53,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=721284.0, ans=0.125 2023-06-20 13:23:12,765 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:23:20,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=721344.0, ans=0.2 2023-06-20 13:23:32,187 INFO [train.py:996] (1/4) Epoch 4, batch 28750, loss[loss=0.226, simple_loss=0.2906, pruned_loss=0.08066, over 21146.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3301, pruned_loss=0.1029, over 4289611.14 frames. ], batch size: 607, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:23:35,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=721404.0, ans=0.0 2023-06-20 13:24:15,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.193e+02 3.841e+02 4.687e+02 9.363e+02, threshold=7.682e+02, percent-clipped=10.0 2023-06-20 13:24:43,110 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:24:52,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=721644.0, ans=0.125 2023-06-20 13:25:13,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=721704.0, ans=0.0 2023-06-20 13:25:15,117 INFO [train.py:996] (1/4) Epoch 4, batch 28800, loss[loss=0.2812, simple_loss=0.344, pruned_loss=0.1093, over 21728.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3354, pruned_loss=0.104, over 4292901.59 frames. ], batch size: 298, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:25:21,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=721704.0, ans=0.035 2023-06-20 13:25:26,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=721704.0, ans=0.0 2023-06-20 13:26:11,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=721824.0, ans=0.0 2023-06-20 13:26:55,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=721944.0, ans=0.0 2023-06-20 13:26:58,344 INFO [train.py:996] (1/4) Epoch 4, batch 28850, loss[loss=0.3259, simple_loss=0.3737, pruned_loss=0.1391, over 21742.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3379, pruned_loss=0.106, over 4294201.24 frames. 
], batch size: 414, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:27:36,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.137e+02 3.550e+02 4.295e+02 6.856e+02, threshold=7.100e+02, percent-clipped=0.0 2023-06-20 13:27:52,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=722124.0, ans=0.2 2023-06-20 13:27:52,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=722124.0, ans=0.0 2023-06-20 13:27:55,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=722124.0, ans=0.125 2023-06-20 13:28:02,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=722184.0, ans=0.125 2023-06-20 13:28:05,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=722184.0, ans=0.125 2023-06-20 13:28:20,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=722244.0, ans=0.1 2023-06-20 13:28:42,245 INFO [train.py:996] (1/4) Epoch 4, batch 28900, loss[loss=0.2826, simple_loss=0.3489, pruned_loss=0.1081, over 21751.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3401, pruned_loss=0.1074, over 4289578.59 frames. ], batch size: 351, lr: 7.50e-03, grad_scale: 32.0 2023-06-20 13:29:59,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=722484.0, ans=0.2 2023-06-20 13:30:28,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=722544.0, ans=0.125 2023-06-20 13:30:30,820 INFO [train.py:996] (1/4) Epoch 4, batch 28950, loss[loss=0.3376, simple_loss=0.447, pruned_loss=0.1141, over 19800.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3423, pruned_loss=0.1064, over 4287795.89 frames. ], batch size: 703, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:30:49,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=722664.0, ans=0.0 2023-06-20 13:31:01,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=722664.0, ans=0.0 2023-06-20 13:31:06,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-20 13:31:11,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.092e+02 3.646e+02 4.374e+02 7.156e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-20 13:31:47,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=722784.0, ans=0.125 2023-06-20 13:32:10,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=722844.0, ans=0.0 2023-06-20 13:32:13,697 INFO [train.py:996] (1/4) Epoch 4, batch 29000, loss[loss=0.2708, simple_loss=0.3304, pruned_loss=0.1056, over 21435.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3467, pruned_loss=0.1054, over 4284179.09 frames. 
], batch size: 211, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:32:19,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=722904.0, ans=0.0 2023-06-20 13:32:19,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=722904.0, ans=0.125 2023-06-20 13:33:56,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 13:33:56,572 INFO [train.py:996] (1/4) Epoch 4, batch 29050, loss[loss=0.2571, simple_loss=0.312, pruned_loss=0.1011, over 21429.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3448, pruned_loss=0.1065, over 4286329.49 frames. ], batch size: 211, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:34:23,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=723264.0, ans=0.125 2023-06-20 13:34:40,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.777e+02 3.114e+02 3.614e+02 7.723e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 13:34:46,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=723324.0, ans=0.05 2023-06-20 13:35:17,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-20 13:35:37,367 INFO [train.py:996] (1/4) Epoch 4, batch 29100, loss[loss=0.1966, simple_loss=0.258, pruned_loss=0.06755, over 21594.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3356, pruned_loss=0.1032, over 4271487.31 frames. ], batch size: 231, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:35:56,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=723504.0, ans=0.2 2023-06-20 13:36:10,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-20 13:36:54,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-20 13:36:59,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=723684.0, ans=0.04949747468305833 2023-06-20 13:37:10,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-20 13:37:19,147 INFO [train.py:996] (1/4) Epoch 4, batch 29150, loss[loss=0.2748, simple_loss=0.3615, pruned_loss=0.09404, over 21622.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3336, pruned_loss=0.1008, over 4270303.05 frames. 
], batch size: 414, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:37:39,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=723864.0, ans=0.1 2023-06-20 13:38:03,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.983e+02 3.477e+02 4.696e+02 8.127e+02, threshold=6.954e+02, percent-clipped=11.0 2023-06-20 13:38:08,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=723924.0, ans=0.0 2023-06-20 13:38:52,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=724044.0, ans=0.125 2023-06-20 13:39:00,565 INFO [train.py:996] (1/4) Epoch 4, batch 29200, loss[loss=0.2578, simple_loss=0.317, pruned_loss=0.09927, over 21800.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3282, pruned_loss=0.09928, over 4269854.46 frames. ], batch size: 352, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:39:15,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-20 13:39:32,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=724164.0, ans=0.015 2023-06-20 13:39:33,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724164.0, ans=0.1 2023-06-20 13:40:30,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=724344.0, ans=0.2 2023-06-20 13:40:48,561 INFO [train.py:996] (1/4) Epoch 4, batch 29250, loss[loss=0.2788, simple_loss=0.3664, pruned_loss=0.09564, over 21291.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3274, pruned_loss=0.09754, over 4268145.29 frames. ], batch size: 548, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:41:18,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=724464.0, ans=0.0 2023-06-20 13:41:32,102 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.828e+02 3.224e+02 4.215e+02 8.591e+02, threshold=6.449e+02, percent-clipped=2.0 2023-06-20 13:42:29,501 INFO [train.py:996] (1/4) Epoch 4, batch 29300, loss[loss=0.211, simple_loss=0.269, pruned_loss=0.07644, over 21314.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3265, pruned_loss=0.09569, over 4274061.69 frames. ], batch size: 194, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:42:31,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724704.0, ans=0.1 2023-06-20 13:42:42,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=724704.0, ans=0.0 2023-06-20 13:42:44,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=12.0 2023-06-20 13:42:56,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.73 vs. 
limit=5.0 2023-06-20 13:43:02,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=724764.0, ans=0.2 2023-06-20 13:43:11,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-20 13:43:31,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-20 13:44:05,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=8.0 2023-06-20 13:44:09,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=724944.0, ans=0.125 2023-06-20 13:44:16,736 INFO [train.py:996] (1/4) Epoch 4, batch 29350, loss[loss=0.2521, simple_loss=0.3101, pruned_loss=0.09708, over 21799.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3224, pruned_loss=0.09446, over 4273358.85 frames. ], batch size: 102, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:44:45,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-20 13:45:02,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.825e+02 3.145e+02 3.740e+02 7.269e+02, threshold=6.289e+02, percent-clipped=1.0 2023-06-20 13:45:09,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=725124.0, ans=0.125 2023-06-20 13:45:42,878 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:45:52,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725244.0, ans=0.1 2023-06-20 13:46:00,210 INFO [train.py:996] (1/4) Epoch 4, batch 29400, loss[loss=0.2612, simple_loss=0.3349, pruned_loss=0.0937, over 21601.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.323, pruned_loss=0.09178, over 4270228.04 frames. ], batch size: 442, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:46:00,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=725304.0, ans=0.125 2023-06-20 13:46:37,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=725364.0, ans=0.035 2023-06-20 13:47:01,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=725484.0, ans=0.125 2023-06-20 13:47:33,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=725544.0, ans=0.125 2023-06-20 13:47:42,420 INFO [train.py:996] (1/4) Epoch 4, batch 29450, loss[loss=0.1812, simple_loss=0.2422, pruned_loss=0.06011, over 16602.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.321, pruned_loss=0.09087, over 4254824.60 frames. 
], batch size: 61, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:48:19,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=725664.0, ans=0.125 2023-06-20 13:48:28,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 3.113e+02 3.518e+02 4.347e+02 7.926e+02, threshold=7.036e+02, percent-clipped=6.0 2023-06-20 13:49:24,084 INFO [train.py:996] (1/4) Epoch 4, batch 29500, loss[loss=0.2693, simple_loss=0.3226, pruned_loss=0.108, over 21231.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3273, pruned_loss=0.09545, over 4258514.36 frames. ], batch size: 159, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:49:49,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=725964.0, ans=0.0 2023-06-20 13:50:03,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=725964.0, ans=0.0 2023-06-20 13:50:54,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=15.0 2023-06-20 13:51:04,346 INFO [train.py:996] (1/4) Epoch 4, batch 29550, loss[loss=0.2753, simple_loss=0.3389, pruned_loss=0.1059, over 21926.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3262, pruned_loss=0.09769, over 4273031.87 frames. ], batch size: 333, lr: 7.48e-03, grad_scale: 16.0 2023-06-20 13:51:24,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=726264.0, ans=0.1 2023-06-20 13:51:50,784 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 2.813e+02 3.221e+02 3.823e+02 6.296e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 13:51:59,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=726324.0, ans=0.125 2023-06-20 13:52:17,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=726384.0, ans=10.0 2023-06-20 13:52:36,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726444.0, ans=0.1 2023-06-20 13:52:51,266 INFO [train.py:996] (1/4) Epoch 4, batch 29600, loss[loss=0.2448, simple_loss=0.3229, pruned_loss=0.08338, over 21106.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3305, pruned_loss=0.09945, over 4278079.08 frames. ], batch size: 608, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:52:51,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=726504.0, ans=0.0 2023-06-20 13:53:09,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=726504.0, ans=0.125 2023-06-20 13:53:24,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=726564.0, ans=0.125 2023-06-20 13:54:13,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=726744.0, ans=0.04949747468305833 2023-06-20 13:54:23,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. 
limit=15.0 2023-06-20 13:54:33,044 INFO [train.py:996] (1/4) Epoch 4, batch 29650, loss[loss=0.172, simple_loss=0.2373, pruned_loss=0.05341, over 17398.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3268, pruned_loss=0.09523, over 4273235.28 frames. ], batch size: 60, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:54:36,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726804.0, ans=0.1 2023-06-20 13:55:04,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=726864.0, ans=0.2 2023-06-20 13:55:13,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.008e+02 3.849e+02 5.213e+02 1.335e+03, threshold=7.697e+02, percent-clipped=14.0 2023-06-20 13:55:17,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=726924.0, ans=0.125 2023-06-20 13:55:35,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=726984.0, ans=0.125 2023-06-20 13:56:10,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=727044.0, ans=0.125 2023-06-20 13:56:14,508 INFO [train.py:996] (1/4) Epoch 4, batch 29700, loss[loss=0.2658, simple_loss=0.3221, pruned_loss=0.1047, over 21170.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3271, pruned_loss=0.09473, over 4280937.04 frames. ], batch size: 608, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:56:34,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=727164.0, ans=0.0 2023-06-20 13:56:42,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=727164.0, ans=0.07 2023-06-20 13:56:42,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=727164.0, ans=0.0 2023-06-20 13:56:58,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=727224.0, ans=0.2 2023-06-20 13:57:29,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-20 13:57:42,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=727344.0, ans=0.125 2023-06-20 13:57:43,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=727344.0, ans=0.0 2023-06-20 13:57:44,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=12.0 2023-06-20 13:57:56,340 INFO [train.py:996] (1/4) Epoch 4, batch 29750, loss[loss=0.2567, simple_loss=0.3393, pruned_loss=0.08708, over 21385.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3311, pruned_loss=0.09429, over 4279349.69 frames. 
], batch size: 176, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:58:06,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=727404.0, ans=0.2 2023-06-20 13:58:11,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=727404.0, ans=0.125 2023-06-20 13:58:36,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.793e+02 3.232e+02 3.849e+02 7.208e+02, threshold=6.464e+02, percent-clipped=0.0 2023-06-20 13:59:36,823 INFO [train.py:996] (1/4) Epoch 4, batch 29800, loss[loss=0.2549, simple_loss=0.324, pruned_loss=0.09293, over 21912.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3355, pruned_loss=0.09671, over 4280265.46 frames. ], batch size: 316, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 14:00:15,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=727824.0, ans=0.125 2023-06-20 14:00:39,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=727824.0, ans=0.5 2023-06-20 14:01:23,935 INFO [train.py:996] (1/4) Epoch 4, batch 29850, loss[loss=0.2412, simple_loss=0.309, pruned_loss=0.08673, over 21692.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3325, pruned_loss=0.09483, over 4281776.87 frames. ], batch size: 263, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:01:30,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=728004.0, ans=0.1 2023-06-20 14:02:05,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.765e+02 3.288e+02 3.804e+02 6.956e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 14:02:14,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=728124.0, ans=0.125 2023-06-20 14:02:32,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-20 14:02:49,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-20 14:03:04,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=728244.0, ans=0.0 2023-06-20 14:03:06,607 INFO [train.py:996] (1/4) Epoch 4, batch 29900, loss[loss=0.305, simple_loss=0.357, pruned_loss=0.1265, over 21754.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.33, pruned_loss=0.09569, over 4288500.93 frames. ], batch size: 351, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:03:58,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=728424.0, ans=0.2 2023-06-20 14:04:28,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=728544.0, ans=0.125 2023-06-20 14:04:50,334 INFO [train.py:996] (1/4) Epoch 4, batch 29950, loss[loss=0.2217, simple_loss=0.2624, pruned_loss=0.09054, over 20193.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3335, pruned_loss=0.09965, over 4286950.31 frames. 
], batch size: 707, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:05:36,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.024e+02 3.410e+02 4.010e+02 6.604e+02, threshold=6.821e+02, percent-clipped=1.0 2023-06-20 14:06:03,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-20 14:06:04,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=728784.0, ans=0.0 2023-06-20 14:06:13,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=728784.0, ans=0.125 2023-06-20 14:06:14,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=728844.0, ans=0.0 2023-06-20 14:06:19,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=728844.0, ans=0.0 2023-06-20 14:06:33,128 INFO [train.py:996] (1/4) Epoch 4, batch 30000, loss[loss=0.2532, simple_loss=0.3452, pruned_loss=0.08055, over 21716.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3369, pruned_loss=0.1005, over 4287906.89 frames. ], batch size: 351, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:06:33,129 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 14:06:55,178 INFO [train.py:1028] (1/4) Epoch 4, validation: loss=0.2513, simple_loss=0.3514, pruned_loss=0.07557, over 1796401.00 frames. 2023-06-20 14:06:55,179 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 14:08:48,652 INFO [train.py:996] (1/4) Epoch 4, batch 30050, loss[loss=0.2822, simple_loss=0.3836, pruned_loss=0.09039, over 21864.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3392, pruned_loss=0.0971, over 4282762.70 frames. ], batch size: 372, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:08:55,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=729204.0, ans=0.125 2023-06-20 14:09:23,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=729264.0, ans=0.1 2023-06-20 14:09:28,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=729264.0, ans=0.125 2023-06-20 14:09:34,074 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.668e+02 3.143e+02 3.876e+02 8.051e+02, threshold=6.286e+02, percent-clipped=2.0 2023-06-20 14:10:30,481 INFO [train.py:996] (1/4) Epoch 4, batch 30100, loss[loss=0.2573, simple_loss=0.308, pruned_loss=0.1033, over 21447.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.337, pruned_loss=0.09702, over 4279570.25 frames. 
], batch size: 389, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:10:58,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=729564.0, ans=0.07 2023-06-20 14:11:15,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=729624.0, ans=0.1 2023-06-20 14:11:43,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=729684.0, ans=0.0 2023-06-20 14:12:13,032 INFO [train.py:996] (1/4) Epoch 4, batch 30150, loss[loss=0.2449, simple_loss=0.2952, pruned_loss=0.0973, over 20188.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3346, pruned_loss=0.099, over 4270814.90 frames. ], batch size: 702, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:12:26,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=729804.0, ans=0.125 2023-06-20 14:12:47,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=729864.0, ans=0.0 2023-06-20 14:12:50,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-20 14:13:07,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.286e+02 3.695e+02 4.286e+02 6.850e+02, threshold=7.389e+02, percent-clipped=1.0 2023-06-20 14:13:09,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=729924.0, ans=0.0 2023-06-20 14:13:23,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=729924.0, ans=0.125 2023-06-20 14:13:40,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=729984.0, ans=0.125 2023-06-20 14:13:46,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=730044.0, ans=0.0 2023-06-20 14:14:05,635 INFO [train.py:996] (1/4) Epoch 4, batch 30200, loss[loss=0.2512, simple_loss=0.3043, pruned_loss=0.09902, over 21268.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.336, pruned_loss=0.09815, over 4268405.79 frames. ], batch size: 549, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:14:49,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=730224.0, ans=0.1 2023-06-20 14:14:57,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-20 14:15:34,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=730344.0, ans=0.0 2023-06-20 14:15:36,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=730344.0, ans=0.125 2023-06-20 14:15:41,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=730344.0, ans=0.0 2023-06-20 14:15:55,829 INFO [train.py:996] (1/4) Epoch 4, batch 30250, loss[loss=0.267, simple_loss=0.3447, pruned_loss=0.09461, over 21153.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3426, pruned_loss=0.1001, over 4271468.49 frames. 
], batch size: 143, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:16:12,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-20 14:16:34,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-20 14:16:35,799 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.958e+02 3.586e+02 4.420e+02 6.930e+02, threshold=7.173e+02, percent-clipped=0.0 2023-06-20 14:16:50,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-20 14:17:37,303 INFO [train.py:996] (1/4) Epoch 4, batch 30300, loss[loss=0.2344, simple_loss=0.2929, pruned_loss=0.08792, over 21827.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3421, pruned_loss=0.1004, over 4272152.47 frames. ], batch size: 118, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:18:16,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=730824.0, ans=6.0 2023-06-20 14:18:30,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=730824.0, ans=0.125 2023-06-20 14:18:48,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=730884.0, ans=0.125 2023-06-20 14:19:16,755 INFO [train.py:996] (1/4) Epoch 4, batch 30350, loss[loss=0.4222, simple_loss=0.4868, pruned_loss=0.1788, over 21446.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3423, pruned_loss=0.1013, over 4264820.15 frames. ], batch size: 507, lr: 7.46e-03, grad_scale: 16.0 2023-06-20 14:19:34,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=731004.0, ans=0.125 2023-06-20 14:19:44,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=731064.0, ans=0.035 2023-06-20 14:19:55,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.242e+02 3.707e+02 4.467e+02 6.879e+02, threshold=7.414e+02, percent-clipped=0.0 2023-06-20 14:20:03,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=731124.0, ans=0.0 2023-06-20 14:20:06,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=731184.0, ans=0.0 2023-06-20 14:20:43,872 INFO [train.py:996] (1/4) Epoch 4, batch 30400, loss[loss=0.2373, simple_loss=0.2972, pruned_loss=0.08869, over 20016.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3375, pruned_loss=0.09987, over 4261510.20 frames. 
], batch size: 703, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:20:48,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=731304.0, ans=0.125 2023-06-20 14:21:37,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=731484.0, ans=0.0 2023-06-20 14:21:40,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=731484.0, ans=0.0 2023-06-20 14:21:56,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-20 14:22:05,351 INFO [train.py:996] (1/4) Epoch 4, batch 30450, loss[loss=0.3223, simple_loss=0.4306, pruned_loss=0.107, over 19897.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3394, pruned_loss=0.0998, over 4203473.30 frames. ], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:22:09,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=731604.0, ans=0.0 2023-06-20 14:22:18,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-20 14:22:21,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=731664.0, ans=0.015 2023-06-20 14:22:38,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=731724.0, ans=0.035 2023-06-20 14:22:42,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-20 14:22:43,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.992e+02 5.785e+02 8.199e+02 3.035e+03, threshold=1.157e+03, percent-clipped=30.0 2023-06-20 14:22:50,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=731784.0, ans=0.0 2023-06-20 14:23:09,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-20 14:25:02,460 INFO [train.py:996] (1/4) Epoch 5, batch 0, loss[loss=0.2645, simple_loss=0.3259, pruned_loss=0.1015, over 21808.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3259, pruned_loss=0.1015, over 21808.00 frames. ], batch size: 352, lr: 6.61e-03, grad_scale: 32.0 2023-06-20 14:25:02,461 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 14:25:18,272 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2519, simple_loss=0.3587, pruned_loss=0.07257, over 1796401.00 frames. 2023-06-20 14:25:18,273 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 14:25:50,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=731934.0, ans=0.0 2023-06-20 14:26:39,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. 
limit=8.0 2023-06-20 14:26:43,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=732114.0, ans=0.0 2023-06-20 14:26:54,769 INFO [train.py:996] (1/4) Epoch 5, batch 50, loss[loss=0.2559, simple_loss=0.3291, pruned_loss=0.09138, over 21638.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3505, pruned_loss=0.1053, over 968528.11 frames. ], batch size: 247, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:26:56,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=732174.0, ans=0.2 2023-06-20 14:27:11,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732174.0, ans=0.1 2023-06-20 14:27:53,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.548e+02 3.294e+02 4.063e+02 6.432e+02 1.595e+03, threshold=8.127e+02, percent-clipped=6.0 2023-06-20 14:28:23,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732414.0, ans=0.1 2023-06-20 14:28:32,481 INFO [train.py:996] (1/4) Epoch 5, batch 100, loss[loss=0.2773, simple_loss=0.3653, pruned_loss=0.09464, over 21681.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3592, pruned_loss=0.1038, over 1695946.14 frames. ], batch size: 441, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:28:42,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-06-20 14:29:12,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=732594.0, ans=0.0 2023-06-20 14:29:34,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=732654.0, ans=0.125 2023-06-20 14:29:35,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-20 14:29:46,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=732654.0, ans=0.0 2023-06-20 14:29:46,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=732654.0, ans=0.125 2023-06-20 14:29:57,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=732714.0, ans=0.0 2023-06-20 14:30:09,606 INFO [train.py:996] (1/4) Epoch 5, batch 150, loss[loss=0.2716, simple_loss=0.3527, pruned_loss=0.09526, over 21580.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3571, pruned_loss=0.1006, over 2264597.04 frames. 
], batch size: 230, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:30:29,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=732834.0, ans=10.0 2023-06-20 14:30:52,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=732894.0, ans=0.0 2023-06-20 14:31:07,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.823e+02 3.207e+02 3.913e+02 7.422e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-20 14:31:09,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732954.0, ans=0.1 2023-06-20 14:31:28,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=732954.0, ans=0.1 2023-06-20 14:31:50,623 INFO [train.py:996] (1/4) Epoch 5, batch 200, loss[loss=0.3025, simple_loss=0.3567, pruned_loss=0.1242, over 21891.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3557, pruned_loss=0.1017, over 2702950.67 frames. ], batch size: 351, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:32:08,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=733074.0, ans=0.2 2023-06-20 14:32:16,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=733134.0, ans=0.125 2023-06-20 14:33:20,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-20 14:33:22,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=733314.0, ans=0.125 2023-06-20 14:33:32,163 INFO [train.py:996] (1/4) Epoch 5, batch 250, loss[loss=0.3065, simple_loss=0.3879, pruned_loss=0.1126, over 21719.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3525, pruned_loss=0.102, over 3054459.14 frames. ], batch size: 414, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:34:29,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=733554.0, ans=0.125 2023-06-20 14:34:30,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.953e+02 3.443e+02 4.116e+02 7.444e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-20 14:35:14,581 INFO [train.py:996] (1/4) Epoch 5, batch 300, loss[loss=0.2521, simple_loss=0.3251, pruned_loss=0.08954, over 21463.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3443, pruned_loss=0.09984, over 3326148.03 frames. ], batch size: 194, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:35:33,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=733734.0, ans=0.125 2023-06-20 14:36:51,858 INFO [train.py:996] (1/4) Epoch 5, batch 350, loss[loss=0.2602, simple_loss=0.3117, pruned_loss=0.1044, over 21729.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3348, pruned_loss=0.09791, over 3545125.06 frames. ], batch size: 371, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:37:17,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. 
limit=22.5 2023-06-20 14:37:40,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=734094.0, ans=0.0 2023-06-20 14:37:51,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.828e+02 3.218e+02 3.899e+02 6.662e+02, threshold=6.437e+02, percent-clipped=0.0 2023-06-20 14:38:12,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=734154.0, ans=0.125 2023-06-20 14:38:24,673 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:38:33,743 INFO [train.py:996] (1/4) Epoch 5, batch 400, loss[loss=0.2114, simple_loss=0.2708, pruned_loss=0.07596, over 21483.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3261, pruned_loss=0.09415, over 3699366.50 frames. ], batch size: 195, lr: 6.59e-03, grad_scale: 32.0 2023-06-20 14:38:58,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=734334.0, ans=0.2 2023-06-20 14:40:05,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=734514.0, ans=0.0 2023-06-20 14:40:15,212 INFO [train.py:996] (1/4) Epoch 5, batch 450, loss[loss=0.2546, simple_loss=0.3211, pruned_loss=0.09403, over 21325.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3239, pruned_loss=0.09226, over 3825310.69 frames. ], batch size: 551, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:40:15,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=734574.0, ans=0.0 2023-06-20 14:40:33,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-20 14:40:49,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=734634.0, ans=0.0 2023-06-20 14:41:05,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=734694.0, ans=0.04949747468305833 2023-06-20 14:41:11,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-20 14:41:20,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.914e+02 3.934e+02 5.626e+02 1.302e+03, threshold=7.868e+02, percent-clipped=18.0 2023-06-20 14:41:22,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=734754.0, ans=0.125 2023-06-20 14:41:23,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=734754.0, ans=0.0 2023-06-20 14:41:37,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=734754.0, ans=0.0 2023-06-20 14:41:55,811 INFO [train.py:996] (1/4) Epoch 5, batch 500, loss[loss=0.223, simple_loss=0.2849, pruned_loss=0.08057, over 21713.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3258, pruned_loss=0.09143, over 3932699.85 frames. 
], batch size: 124, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:42:15,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=734934.0, ans=0.1 2023-06-20 14:43:37,132 INFO [train.py:996] (1/4) Epoch 5, batch 550, loss[loss=0.2651, simple_loss=0.3366, pruned_loss=0.09684, over 21736.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3299, pruned_loss=0.09205, over 4018580.75 frames. ], batch size: 316, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:43:53,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=735174.0, ans=0.125 2023-06-20 14:44:06,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=735234.0, ans=0.125 2023-06-20 14:44:33,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=735354.0, ans=0.0 2023-06-20 14:44:36,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.996e+02 3.563e+02 4.226e+02 6.619e+02, threshold=7.127e+02, percent-clipped=0.0 2023-06-20 14:44:44,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-20 14:45:07,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=735414.0, ans=0.125 2023-06-20 14:45:16,881 INFO [train.py:996] (1/4) Epoch 5, batch 600, loss[loss=0.2867, simple_loss=0.3654, pruned_loss=0.104, over 21605.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3327, pruned_loss=0.09287, over 4079465.88 frames. ], batch size: 230, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:45:38,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735534.0, ans=0.1 2023-06-20 14:45:48,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=735534.0, ans=0.125 2023-06-20 14:46:32,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-20 14:46:59,135 INFO [train.py:996] (1/4) Epoch 5, batch 650, loss[loss=0.2428, simple_loss=0.3099, pruned_loss=0.08781, over 21757.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3324, pruned_loss=0.09258, over 4116419.87 frames. ], batch size: 332, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:46:59,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=735774.0, ans=0.0 2023-06-20 14:47:19,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=735834.0, ans=0.0 2023-06-20 14:47:32,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-20 14:47:36,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=735894.0, ans=0.125 2023-06-20 14:47:58,873 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.943e+02 3.470e+02 4.276e+02 7.197e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-20 14:48:03,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-20 14:48:29,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=736014.0, ans=0.0 2023-06-20 14:48:40,841 INFO [train.py:996] (1/4) Epoch 5, batch 700, loss[loss=0.3206, simple_loss=0.3663, pruned_loss=0.1375, over 21315.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3323, pruned_loss=0.09446, over 4158045.73 frames. ], batch size: 471, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:48:41,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=736074.0, ans=0.125 2023-06-20 14:50:21,193 INFO [train.py:996] (1/4) Epoch 5, batch 750, loss[loss=0.2841, simple_loss=0.3608, pruned_loss=0.1037, over 21433.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3323, pruned_loss=0.09523, over 4194163.43 frames. ], batch size: 211, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:51:22,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 2.979e+02 3.412e+02 4.334e+02 7.194e+02, threshold=6.824e+02, percent-clipped=1.0 2023-06-20 14:51:42,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=736554.0, ans=0.125 2023-06-20 14:51:44,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-20 14:51:51,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=736614.0, ans=0.05 2023-06-20 14:52:00,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=22.5 2023-06-20 14:52:04,135 INFO [train.py:996] (1/4) Epoch 5, batch 800, loss[loss=0.2047, simple_loss=0.3294, pruned_loss=0.03997, over 19838.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3295, pruned_loss=0.09493, over 4209811.48 frames. ], batch size: 703, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:52:49,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-20 14:53:12,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-20 14:53:30,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.69 vs. limit=8.0 2023-06-20 14:53:46,641 INFO [train.py:996] (1/4) Epoch 5, batch 850, loss[loss=0.2715, simple_loss=0.322, pruned_loss=0.1105, over 21398.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.33, pruned_loss=0.09632, over 4230316.71 frames. 
], batch size: 144, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:54:00,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=736974.0, ans=0.0 2023-06-20 14:54:19,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=737034.0, ans=0.05 2023-06-20 14:54:57,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.960e+02 3.459e+02 4.425e+02 7.988e+02, threshold=6.917e+02, percent-clipped=3.0 2023-06-20 14:55:12,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-20 14:55:32,813 INFO [train.py:996] (1/4) Epoch 5, batch 900, loss[loss=0.3021, simple_loss=0.3403, pruned_loss=0.1319, over 21637.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3268, pruned_loss=0.09534, over 4248488.01 frames. ], batch size: 508, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:55:41,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=737274.0, ans=0.0 2023-06-20 14:56:03,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=737334.0, ans=0.2 2023-06-20 14:56:23,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=737394.0, ans=15.0 2023-06-20 14:56:41,106 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:57:13,133 INFO [train.py:996] (1/4) Epoch 5, batch 950, loss[loss=0.2494, simple_loss=0.3227, pruned_loss=0.08801, over 21748.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3234, pruned_loss=0.09356, over 4258413.38 frames. ], batch size: 441, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:57:21,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=737574.0, ans=0.5 2023-06-20 14:58:13,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=737694.0, ans=0.0 2023-06-20 14:58:18,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.716e+02 3.192e+02 3.705e+02 5.586e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-20 14:58:22,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=737754.0, ans=0.0 2023-06-20 14:58:25,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=737754.0, ans=0.125 2023-06-20 14:58:54,031 INFO [train.py:996] (1/4) Epoch 5, batch 1000, loss[loss=0.3143, simple_loss=0.36, pruned_loss=0.1343, over 21619.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3232, pruned_loss=0.09304, over 4268265.31 frames. 
], batch size: 471, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:58:58,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=737874.0, ans=0.125 2023-06-20 14:59:15,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=737934.0, ans=0.125 2023-06-20 14:59:56,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=737994.0, ans=0.125 2023-06-20 14:59:58,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.65 vs. limit=15.0 2023-06-20 15:00:29,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=738114.0, ans=0.2 2023-06-20 15:00:37,272 INFO [train.py:996] (1/4) Epoch 5, batch 1050, loss[loss=0.2925, simple_loss=0.3548, pruned_loss=0.1151, over 21798.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3217, pruned_loss=0.09272, over 4277038.97 frames. ], batch size: 441, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:00:53,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=738174.0, ans=0.1 2023-06-20 15:01:44,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.880e+02 3.329e+02 4.012e+02 6.640e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-20 15:01:54,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=738354.0, ans=0.125 2023-06-20 15:02:09,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=738414.0, ans=0.1 2023-06-20 15:02:22,416 INFO [train.py:996] (1/4) Epoch 5, batch 1100, loss[loss=0.2691, simple_loss=0.3198, pruned_loss=0.1092, over 21273.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3243, pruned_loss=0.09352, over 4279951.11 frames. ], batch size: 159, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:02:48,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-20 15:02:52,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=738534.0, ans=0.1 2023-06-20 15:03:09,641 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:03:11,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=738594.0, ans=0.125 2023-06-20 15:03:19,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=738594.0, ans=0.1 2023-06-20 15:03:55,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=738714.0, ans=0.5 2023-06-20 15:03:55,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=738714.0, ans=0.125 2023-06-20 15:04:10,624 INFO [train.py:996] (1/4) Epoch 5, batch 1150, loss[loss=0.2135, simple_loss=0.2915, pruned_loss=0.06774, over 21482.00 frames. 
], tot_loss[loss=0.2559, simple_loss=0.3253, pruned_loss=0.09325, over 4277084.86 frames. ], batch size: 212, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:04:37,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=15.0 2023-06-20 15:05:18,290 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.853e+02 3.547e+02 4.502e+02 9.164e+02, threshold=7.095e+02, percent-clipped=7.0 2023-06-20 15:06:05,925 INFO [train.py:996] (1/4) Epoch 5, batch 1200, loss[loss=0.2318, simple_loss=0.3252, pruned_loss=0.06924, over 21836.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3269, pruned_loss=0.09392, over 4281488.92 frames. ], batch size: 316, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:06:37,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=739194.0, ans=0.2 2023-06-20 15:06:43,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-20 15:07:03,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=739254.0, ans=0.035 2023-06-20 15:07:48,960 INFO [train.py:996] (1/4) Epoch 5, batch 1250, loss[loss=0.2783, simple_loss=0.3348, pruned_loss=0.1109, over 21632.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3301, pruned_loss=0.09525, over 4284015.02 frames. ], batch size: 230, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:08:45,304 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.897e+02 3.475e+02 4.049e+02 7.365e+02, threshold=6.950e+02, percent-clipped=1.0 2023-06-20 15:09:33,270 INFO [train.py:996] (1/4) Epoch 5, batch 1300, loss[loss=0.235, simple_loss=0.3087, pruned_loss=0.08064, over 21432.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3303, pruned_loss=0.09563, over 4284968.89 frames. ], batch size: 194, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:09:43,483 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:10:03,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=739734.0, ans=0.2 2023-06-20 15:10:49,967 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:11:16,818 INFO [train.py:996] (1/4) Epoch 5, batch 1350, loss[loss=0.2864, simple_loss=0.3632, pruned_loss=0.1048, over 21419.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3332, pruned_loss=0.09691, over 4291492.10 frames. ], batch size: 194, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:11:18,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=739974.0, ans=0.125 2023-06-20 15:11:32,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. 
limit=22.5 2023-06-20 15:11:48,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=740094.0, ans=0.125 2023-06-20 15:12:12,794 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.977e+02 3.601e+02 4.425e+02 6.870e+02, threshold=7.202e+02, percent-clipped=0.0 2023-06-20 15:13:00,164 INFO [train.py:996] (1/4) Epoch 5, batch 1400, loss[loss=0.2247, simple_loss=0.2845, pruned_loss=0.08243, over 21772.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3301, pruned_loss=0.09632, over 4292993.30 frames. ], batch size: 112, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:13:49,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-20 15:14:03,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=740454.0, ans=0.2 2023-06-20 15:14:35,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=740514.0, ans=0.0 2023-06-20 15:14:43,268 INFO [train.py:996] (1/4) Epoch 5, batch 1450, loss[loss=0.2514, simple_loss=0.3069, pruned_loss=0.09797, over 21645.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3296, pruned_loss=0.0976, over 4285566.50 frames. ], batch size: 414, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:14:43,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=740574.0, ans=0.0 2023-06-20 15:15:05,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=740634.0, ans=0.125 2023-06-20 15:15:37,764 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 2.878e+02 3.343e+02 3.968e+02 7.161e+02, threshold=6.685e+02, percent-clipped=0.0 2023-06-20 15:16:02,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=740814.0, ans=0.125 2023-06-20 15:16:24,273 INFO [train.py:996] (1/4) Epoch 5, batch 1500, loss[loss=0.2702, simple_loss=0.334, pruned_loss=0.1032, over 21763.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3295, pruned_loss=0.09815, over 4290284.73 frames. ], batch size: 112, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:16:39,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=740934.0, ans=0.125 2023-06-20 15:17:11,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=740994.0, ans=0.125 2023-06-20 15:17:18,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=741054.0, ans=10.0 2023-06-20 15:17:30,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-20 15:18:10,002 INFO [train.py:996] (1/4) Epoch 5, batch 1550, loss[loss=0.2349, simple_loss=0.3258, pruned_loss=0.07202, over 21811.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3293, pruned_loss=0.09729, over 4290598.00 frames. 
], batch size: 371, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:18:31,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=741234.0, ans=0.2 2023-06-20 15:19:09,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.929e+02 3.428e+02 4.021e+02 6.196e+02, threshold=6.855e+02, percent-clipped=0.0 2023-06-20 15:19:16,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=741354.0, ans=0.125 2023-06-20 15:19:33,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=741414.0, ans=0.0 2023-06-20 15:19:40,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=741414.0, ans=0.0 2023-06-20 15:19:51,907 INFO [train.py:996] (1/4) Epoch 5, batch 1600, loss[loss=0.2241, simple_loss=0.3331, pruned_loss=0.05754, over 20915.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.329, pruned_loss=0.09625, over 4291478.47 frames. ], batch size: 608, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:19:59,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=741474.0, ans=0.0 2023-06-20 15:20:04,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-20 15:20:32,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-20 15:20:35,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=741594.0, ans=0.125 2023-06-20 15:20:49,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-20 15:21:25,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=741714.0, ans=0.95 2023-06-20 15:21:35,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-20 15:21:36,254 INFO [train.py:996] (1/4) Epoch 5, batch 1650, loss[loss=0.3166, simple_loss=0.3897, pruned_loss=0.1218, over 21399.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3289, pruned_loss=0.09634, over 4290221.52 frames. 
], batch size: 507, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:21:46,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=741774.0, ans=0.1 2023-06-20 15:22:07,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=741834.0, ans=0.1 2023-06-20 15:22:29,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=741894.0, ans=0.1 2023-06-20 15:22:39,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=741954.0, ans=0.0 2023-06-20 15:22:51,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.044e+02 3.519e+02 4.349e+02 7.461e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:23:19,931 INFO [train.py:996] (1/4) Epoch 5, batch 1700, loss[loss=0.2728, simple_loss=0.3589, pruned_loss=0.09336, over 21732.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3317, pruned_loss=0.09657, over 4285067.85 frames. ], batch size: 332, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:23:51,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.96 vs. limit=6.0 2023-06-20 15:23:57,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=742194.0, ans=0.125 2023-06-20 15:25:00,561 INFO [train.py:996] (1/4) Epoch 5, batch 1750, loss[loss=0.3074, simple_loss=0.363, pruned_loss=0.1259, over 21357.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3304, pruned_loss=0.09508, over 4283170.07 frames. ], batch size: 549, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:25:18,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-20 15:25:59,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=742494.0, ans=0.2 2023-06-20 15:26:06,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=742554.0, ans=0.125 2023-06-20 15:26:13,028 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.877e+02 3.732e+02 4.371e+02 8.077e+02, threshold=7.464e+02, percent-clipped=3.0 2023-06-20 15:26:20,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=742554.0, ans=15.0 2023-06-20 15:26:46,862 INFO [train.py:996] (1/4) Epoch 5, batch 1800, loss[loss=0.2749, simple_loss=0.3445, pruned_loss=0.1027, over 21633.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3272, pruned_loss=0.09164, over 4279999.80 frames. ], batch size: 263, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:27:35,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=742794.0, ans=0.125 2023-06-20 15:28:30,571 INFO [train.py:996] (1/4) Epoch 5, batch 1850, loss[loss=0.2145, simple_loss=0.3009, pruned_loss=0.06408, over 21359.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.328, pruned_loss=0.08938, over 4273119.53 frames. 
], batch size: 548, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:28:46,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=742974.0, ans=0.125 2023-06-20 15:29:32,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=743094.0, ans=0.125 2023-06-20 15:29:40,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.730e+02 3.304e+02 4.070e+02 7.005e+02, threshold=6.608e+02, percent-clipped=0.0 2023-06-20 15:29:55,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=743214.0, ans=0.125 2023-06-20 15:30:18,487 INFO [train.py:996] (1/4) Epoch 5, batch 1900, loss[loss=0.2585, simple_loss=0.3179, pruned_loss=0.09956, over 21491.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3279, pruned_loss=0.08999, over 4274891.85 frames. ], batch size: 211, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:30:22,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=743274.0, ans=0.125 2023-06-20 15:30:25,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=743274.0, ans=0.0 2023-06-20 15:30:27,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-20 15:30:46,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743334.0, ans=0.1 2023-06-20 15:30:53,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=743334.0, ans=0.125 2023-06-20 15:31:13,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=743394.0, ans=0.125 2023-06-20 15:31:24,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=743454.0, ans=0.125 2023-06-20 15:31:39,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-20 15:32:02,673 INFO [train.py:996] (1/4) Epoch 5, batch 1950, loss[loss=0.2461, simple_loss=0.3502, pruned_loss=0.071, over 19874.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3243, pruned_loss=0.09029, over 4266127.25 frames. ], batch size: 703, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:32:56,428 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=12.0 2023-06-20 15:32:59,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=743694.0, ans=0.0 2023-06-20 15:33:10,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.018e+02 3.404e+02 4.237e+02 8.100e+02, threshold=6.807e+02, percent-clipped=3.0 2023-06-20 15:33:49,084 INFO [train.py:996] (1/4) Epoch 5, batch 2000, loss[loss=0.403, simple_loss=0.4596, pruned_loss=0.1732, over 21454.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3179, pruned_loss=0.09015, over 4267831.63 frames. 
], batch size: 507, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:33:51,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=743874.0, ans=0.025 2023-06-20 15:34:21,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=743934.0, ans=0.1 2023-06-20 15:34:28,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=743934.0, ans=0.0 2023-06-20 15:34:30,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=743934.0, ans=0.125 2023-06-20 15:35:28,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=744114.0, ans=0.125 2023-06-20 15:35:34,233 INFO [train.py:996] (1/4) Epoch 5, batch 2050, loss[loss=0.2579, simple_loss=0.3192, pruned_loss=0.09835, over 21905.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3219, pruned_loss=0.09108, over 4276123.91 frames. ], batch size: 316, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:35:34,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=744174.0, ans=0.125 2023-06-20 15:35:58,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=744234.0, ans=0.0 2023-06-20 15:36:11,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744294.0, ans=0.1 2023-06-20 15:36:15,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=744294.0, ans=0.2 2023-06-20 15:36:39,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.809e+02 3.320e+02 4.171e+02 6.443e+02, threshold=6.640e+02, percent-clipped=0.0 2023-06-20 15:37:04,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-20 15:37:12,246 INFO [train.py:996] (1/4) Epoch 5, batch 2100, loss[loss=0.2812, simple_loss=0.3539, pruned_loss=0.1043, over 21454.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3267, pruned_loss=0.09383, over 4270060.51 frames. ], batch size: 194, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:37:27,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=744474.0, ans=0.0 2023-06-20 15:38:25,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.83 vs. limit=15.0 2023-06-20 15:38:28,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-20 15:38:29,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=744654.0, ans=0.09899494936611666 2023-06-20 15:38:48,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=744714.0, ans=0.0 2023-06-20 15:39:06,033 INFO [train.py:996] (1/4) Epoch 5, batch 2150, loss[loss=0.2661, simple_loss=0.317, pruned_loss=0.1076, over 21173.00 frames. 
], tot_loss[loss=0.2583, simple_loss=0.3284, pruned_loss=0.09409, over 4276409.17 frames. ], batch size: 143, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:39:23,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-20 15:39:30,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=744834.0, ans=0.125 2023-06-20 15:39:31,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=744834.0, ans=0.95 2023-06-20 15:39:55,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=744894.0, ans=0.125 2023-06-20 15:40:10,999 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.145e+02 3.806e+02 5.141e+02 9.299e+02, threshold=7.611e+02, percent-clipped=10.0 2023-06-20 15:40:21,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=744954.0, ans=0.125 2023-06-20 15:40:41,723 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:40:44,289 INFO [train.py:996] (1/4) Epoch 5, batch 2200, loss[loss=0.236, simple_loss=0.3126, pruned_loss=0.07968, over 21290.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3307, pruned_loss=0.09466, over 4282613.33 frames. ], batch size: 176, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:40:55,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=745074.0, ans=0.2 2023-06-20 15:41:05,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-20 15:41:19,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=745134.0, ans=0.1 2023-06-20 15:41:46,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=745194.0, ans=0.125 2023-06-20 15:41:58,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=745254.0, ans=0.0 2023-06-20 15:41:58,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=745254.0, ans=0.0 2023-06-20 15:42:40,443 INFO [train.py:996] (1/4) Epoch 5, batch 2250, loss[loss=0.247, simple_loss=0.3106, pruned_loss=0.09172, over 21781.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3269, pruned_loss=0.09189, over 4277796.72 frames. 
], batch size: 351, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:43:03,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=745434.0, ans=0.04949747468305833 2023-06-20 15:43:12,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=745494.0, ans=0.1 2023-06-20 15:43:46,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.695e+02 3.113e+02 3.722e+02 7.366e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-20 15:43:49,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=745554.0, ans=15.0 2023-06-20 15:44:04,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=745614.0, ans=0.125 2023-06-20 15:44:23,180 INFO [train.py:996] (1/4) Epoch 5, batch 2300, loss[loss=0.2009, simple_loss=0.2579, pruned_loss=0.07195, over 21491.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.32, pruned_loss=0.09077, over 4279571.80 frames. ], batch size: 212, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:44:26,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=745674.0, ans=0.2 2023-06-20 15:44:36,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=745674.0, ans=0.0 2023-06-20 15:45:47,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.91 vs. limit=15.0 2023-06-20 15:46:06,131 INFO [train.py:996] (1/4) Epoch 5, batch 2350, loss[loss=0.2499, simple_loss=0.3215, pruned_loss=0.08911, over 21358.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3167, pruned_loss=0.09139, over 4274179.71 frames. ], batch size: 131, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:47:12,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.470e+02 3.194e+02 3.678e+02 4.653e+02 7.153e+02, threshold=7.356e+02, percent-clipped=3.0 2023-06-20 15:47:27,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=746214.0, ans=0.125 2023-06-20 15:47:50,561 INFO [train.py:996] (1/4) Epoch 5, batch 2400, loss[loss=0.2894, simple_loss=0.3546, pruned_loss=0.1121, over 21428.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3207, pruned_loss=0.09469, over 4270754.75 frames. 
], batch size: 131, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:47:58,227 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:47:59,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=746274.0, ans=0.125 2023-06-20 15:48:04,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=746274.0, ans=0.125 2023-06-20 15:48:17,112 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:48:18,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=746334.0, ans=0.125 2023-06-20 15:49:08,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=746454.0, ans=0.125 2023-06-20 15:49:27,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=746514.0, ans=0.07 2023-06-20 15:49:32,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=746514.0, ans=6.0 2023-06-20 15:49:36,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 15:49:36,778 INFO [train.py:996] (1/4) Epoch 5, batch 2450, loss[loss=0.2616, simple_loss=0.3438, pruned_loss=0.08966, over 21530.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3263, pruned_loss=0.09704, over 4274921.75 frames. ], batch size: 230, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:50:24,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=746694.0, ans=0.0 2023-06-20 15:50:34,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=746754.0, ans=0.1 2023-06-20 15:50:47,927 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.072e+02 3.519e+02 4.096e+02 7.474e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:51:16,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=746814.0, ans=0.0 2023-06-20 15:51:19,727 INFO [train.py:996] (1/4) Epoch 5, batch 2500, loss[loss=0.3028, simple_loss=0.3633, pruned_loss=0.1212, over 21313.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3265, pruned_loss=0.09703, over 4257284.05 frames. ], batch size: 471, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:51:30,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=746874.0, ans=0.125 2023-06-20 15:51:34,458 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. 
limit=15.0 2023-06-20 15:51:50,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=746934.0, ans=0.0 2023-06-20 15:52:13,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=746994.0, ans=0.07 2023-06-20 15:53:02,334 INFO [train.py:996] (1/4) Epoch 5, batch 2550, loss[loss=0.2568, simple_loss=0.3449, pruned_loss=0.08436, over 21723.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3258, pruned_loss=0.09621, over 4259463.41 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:53:46,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=747294.0, ans=0.04949747468305833 2023-06-20 15:54:06,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.818e+02 3.202e+02 3.816e+02 6.010e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-20 15:54:17,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-20 15:54:20,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-20 15:54:23,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=747414.0, ans=10.0 2023-06-20 15:54:43,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=747474.0, ans=0.125 2023-06-20 15:54:44,525 INFO [train.py:996] (1/4) Epoch 5, batch 2600, loss[loss=0.2386, simple_loss=0.2948, pruned_loss=0.09117, over 21519.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3241, pruned_loss=0.09599, over 4263658.82 frames. ], batch size: 391, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:54:53,030 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:55:20,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-20 15:56:27,522 INFO [train.py:996] (1/4) Epoch 5, batch 2650, loss[loss=0.2516, simple_loss=0.3108, pruned_loss=0.09622, over 20104.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3267, pruned_loss=0.09738, over 4268461.02 frames. 
], batch size: 702, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:56:49,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=747834.0, ans=0.0 2023-06-20 15:57:04,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=747894.0, ans=0.125 2023-06-20 15:57:27,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=747954.0, ans=0.125 2023-06-20 15:57:39,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.025e+02 3.669e+02 4.481e+02 6.938e+02, threshold=7.338e+02, percent-clipped=2.0 2023-06-20 15:57:39,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=747954.0, ans=0.0 2023-06-20 15:57:42,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=747954.0, ans=0.125 2023-06-20 15:58:07,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=748014.0, ans=0.07 2023-06-20 15:58:10,845 INFO [train.py:996] (1/4) Epoch 5, batch 2700, loss[loss=0.2201, simple_loss=0.2804, pruned_loss=0.07991, over 21787.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3265, pruned_loss=0.09675, over 4278714.81 frames. ], batch size: 102, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:58:12,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=748074.0, ans=0.0 2023-06-20 15:58:38,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=748134.0, ans=0.125 2023-06-20 15:58:45,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=748134.0, ans=0.035 2023-06-20 15:59:02,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=748194.0, ans=0.05 2023-06-20 15:59:51,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=748374.0, ans=0.125 2023-06-20 15:59:52,537 INFO [train.py:996] (1/4) Epoch 5, batch 2750, loss[loss=0.266, simple_loss=0.3423, pruned_loss=0.09489, over 21701.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3245, pruned_loss=0.09543, over 4275545.72 frames. ], batch size: 391, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:59:56,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=748374.0, ans=0.2 2023-06-20 16:00:20,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=748434.0, ans=0.125 2023-06-20 16:00:43,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748494.0, ans=0.125 2023-06-20 16:01:06,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.120e+02 3.768e+02 4.829e+02 8.745e+02, threshold=7.536e+02, percent-clipped=5.0 2023-06-20 16:01:17,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. 
limit=15.0 2023-06-20 16:01:20,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=748614.0, ans=0.125 2023-06-20 16:01:36,372 INFO [train.py:996] (1/4) Epoch 5, batch 2800, loss[loss=0.2557, simple_loss=0.3243, pruned_loss=0.09361, over 21566.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3311, pruned_loss=0.09731, over 4272218.20 frames. ], batch size: 263, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:02:24,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=748794.0, ans=0.125 2023-06-20 16:02:30,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-20 16:02:56,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-20 16:03:02,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=748854.0, ans=0.125 2023-06-20 16:03:27,341 INFO [train.py:996] (1/4) Epoch 5, batch 2850, loss[loss=0.285, simple_loss=0.3637, pruned_loss=0.1032, over 21704.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3343, pruned_loss=0.09982, over 4264254.87 frames. ], batch size: 298, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:04:14,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=749094.0, ans=0.125 2023-06-20 16:04:16,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=749094.0, ans=0.0 2023-06-20 16:04:30,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=749154.0, ans=0.125 2023-06-20 16:04:42,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.197e+02 3.952e+02 5.010e+02 9.652e+02, threshold=7.904e+02, percent-clipped=7.0 2023-06-20 16:04:44,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=749154.0, ans=0.035 2023-06-20 16:04:52,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=749214.0, ans=0.0 2023-06-20 16:05:01,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749214.0, ans=0.1 2023-06-20 16:05:10,308 INFO [train.py:996] (1/4) Epoch 5, batch 2900, loss[loss=0.2597, simple_loss=0.3196, pruned_loss=0.09985, over 21889.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3314, pruned_loss=0.09708, over 4267080.97 frames. 
], batch size: 107, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:05:17,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749274.0, ans=0.125 2023-06-20 16:06:20,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=749454.0, ans=0.125 2023-06-20 16:06:25,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=749454.0, ans=0.125 2023-06-20 16:06:52,772 INFO [train.py:996] (1/4) Epoch 5, batch 2950, loss[loss=0.2507, simple_loss=0.3419, pruned_loss=0.07972, over 21824.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3319, pruned_loss=0.09732, over 4270706.28 frames. ], batch size: 332, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:07:29,712 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:07:31,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=749694.0, ans=0.125 2023-06-20 16:08:10,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.876e+02 3.268e+02 4.025e+02 7.097e+02, threshold=6.536e+02, percent-clipped=0.0 2023-06-20 16:08:36,210 INFO [train.py:996] (1/4) Epoch 5, batch 3000, loss[loss=0.2749, simple_loss=0.3523, pruned_loss=0.09879, over 21869.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3345, pruned_loss=0.09777, over 4274909.02 frames. ], batch size: 371, lr: 6.53e-03, grad_scale: 8.0 2023-06-20 16:08:36,210 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 16:08:55,130 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2579, simple_loss=0.3533, pruned_loss=0.08129, over 1796401.00 frames. 2023-06-20 16:08:55,131 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 16:09:44,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=749994.0, ans=0.125 2023-06-20 16:09:45,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=749994.0, ans=0.125 2023-06-20 16:09:53,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749994.0, ans=0.1 2023-06-20 16:10:25,570 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:10:28,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=750114.0, ans=0.125 2023-06-20 16:10:39,763 INFO [train.py:996] (1/4) Epoch 5, batch 3050, loss[loss=0.1969, simple_loss=0.2807, pruned_loss=0.05652, over 21225.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3343, pruned_loss=0.09639, over 4274796.48 frames. ], batch size: 176, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:11:52,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=750354.0, ans=0.125 2023-06-20 16:11:58,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.749e+02 3.160e+02 3.983e+02 6.617e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 16:12:25,599 INFO [train.py:996] (1/4) Epoch 5, batch 3100, loss[loss=0.2592, simple_loss=0.317, pruned_loss=0.1006, over 21577.00 frames. 
], tot_loss[loss=0.2611, simple_loss=0.3321, pruned_loss=0.09502, over 4280556.27 frames. ], batch size: 548, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:12:46,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=750534.0, ans=0.0 2023-06-20 16:14:15,818 INFO [train.py:996] (1/4) Epoch 5, batch 3150, loss[loss=0.2707, simple_loss=0.3416, pruned_loss=0.09985, over 21747.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3331, pruned_loss=0.09496, over 4280955.48 frames. ], batch size: 332, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:15:28,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.671e+02 3.239e+02 3.868e+02 6.706e+02, threshold=6.479e+02, percent-clipped=2.0 2023-06-20 16:15:48,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-20 16:15:55,769 INFO [train.py:996] (1/4) Epoch 5, batch 3200, loss[loss=0.247, simple_loss=0.3289, pruned_loss=0.0825, over 21588.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3334, pruned_loss=0.09425, over 4281827.12 frames. ], batch size: 263, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:15:56,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=751074.0, ans=0.125 2023-06-20 16:16:44,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=751194.0, ans=0.125 2023-06-20 16:16:46,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-20 16:16:53,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=751194.0, ans=0.125 2023-06-20 16:17:35,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=751314.0, ans=0.125 2023-06-20 16:17:40,137 INFO [train.py:996] (1/4) Epoch 5, batch 3250, loss[loss=0.3065, simple_loss=0.3644, pruned_loss=0.1243, over 21402.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.336, pruned_loss=0.09726, over 4282188.79 frames. ], batch size: 131, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:18:06,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-20 16:18:34,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=751494.0, ans=0.2 2023-06-20 16:18:42,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.86 vs. limit=15.0 2023-06-20 16:19:02,270 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.071e+02 3.454e+02 4.020e+02 6.852e+02, threshold=6.907e+02, percent-clipped=1.0 2023-06-20 16:19:23,811 INFO [train.py:996] (1/4) Epoch 5, batch 3300, loss[loss=0.3558, simple_loss=0.4251, pruned_loss=0.1433, over 21465.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.33, pruned_loss=0.0957, over 4277766.05 frames. 
], batch size: 507, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:19:24,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751674.0, ans=0.1 2023-06-20 16:19:43,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-20 16:19:48,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=751734.0, ans=0.2 2023-06-20 16:20:47,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=751854.0, ans=0.125 2023-06-20 16:21:09,008 INFO [train.py:996] (1/4) Epoch 5, batch 3350, loss[loss=0.2572, simple_loss=0.308, pruned_loss=0.1032, over 20139.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3328, pruned_loss=0.09602, over 4284016.92 frames. ], batch size: 702, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:21:15,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=751974.0, ans=0.125 2023-06-20 16:22:29,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-20 16:22:31,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.245e+02 3.953e+02 4.970e+02 1.057e+03, threshold=7.906e+02, percent-clipped=6.0 2023-06-20 16:22:51,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=752214.0, ans=0.125 2023-06-20 16:22:57,020 INFO [train.py:996] (1/4) Epoch 5, batch 3400, loss[loss=0.2309, simple_loss=0.2902, pruned_loss=0.08581, over 21146.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3338, pruned_loss=0.09697, over 4284859.54 frames. ], batch size: 143, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:23:28,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752334.0, ans=0.125 2023-06-20 16:23:34,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-20 16:24:20,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=752514.0, ans=0.1 2023-06-20 16:24:41,903 INFO [train.py:996] (1/4) Epoch 5, batch 3450, loss[loss=0.2752, simple_loss=0.3358, pruned_loss=0.1073, over 21586.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3288, pruned_loss=0.09646, over 4283352.54 frames. ], batch size: 230, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:24:50,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=752574.0, ans=0.09899494936611666 2023-06-20 16:25:41,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=752694.0, ans=15.0 2023-06-20 16:25:52,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.03 vs. 
limit=22.5 2023-06-20 16:26:05,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.288e+02 3.795e+02 4.947e+02 8.128e+02, threshold=7.589e+02, percent-clipped=1.0 2023-06-20 16:26:27,166 INFO [train.py:996] (1/4) Epoch 5, batch 3500, loss[loss=0.2978, simple_loss=0.3686, pruned_loss=0.1134, over 21811.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3383, pruned_loss=0.1013, over 4277812.06 frames. ], batch size: 118, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:26:31,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-20 16:26:48,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=752934.0, ans=0.95 2023-06-20 16:27:04,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=752934.0, ans=0.2 2023-06-20 16:27:46,657 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:27:57,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.82 vs. limit=5.0 2023-06-20 16:28:10,867 INFO [train.py:996] (1/4) Epoch 5, batch 3550, loss[loss=0.27, simple_loss=0.3261, pruned_loss=0.107, over 21433.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3407, pruned_loss=0.1024, over 4279470.75 frames. ], batch size: 389, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:28:18,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=753174.0, ans=0.125 2023-06-20 16:28:37,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-20 16:28:51,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 16:28:55,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=753234.0, ans=0.125 2023-06-20 16:29:05,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=753294.0, ans=0.125 2023-06-20 16:29:35,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.975e+02 3.456e+02 4.245e+02 7.529e+02, threshold=6.912e+02, percent-clipped=0.0 2023-06-20 16:29:35,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=753354.0, ans=0.0 2023-06-20 16:29:52,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=753414.0, ans=0.0 2023-06-20 16:30:01,822 INFO [train.py:996] (1/4) Epoch 5, batch 3600, loss[loss=0.305, simple_loss=0.3556, pruned_loss=0.1272, over 21648.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3347, pruned_loss=0.1016, over 4277596.51 frames. ], batch size: 391, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:30:54,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.66 vs. 
limit=22.5 2023-06-20 16:31:11,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-20 16:31:46,477 INFO [train.py:996] (1/4) Epoch 5, batch 3650, loss[loss=0.2624, simple_loss=0.347, pruned_loss=0.08894, over 21708.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3349, pruned_loss=0.1015, over 4276264.45 frames. ], batch size: 441, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:32:01,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 16:32:02,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-20 16:32:03,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=753774.0, ans=0.125 2023-06-20 16:32:40,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=753894.0, ans=0.125 2023-06-20 16:32:54,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=753954.0, ans=0.0 2023-06-20 16:33:02,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.073e+02 3.466e+02 4.352e+02 7.872e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-20 16:33:28,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-20 16:33:29,226 INFO [train.py:996] (1/4) Epoch 5, batch 3700, loss[loss=0.2313, simple_loss=0.3069, pruned_loss=0.07779, over 21656.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.333, pruned_loss=0.09976, over 4277272.28 frames. ], batch size: 230, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:34:36,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=754254.0, ans=0.125 2023-06-20 16:34:42,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-20 16:34:44,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=754254.0, ans=0.1 2023-06-20 16:34:56,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=754314.0, ans=0.125 2023-06-20 16:35:18,246 INFO [train.py:996] (1/4) Epoch 5, batch 3750, loss[loss=0.2288, simple_loss=0.2943, pruned_loss=0.08169, over 21250.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3307, pruned_loss=0.09923, over 4284975.66 frames. ], batch size: 143, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:35:43,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=754434.0, ans=0.0 2023-06-20 16:35:50,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-20 16:35:51,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=8.0 2023-06-20 16:35:55,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=754494.0, ans=0.0 2023-06-20 16:36:21,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=754554.0, ans=0.125 2023-06-20 16:36:33,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=754554.0, ans=0.0 2023-06-20 16:36:36,438 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.776e+02 3.273e+02 3.853e+02 7.611e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-20 16:37:07,420 INFO [train.py:996] (1/4) Epoch 5, batch 3800, loss[loss=0.2943, simple_loss=0.36, pruned_loss=0.1143, over 21583.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3298, pruned_loss=0.09759, over 4283946.66 frames. ], batch size: 389, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:37:08,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-20 16:37:10,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=754674.0, ans=0.0 2023-06-20 16:37:45,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=754794.0, ans=0.0 2023-06-20 16:37:49,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=754794.0, ans=0.125 2023-06-20 16:37:55,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=754794.0, ans=0.125 2023-06-20 16:37:57,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=754794.0, ans=0.2 2023-06-20 16:38:19,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-20 16:38:44,984 INFO [train.py:996] (1/4) Epoch 5, batch 3850, loss[loss=0.259, simple_loss=0.3139, pruned_loss=0.102, over 21470.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3305, pruned_loss=0.0989, over 4285622.56 frames. ], batch size: 389, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:39:36,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=755094.0, ans=0.2 2023-06-20 16:39:41,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=755094.0, ans=0.0 2023-06-20 16:39:52,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-20 16:40:01,850 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.004e+02 3.563e+02 4.477e+02 7.369e+02, threshold=7.126e+02, percent-clipped=2.0 2023-06-20 16:40:27,654 INFO [train.py:996] (1/4) Epoch 5, batch 3900, loss[loss=0.2539, simple_loss=0.3075, pruned_loss=0.1001, over 21653.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3249, pruned_loss=0.09839, over 4283516.50 frames. 
], batch size: 473, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:40:32,021 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-20 16:41:33,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-20 16:42:06,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=755514.0, ans=0.0 2023-06-20 16:42:11,410 INFO [train.py:996] (1/4) Epoch 5, batch 3950, loss[loss=0.192, simple_loss=0.2811, pruned_loss=0.05141, over 21759.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3259, pruned_loss=0.09703, over 4284070.33 frames. ], batch size: 282, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:42:35,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=755634.0, ans=0.125 2023-06-20 16:43:11,253 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:43:24,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.57 vs. limit=6.0 2023-06-20 16:43:33,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.896e+02 3.603e+02 4.962e+02 8.484e+02, threshold=7.206e+02, percent-clipped=4.0 2023-06-20 16:43:47,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=755814.0, ans=0.0 2023-06-20 16:43:47,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=755814.0, ans=0.125 2023-06-20 16:43:48,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=755814.0, ans=0.2 2023-06-20 16:43:52,911 INFO [train.py:996] (1/4) Epoch 5, batch 4000, loss[loss=0.2081, simple_loss=0.2683, pruned_loss=0.07397, over 21658.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3181, pruned_loss=0.093, over 4280748.39 frames. ], batch size: 299, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:44:08,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=755874.0, ans=0.0 2023-06-20 16:44:23,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=755934.0, ans=0.025 2023-06-20 16:44:42,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-20 16:45:41,082 INFO [train.py:996] (1/4) Epoch 5, batch 4050, loss[loss=0.261, simple_loss=0.3395, pruned_loss=0.09126, over 21741.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3181, pruned_loss=0.09096, over 4282041.75 frames. ], batch size: 441, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:45:48,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=756174.0, ans=0.125 2023-06-20 16:46:14,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. 
limit=12.0 2023-06-20 16:46:57,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.627e+02 3.094e+02 3.740e+02 6.411e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-20 16:47:22,987 INFO [train.py:996] (1/4) Epoch 5, batch 4100, loss[loss=0.2408, simple_loss=0.3068, pruned_loss=0.08741, over 21635.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3204, pruned_loss=0.09231, over 4293706.58 frames. ], batch size: 212, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:48:18,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-20 16:49:06,407 INFO [train.py:996] (1/4) Epoch 5, batch 4150, loss[loss=0.1944, simple_loss=0.2799, pruned_loss=0.05445, over 21237.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3239, pruned_loss=0.09068, over 4290518.57 frames. ], batch size: 159, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:49:25,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=756834.0, ans=0.0 2023-06-20 16:49:30,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=756834.0, ans=0.125 2023-06-20 16:49:39,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=756834.0, ans=0.0 2023-06-20 16:50:18,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=756954.0, ans=0.125 2023-06-20 16:50:26,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.812e+02 3.283e+02 4.437e+02 7.520e+02, threshold=6.566e+02, percent-clipped=5.0 2023-06-20 16:50:32,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=15.0 2023-06-20 16:50:55,590 INFO [train.py:996] (1/4) Epoch 5, batch 4200, loss[loss=0.2452, simple_loss=0.3285, pruned_loss=0.08098, over 21678.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3241, pruned_loss=0.08979, over 4287996.17 frames. ], batch size: 298, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:51:02,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=757074.0, ans=0.0 2023-06-20 16:51:19,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-20 16:51:42,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=757194.0, ans=0.2 2023-06-20 16:51:55,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=757254.0, ans=0.125 2023-06-20 16:52:00,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=757254.0, ans=0.125 2023-06-20 16:52:40,362 INFO [train.py:996] (1/4) Epoch 5, batch 4250, loss[loss=0.316, simple_loss=0.3845, pruned_loss=0.1237, over 21434.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3321, pruned_loss=0.09281, over 4283985.25 frames. 
], batch size: 471, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:52:40,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=757374.0, ans=0.125 2023-06-20 16:52:44,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=757374.0, ans=0.125 2023-06-20 16:53:56,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=757554.0, ans=0.125 2023-06-20 16:54:03,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.999e+02 3.547e+02 4.279e+02 1.014e+03, threshold=7.094e+02, percent-clipped=7.0 2023-06-20 16:54:22,740 INFO [train.py:996] (1/4) Epoch 5, batch 4300, loss[loss=0.2529, simple_loss=0.347, pruned_loss=0.07942, over 21738.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3368, pruned_loss=0.09364, over 4274280.15 frames. ], batch size: 351, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:55:46,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=757854.0, ans=0.2 2023-06-20 16:55:54,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=757914.0, ans=0.125 2023-06-20 16:56:17,253 INFO [train.py:996] (1/4) Epoch 5, batch 4350, loss[loss=0.2594, simple_loss=0.3482, pruned_loss=0.08526, over 21308.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3339, pruned_loss=0.09234, over 4274127.34 frames. ], batch size: 548, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:56:24,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=757974.0, ans=0.125 2023-06-20 16:56:29,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-20 16:57:04,426 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:57:31,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.864e+02 3.148e+02 3.712e+02 7.836e+02, threshold=6.297e+02, percent-clipped=1.0 2023-06-20 16:57:57,314 INFO [train.py:996] (1/4) Epoch 5, batch 4400, loss[loss=0.2719, simple_loss=0.3486, pruned_loss=0.09765, over 20009.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3286, pruned_loss=0.09108, over 4269990.08 frames. ], batch size: 702, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:57:57,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=758274.0, ans=0.125 2023-06-20 16:58:34,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=758334.0, ans=0.125 2023-06-20 16:59:41,585 INFO [train.py:996] (1/4) Epoch 5, batch 4450, loss[loss=0.3473, simple_loss=0.4314, pruned_loss=0.1316, over 21552.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3369, pruned_loss=0.09365, over 4268409.68 frames. ], batch size: 471, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:00:22,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=758634.0, ans=0.125 2023-06-20 17:00:39,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. 
limit=15.0 2023-06-20 17:00:51,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=758754.0, ans=0.0 2023-06-20 17:00:56,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=758754.0, ans=0.125 2023-06-20 17:00:58,685 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:01:08,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.910e+02 3.386e+02 4.171e+02 6.417e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-20 17:01:32,513 INFO [train.py:996] (1/4) Epoch 5, batch 4500, loss[loss=0.2492, simple_loss=0.3304, pruned_loss=0.08395, over 21524.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3394, pruned_loss=0.09678, over 4276170.77 frames. ], batch size: 194, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:02:12,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=758994.0, ans=0.0 2023-06-20 17:03:17,245 INFO [train.py:996] (1/4) Epoch 5, batch 4550, loss[loss=0.2583, simple_loss=0.3342, pruned_loss=0.0912, over 21250.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3443, pruned_loss=0.09769, over 4278100.64 frames. ], batch size: 159, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:03:23,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=759174.0, ans=0.0 2023-06-20 17:03:24,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=759174.0, ans=0.0 2023-06-20 17:03:27,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=759174.0, ans=0.125 2023-06-20 17:03:58,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=759294.0, ans=0.035 2023-06-20 17:04:26,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=759354.0, ans=0.125 2023-06-20 17:04:34,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.512e+02 3.034e+02 3.831e+02 5.015e+02 1.154e+03, threshold=7.663e+02, percent-clipped=6.0 2023-06-20 17:04:52,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=759414.0, ans=0.1 2023-06-20 17:05:00,157 INFO [train.py:996] (1/4) Epoch 5, batch 4600, loss[loss=0.2212, simple_loss=0.2965, pruned_loss=0.07297, over 21414.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3465, pruned_loss=0.09948, over 4275554.74 frames. 
], batch size: 194, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 17:05:21,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=759534.0, ans=0.07 2023-06-20 17:05:23,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=759534.0, ans=0.1 2023-06-20 17:05:36,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=759594.0, ans=0.0 2023-06-20 17:06:07,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=759654.0, ans=0.125 2023-06-20 17:06:36,964 INFO [train.py:996] (1/4) Epoch 5, batch 4650, loss[loss=0.2145, simple_loss=0.2844, pruned_loss=0.07231, over 21772.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3394, pruned_loss=0.09674, over 4280949.62 frames. ], batch size: 351, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:06:39,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=759774.0, ans=0.04949747468305833 2023-06-20 17:07:57,669 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.721e+02 3.118e+02 3.617e+02 7.093e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-20 17:08:05,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=760014.0, ans=6.0 2023-06-20 17:08:14,585 INFO [train.py:996] (1/4) Epoch 5, batch 4700, loss[loss=0.2311, simple_loss=0.2863, pruned_loss=0.08791, over 21805.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3292, pruned_loss=0.09458, over 4283996.70 frames. ], batch size: 124, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:09:11,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-20 17:09:42,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=760314.0, ans=0.0 2023-06-20 17:09:56,835 INFO [train.py:996] (1/4) Epoch 5, batch 4750, loss[loss=0.2633, simple_loss=0.3244, pruned_loss=0.1011, over 21946.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3239, pruned_loss=0.09381, over 4286250.47 frames. 
], batch size: 416, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:09:58,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=760374.0, ans=0.0 2023-06-20 17:10:13,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760434.0, ans=0.1 2023-06-20 17:10:15,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=760434.0, ans=0.125 2023-06-20 17:10:23,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=760434.0, ans=0.125 2023-06-20 17:10:50,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=760554.0, ans=0.125 2023-06-20 17:11:03,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=760554.0, ans=0.5 2023-06-20 17:11:15,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=760554.0, ans=0.2 2023-06-20 17:11:18,303 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 2.865e+02 3.322e+02 3.733e+02 5.818e+02, threshold=6.645e+02, percent-clipped=0.0 2023-06-20 17:11:23,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=760614.0, ans=0.95 2023-06-20 17:11:34,489 INFO [train.py:996] (1/4) Epoch 5, batch 4800, loss[loss=0.2404, simple_loss=0.3201, pruned_loss=0.08033, over 21283.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3255, pruned_loss=0.09526, over 4285514.14 frames. ], batch size: 176, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:11:49,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=760734.0, ans=0.125 2023-06-20 17:12:27,921 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:12:43,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=15.0 2023-06-20 17:13:06,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.64 vs. limit=5.0 2023-06-20 17:13:15,567 INFO [train.py:996] (1/4) Epoch 5, batch 4850, loss[loss=0.253, simple_loss=0.3036, pruned_loss=0.1011, over 21410.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3236, pruned_loss=0.0944, over 4275722.46 frames. ], batch size: 131, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:13:33,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-20 17:13:34,850 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.42 vs. 
limit=15.0 2023-06-20 17:13:38,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=761034.0, ans=0.0 2023-06-20 17:13:43,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=761034.0, ans=0.0 2023-06-20 17:14:09,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-20 17:14:41,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.733e+02 3.099e+02 3.561e+02 5.577e+02, threshold=6.198e+02, percent-clipped=0.0 2023-06-20 17:14:44,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=761214.0, ans=0.0 2023-06-20 17:14:50,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=761214.0, ans=0.2 2023-06-20 17:14:58,669 INFO [train.py:996] (1/4) Epoch 5, batch 4900, loss[loss=0.2668, simple_loss=0.3293, pruned_loss=0.1021, over 21797.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3251, pruned_loss=0.09562, over 4280305.91 frames. ], batch size: 124, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:14:59,047 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:15:53,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=761394.0, ans=0.125 2023-06-20 17:16:29,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=761514.0, ans=0.0 2023-06-20 17:16:30,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=761514.0, ans=0.025 2023-06-20 17:16:40,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=761574.0, ans=0.5 2023-06-20 17:16:41,587 INFO [train.py:996] (1/4) Epoch 5, batch 4950, loss[loss=0.2412, simple_loss=0.3177, pruned_loss=0.08233, over 20775.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3262, pruned_loss=0.09325, over 4281656.72 frames. ], batch size: 607, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:16:43,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=761574.0, ans=0.125 2023-06-20 17:17:06,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=761634.0, ans=0.0 2023-06-20 17:17:27,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-20 17:17:57,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-20 17:17:58,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=761754.0, ans=10.0 2023-06-20 17:18:02,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.74 vs. 
limit=22.5 2023-06-20 17:18:08,216 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.799e+02 3.225e+02 3.689e+02 6.231e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-20 17:18:22,778 INFO [train.py:996] (1/4) Epoch 5, batch 5000, loss[loss=0.2393, simple_loss=0.3136, pruned_loss=0.08251, over 21651.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3239, pruned_loss=0.08953, over 4285794.83 frames. ], batch size: 263, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:18:42,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=761934.0, ans=10.0 2023-06-20 17:20:03,387 INFO [train.py:996] (1/4) Epoch 5, batch 5050, loss[loss=0.263, simple_loss=0.3277, pruned_loss=0.09915, over 21546.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3258, pruned_loss=0.09091, over 4285393.14 frames. ], batch size: 194, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:20:26,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=762234.0, ans=0.0 2023-06-20 17:21:26,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=762354.0, ans=0.04949747468305833 2023-06-20 17:21:31,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.896e+02 3.588e+02 4.285e+02 7.263e+02, threshold=7.176e+02, percent-clipped=2.0 2023-06-20 17:21:45,589 INFO [train.py:996] (1/4) Epoch 5, batch 5100, loss[loss=0.2315, simple_loss=0.2968, pruned_loss=0.08308, over 21649.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3254, pruned_loss=0.09156, over 4290987.87 frames. ], batch size: 263, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:21:47,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=762474.0, ans=0.125 2023-06-20 17:22:02,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=762534.0, ans=0.1 2023-06-20 17:22:36,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-20 17:23:29,408 INFO [train.py:996] (1/4) Epoch 5, batch 5150, loss[loss=0.2574, simple_loss=0.3237, pruned_loss=0.09551, over 21845.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3245, pruned_loss=0.09236, over 4286024.04 frames. ], batch size: 124, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:24:51,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=762954.0, ans=0.0 2023-06-20 17:24:57,421 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.960e+02 3.348e+02 3.858e+02 5.752e+02, threshold=6.696e+02, percent-clipped=0.0 2023-06-20 17:25:13,085 INFO [train.py:996] (1/4) Epoch 5, batch 5200, loss[loss=0.2053, simple_loss=0.2821, pruned_loss=0.06431, over 21367.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3277, pruned_loss=0.09358, over 4286578.54 frames. ], batch size: 194, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:25:28,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=763074.0, ans=0.0 2023-06-20 17:25:39,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. 
limit=15.0 2023-06-20 17:25:52,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-20 17:25:56,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763194.0, ans=0.1 2023-06-20 17:26:21,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-20 17:26:42,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=763314.0, ans=0.1 2023-06-20 17:26:47,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=763314.0, ans=0.0 2023-06-20 17:26:54,727 INFO [train.py:996] (1/4) Epoch 5, batch 5250, loss[loss=0.2448, simple_loss=0.3107, pruned_loss=0.0894, over 21584.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3299, pruned_loss=0.09185, over 4284271.92 frames. ], batch size: 548, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:27:05,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=763374.0, ans=0.2 2023-06-20 17:27:23,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-20 17:27:55,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=763494.0, ans=0.125 2023-06-20 17:28:06,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=763554.0, ans=0.0 2023-06-20 17:28:11,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-20 17:28:15,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=763554.0, ans=0.0 2023-06-20 17:28:21,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.952e+02 3.364e+02 4.524e+02 6.907e+02, threshold=6.729e+02, percent-clipped=2.0 2023-06-20 17:28:36,588 INFO [train.py:996] (1/4) Epoch 5, batch 5300, loss[loss=0.2528, simple_loss=0.3185, pruned_loss=0.09353, over 21870.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3276, pruned_loss=0.09234, over 4286368.28 frames. ], batch size: 351, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:28:51,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=763674.0, ans=0.0 2023-06-20 17:29:07,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=763734.0, ans=0.05 2023-06-20 17:29:53,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=763854.0, ans=0.0 2023-06-20 17:30:14,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=763914.0, ans=0.125 2023-06-20 17:30:22,050 INFO [train.py:996] (1/4) Epoch 5, batch 5350, loss[loss=0.2532, simple_loss=0.3147, pruned_loss=0.09585, over 21272.00 frames. 
], tot_loss[loss=0.257, simple_loss=0.3264, pruned_loss=0.09383, over 4284967.05 frames. ], batch size: 176, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:30:25,365 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:31:29,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764154.0, ans=0.125 2023-06-20 17:31:44,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.105e+02 3.554e+02 4.280e+02 7.043e+02, threshold=7.109e+02, percent-clipped=1.0 2023-06-20 17:31:57,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=764214.0, ans=0.2 2023-06-20 17:32:03,829 INFO [train.py:996] (1/4) Epoch 5, batch 5400, loss[loss=0.2235, simple_loss=0.2968, pruned_loss=0.07511, over 21786.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3267, pruned_loss=0.0942, over 4289942.22 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:32:06,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=764274.0, ans=0.125 2023-06-20 17:32:06,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-20 17:32:12,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=764274.0, ans=0.95 2023-06-20 17:32:22,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=764274.0, ans=0.125 2023-06-20 17:32:24,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=764334.0, ans=0.035 2023-06-20 17:33:10,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=764454.0, ans=0.2 2023-06-20 17:33:21,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=764454.0, ans=0.125 2023-06-20 17:33:44,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=764574.0, ans=0.0 2023-06-20 17:33:44,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=764574.0, ans=0.0 2023-06-20 17:33:45,579 INFO [train.py:996] (1/4) Epoch 5, batch 5450, loss[loss=0.286, simple_loss=0.3773, pruned_loss=0.09733, over 21654.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3298, pruned_loss=0.09301, over 4283080.61 frames. ], batch size: 389, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:34:03,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=764574.0, ans=0.125 2023-06-20 17:34:05,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=12.0 2023-06-20 17:34:54,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=764754.0, ans=0.09899494936611666 2023-06-20 17:35:13,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.553e+02 3.012e+02 3.713e+02 8.478e+02, threshold=6.025e+02, percent-clipped=4.0 2023-06-20 17:35:23,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.17 vs. limit=15.0 2023-06-20 17:35:34,654 INFO [train.py:996] (1/4) Epoch 5, batch 5500, loss[loss=0.1979, simple_loss=0.2924, pruned_loss=0.05176, over 21629.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3294, pruned_loss=0.08887, over 4273058.17 frames. ], batch size: 247, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:35:56,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-20 17:35:58,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.51 vs. limit=22.5 2023-06-20 17:36:41,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=12.0 2023-06-20 17:36:56,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-20 17:37:07,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=765114.0, ans=0.125 2023-06-20 17:37:11,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=765114.0, ans=0.125 2023-06-20 17:37:17,292 INFO [train.py:996] (1/4) Epoch 5, batch 5550, loss[loss=0.2062, simple_loss=0.2983, pruned_loss=0.05704, over 21774.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3283, pruned_loss=0.08527, over 4273212.56 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:38:48,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.754e+02 3.445e+02 4.644e+02 7.344e+02, threshold=6.889e+02, percent-clipped=6.0 2023-06-20 17:39:05,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=765414.0, ans=0.0 2023-06-20 17:39:13,789 INFO [train.py:996] (1/4) Epoch 5, batch 5600, loss[loss=0.2429, simple_loss=0.3371, pruned_loss=0.07428, over 21812.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3292, pruned_loss=0.084, over 4269939.50 frames. ], batch size: 316, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:39:51,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=765594.0, ans=0.0 2023-06-20 17:40:14,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=765654.0, ans=0.125 2023-06-20 17:40:23,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=765654.0, ans=0.125 2023-06-20 17:40:55,601 INFO [train.py:996] (1/4) Epoch 5, batch 5650, loss[loss=0.2846, simple_loss=0.3427, pruned_loss=0.1132, over 21870.00 frames. 
], tot_loss[loss=0.2533, simple_loss=0.333, pruned_loss=0.0868, over 4279175.25 frames. ], batch size: 414, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:40:56,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-20 17:41:00,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=765774.0, ans=0.125 2023-06-20 17:41:09,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-20 17:41:10,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=765834.0, ans=0.04949747468305833 2023-06-20 17:41:17,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=765834.0, ans=0.1 2023-06-20 17:41:18,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=765834.0, ans=0.0 2023-06-20 17:41:26,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=765834.0, ans=0.0 2023-06-20 17:41:44,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-20 17:41:44,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=765894.0, ans=0.125 2023-06-20 17:42:00,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=765954.0, ans=0.125 2023-06-20 17:42:17,846 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 3.200e+02 3.767e+02 5.001e+02 8.912e+02, threshold=7.534e+02, percent-clipped=5.0 2023-06-20 17:42:34,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=766014.0, ans=0.2 2023-06-20 17:42:38,877 INFO [train.py:996] (1/4) Epoch 5, batch 5700, loss[loss=0.2202, simple_loss=0.3087, pruned_loss=0.06589, over 21775.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3319, pruned_loss=0.08848, over 4282216.72 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:42:40,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=766074.0, ans=0.04949747468305833 2023-06-20 17:42:54,491 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:43:01,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=766134.0, ans=0.1 2023-06-20 17:43:41,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=766194.0, ans=0.125 2023-06-20 17:44:16,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=766314.0, ans=0.0 2023-06-20 17:44:28,595 INFO [train.py:996] (1/4) Epoch 5, batch 5750, loss[loss=0.2815, simple_loss=0.3565, pruned_loss=0.1032, over 21480.00 frames. 
], tot_loss[loss=0.2491, simple_loss=0.3278, pruned_loss=0.08526, over 4271368.69 frames. ], batch size: 508, lr: 6.46e-03, grad_scale: 16.0 2023-06-20 17:44:40,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=766374.0, ans=10.0 2023-06-20 17:45:11,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-20 17:45:13,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=766494.0, ans=0.125 2023-06-20 17:45:13,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=766494.0, ans=0.125 2023-06-20 17:45:53,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.737e+02 3.307e+02 4.353e+02 7.537e+02, threshold=6.613e+02, percent-clipped=1.0 2023-06-20 17:46:11,476 INFO [train.py:996] (1/4) Epoch 5, batch 5800, loss[loss=0.2358, simple_loss=0.3184, pruned_loss=0.07658, over 21704.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3271, pruned_loss=0.0842, over 4273544.96 frames. ], batch size: 247, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:46:33,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=766734.0, ans=0.125 2023-06-20 17:46:33,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=766734.0, ans=0.125 2023-06-20 17:47:23,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-20 17:47:54,113 INFO [train.py:996] (1/4) Epoch 5, batch 5850, loss[loss=0.1953, simple_loss=0.2915, pruned_loss=0.04958, over 21696.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3231, pruned_loss=0.07932, over 4270734.15 frames. ], batch size: 247, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:48:24,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-20 17:49:21,575 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.196e+02 2.438e+02 2.861e+02 4.189e+02, threshold=4.877e+02, percent-clipped=0.0 2023-06-20 17:49:21,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=767214.0, ans=0.125 2023-06-20 17:49:32,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=767274.0, ans=0.125 2023-06-20 17:49:34,133 INFO [train.py:996] (1/4) Epoch 5, batch 5900, loss[loss=0.1994, simple_loss=0.2742, pruned_loss=0.06231, over 21803.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3168, pruned_loss=0.07398, over 4278075.81 frames. 
], batch size: 298, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:49:56,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=767274.0, ans=0.125 2023-06-20 17:49:56,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=767274.0, ans=0.125 2023-06-20 17:50:42,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=767454.0, ans=0.125 2023-06-20 17:50:42,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=767454.0, ans=0.125 2023-06-20 17:50:59,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=767514.0, ans=0.2 2023-06-20 17:51:14,334 INFO [train.py:996] (1/4) Epoch 5, batch 5950, loss[loss=0.2219, simple_loss=0.2772, pruned_loss=0.08325, over 21588.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3168, pruned_loss=0.07746, over 4280097.52 frames. ], batch size: 230, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:52:42,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.088e+02 3.712e+02 4.428e+02 7.411e+02, threshold=7.424e+02, percent-clipped=12.0 2023-06-20 17:52:48,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-20 17:53:01,225 INFO [train.py:996] (1/4) Epoch 5, batch 6000, loss[loss=0.2152, simple_loss=0.2734, pruned_loss=0.07844, over 21633.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3135, pruned_loss=0.08117, over 4280368.44 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 32.0 2023-06-20 17:53:01,226 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 17:53:19,507 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2687, simple_loss=0.3621, pruned_loss=0.08766, over 1796401.00 frames. 2023-06-20 17:53:19,508 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 17:53:34,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-20 17:53:48,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=767934.0, ans=0.0 2023-06-20 17:53:52,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=767934.0, ans=0.125 2023-06-20 17:55:11,120 INFO [train.py:996] (1/4) Epoch 5, batch 6050, loss[loss=0.251, simple_loss=0.3066, pruned_loss=0.09771, over 21849.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3083, pruned_loss=0.08305, over 4278619.29 frames. ], batch size: 98, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:56:28,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=768414.0, ans=0.0 2023-06-20 17:56:30,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.566e+02 3.006e+02 3.910e+02 6.691e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-20 17:56:46,408 INFO [train.py:996] (1/4) Epoch 5, batch 6100, loss[loss=0.1864, simple_loss=0.2826, pruned_loss=0.04507, over 21691.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3052, pruned_loss=0.0811, over 4281448.49 frames. 
], batch size: 298, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:57:08,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-20 17:58:08,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-20 17:58:20,837 INFO [train.py:996] (1/4) Epoch 5, batch 6150, loss[loss=0.212, simple_loss=0.2814, pruned_loss=0.0713, over 21709.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3101, pruned_loss=0.08375, over 4282823.76 frames. ], batch size: 247, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:58:49,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=768834.0, ans=0.0 2023-06-20 17:59:41,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=768954.0, ans=0.0 2023-06-20 17:59:51,304 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.776e+02 3.188e+02 3.842e+02 5.972e+02, threshold=6.377e+02, percent-clipped=0.0 2023-06-20 18:00:08,561 INFO [train.py:996] (1/4) Epoch 5, batch 6200, loss[loss=0.2308, simple_loss=0.2975, pruned_loss=0.08208, over 21838.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3108, pruned_loss=0.08318, over 4281688.76 frames. ], batch size: 107, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:00:45,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-20 18:00:56,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-20 18:01:25,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=769314.0, ans=0.0 2023-06-20 18:01:38,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=769314.0, ans=0.125 2023-06-20 18:01:41,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=769314.0, ans=0.04949747468305833 2023-06-20 18:01:47,592 INFO [train.py:996] (1/4) Epoch 5, batch 6250, loss[loss=0.286, simple_loss=0.3752, pruned_loss=0.09834, over 21635.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3178, pruned_loss=0.08399, over 4280159.85 frames. 
], batch size: 441, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:01:47,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=769374.0, ans=0.125 2023-06-20 18:01:59,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=769374.0, ans=0.125 2023-06-20 18:03:04,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=769554.0, ans=0.2 2023-06-20 18:03:05,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=769614.0, ans=0.125 2023-06-20 18:03:11,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.664e+02 3.120e+02 3.845e+02 7.013e+02, threshold=6.240e+02, percent-clipped=3.0 2023-06-20 18:03:28,144 INFO [train.py:996] (1/4) Epoch 5, batch 6300, loss[loss=0.2469, simple_loss=0.3187, pruned_loss=0.0876, over 21777.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3221, pruned_loss=0.08371, over 4279720.85 frames. ], batch size: 441, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:03:30,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=769674.0, ans=0.2 2023-06-20 18:03:56,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=769734.0, ans=0.125 2023-06-20 18:04:04,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=769794.0, ans=0.0 2023-06-20 18:04:12,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-20 18:04:53,276 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:05:06,341 INFO [train.py:996] (1/4) Epoch 5, batch 6350, loss[loss=0.2833, simple_loss=0.3445, pruned_loss=0.111, over 21949.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3259, pruned_loss=0.08889, over 4282858.48 frames. ], batch size: 316, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:05:15,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=769974.0, ans=0.125 2023-06-20 18:05:26,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=770034.0, ans=0.0 2023-06-20 18:05:55,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=770094.0, ans=0.5 2023-06-20 18:06:17,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=770154.0, ans=0.125 2023-06-20 18:06:17,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=770154.0, ans=0.125 2023-06-20 18:06:34,760 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.998e+02 3.510e+02 4.011e+02 7.678e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-20 18:06:51,033 INFO [train.py:996] (1/4) Epoch 5, batch 6400, loss[loss=0.272, simple_loss=0.3317, pruned_loss=0.1062, over 21473.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3315, pruned_loss=0.09318, over 4281839.48 frames. 
], batch size: 194, lr: 6.44e-03, grad_scale: 32.0 2023-06-20 18:07:33,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=770394.0, ans=0.125 2023-06-20 18:08:33,228 INFO [train.py:996] (1/4) Epoch 5, batch 6450, loss[loss=0.2109, simple_loss=0.287, pruned_loss=0.06739, over 21729.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3342, pruned_loss=0.09267, over 4279770.67 frames. ], batch size: 112, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:09:37,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=770754.0, ans=0.125 2023-06-20 18:10:05,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.819e+02 3.379e+02 4.009e+02 7.496e+02, threshold=6.759e+02, percent-clipped=3.0 2023-06-20 18:10:15,966 INFO [train.py:996] (1/4) Epoch 5, batch 6500, loss[loss=0.2284, simple_loss=0.305, pruned_loss=0.07588, over 21514.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3269, pruned_loss=0.09126, over 4271223.04 frames. ], batch size: 389, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:10:16,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=770874.0, ans=0.2 2023-06-20 18:10:40,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=770934.0, ans=0.0 2023-06-20 18:11:48,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=771114.0, ans=0.125 2023-06-20 18:11:53,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=12.0 2023-06-20 18:12:02,371 INFO [train.py:996] (1/4) Epoch 5, batch 6550, loss[loss=0.2576, simple_loss=0.3258, pruned_loss=0.09466, over 21632.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3263, pruned_loss=0.09099, over 4269227.15 frames. ], batch size: 441, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:12:44,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=771294.0, ans=0.125 2023-06-20 18:13:05,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=771354.0, ans=0.125 2023-06-20 18:13:26,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=771414.0, ans=0.035 2023-06-20 18:13:31,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.772e+02 3.430e+02 4.140e+02 7.576e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 18:13:49,223 INFO [train.py:996] (1/4) Epoch 5, batch 6600, loss[loss=0.2135, simple_loss=0.2697, pruned_loss=0.07869, over 21373.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3218, pruned_loss=0.09012, over 4262506.22 frames. ], batch size: 144, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:14:42,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=771654.0, ans=0.1 2023-06-20 18:15:31,866 INFO [train.py:996] (1/4) Epoch 5, batch 6650, loss[loss=0.2087, simple_loss=0.3422, pruned_loss=0.03761, over 20774.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3163, pruned_loss=0.08698, over 4266748.34 frames. 
], batch size: 607, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:15:32,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=12.0 2023-06-20 18:15:59,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=771834.0, ans=0.125 2023-06-20 18:16:06,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=771834.0, ans=0.125 2023-06-20 18:16:10,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=771894.0, ans=0.0 2023-06-20 18:16:30,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0 2023-06-20 18:17:04,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.627e+02 3.141e+02 4.437e+02 8.167e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-20 18:17:12,606 INFO [train.py:996] (1/4) Epoch 5, batch 6700, loss[loss=0.2403, simple_loss=0.3111, pruned_loss=0.08473, over 21794.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3124, pruned_loss=0.08635, over 4266337.55 frames. ], batch size: 352, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:17:18,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-20 18:18:02,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772194.0, ans=0.1 2023-06-20 18:18:52,281 INFO [train.py:996] (1/4) Epoch 5, batch 6750, loss[loss=0.2067, simple_loss=0.2783, pruned_loss=0.06759, over 21639.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3094, pruned_loss=0.08696, over 4264530.51 frames. ], batch size: 298, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:18:59,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=772374.0, ans=0.07 2023-06-20 18:19:37,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=772494.0, ans=0.125 2023-06-20 18:19:52,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=772554.0, ans=0.95 2023-06-20 18:19:59,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=772554.0, ans=0.035 2023-06-20 18:20:15,316 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 2.981e+02 3.524e+02 4.393e+02 7.808e+02, threshold=7.048e+02, percent-clipped=4.0 2023-06-20 18:20:33,901 INFO [train.py:996] (1/4) Epoch 5, batch 6800, loss[loss=0.2452, simple_loss=0.3014, pruned_loss=0.09451, over 21160.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3105, pruned_loss=0.08999, over 4276450.64 frames. 
], batch size: 159, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:20:34,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=772674.0, ans=0.0 2023-06-20 18:20:45,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=772674.0, ans=0.125 2023-06-20 18:22:04,504 INFO [train.py:996] (1/4) Epoch 5, batch 6850, loss[loss=0.259, simple_loss=0.3194, pruned_loss=0.09935, over 21459.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3093, pruned_loss=0.09137, over 4273984.63 frames. ], batch size: 131, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:22:40,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=773034.0, ans=0.125 2023-06-20 18:22:41,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-20 18:23:26,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-20 18:23:35,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.843e+02 3.245e+02 3.952e+02 6.473e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-20 18:23:53,367 INFO [train.py:996] (1/4) Epoch 5, batch 6900, loss[loss=0.2331, simple_loss=0.3222, pruned_loss=0.07198, over 21691.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3105, pruned_loss=0.09116, over 4274931.24 frames. ], batch size: 389, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:24:18,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=773334.0, ans=0.125 2023-06-20 18:24:38,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-20 18:25:37,564 INFO [train.py:996] (1/4) Epoch 5, batch 6950, loss[loss=0.2387, simple_loss=0.3031, pruned_loss=0.08713, over 20157.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3113, pruned_loss=0.08878, over 4274201.24 frames. ], batch size: 702, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:25:59,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773634.0, ans=0.0 2023-06-20 18:26:19,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=773694.0, ans=0.125 2023-06-20 18:27:11,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.901e+02 2.952e+02 3.293e+02 4.286e+02 8.056e+02, threshold=6.585e+02, percent-clipped=5.0 2023-06-20 18:27:12,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=8.0 2023-06-20 18:27:19,071 INFO [train.py:996] (1/4) Epoch 5, batch 7000, loss[loss=0.2279, simple_loss=0.2846, pruned_loss=0.08563, over 21624.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3142, pruned_loss=0.0914, over 4278454.45 frames. 
], batch size: 247, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:27:55,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=773934.0, ans=0.125 2023-06-20 18:28:36,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.51 vs. limit=6.0 2023-06-20 18:28:55,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=774114.0, ans=0.0 2023-06-20 18:28:58,955 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:29:08,814 INFO [train.py:996] (1/4) Epoch 5, batch 7050, loss[loss=0.2342, simple_loss=0.318, pruned_loss=0.07524, over 21778.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3121, pruned_loss=0.08953, over 4274392.79 frames. ], batch size: 332, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:29:41,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=774294.0, ans=0.0 2023-06-20 18:30:22,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=774354.0, ans=0.0 2023-06-20 18:30:44,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.864e+02 3.382e+02 4.312e+02 8.915e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 18:30:52,663 INFO [train.py:996] (1/4) Epoch 5, batch 7100, loss[loss=0.2152, simple_loss=0.3056, pruned_loss=0.06241, over 21703.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3159, pruned_loss=0.08985, over 4271747.28 frames. ], batch size: 415, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:31:09,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=774534.0, ans=0.1 2023-06-20 18:31:15,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=774534.0, ans=0.2 2023-06-20 18:31:44,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=774594.0, ans=0.125 2023-06-20 18:31:57,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=774654.0, ans=0.125 2023-06-20 18:32:08,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=774654.0, ans=0.125 2023-06-20 18:32:22,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-20 18:32:24,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=774714.0, ans=0.95 2023-06-20 18:32:34,547 INFO [train.py:996] (1/4) Epoch 5, batch 7150, loss[loss=0.2545, simple_loss=0.322, pruned_loss=0.09353, over 21591.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3159, pruned_loss=0.08899, over 4270869.23 frames. 
], batch size: 263, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:32:53,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=774834.0, ans=0.09899494936611666 2023-06-20 18:34:08,267 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.836e+02 3.355e+02 3.927e+02 6.037e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-20 18:34:16,151 INFO [train.py:996] (1/4) Epoch 5, batch 7200, loss[loss=0.2613, simple_loss=0.3196, pruned_loss=0.1015, over 21326.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3199, pruned_loss=0.09138, over 4266699.20 frames. ], batch size: 144, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:34:23,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-20 18:34:28,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-20 18:35:17,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-20 18:35:34,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=775254.0, ans=0.0 2023-06-20 18:35:58,229 INFO [train.py:996] (1/4) Epoch 5, batch 7250, loss[loss=0.2158, simple_loss=0.2777, pruned_loss=0.07695, over 21464.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3151, pruned_loss=0.09066, over 4259741.99 frames. ], batch size: 132, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:36:06,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=775374.0, ans=0.1 2023-06-20 18:37:30,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=775614.0, ans=0.1 2023-06-20 18:37:30,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=775614.0, ans=0.125 2023-06-20 18:37:32,030 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 2.761e+02 3.389e+02 4.055e+02 6.932e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-20 18:37:40,432 INFO [train.py:996] (1/4) Epoch 5, batch 7300, loss[loss=0.2327, simple_loss=0.288, pruned_loss=0.08864, over 21973.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.308, pruned_loss=0.08868, over 4262610.50 frames. ], batch size: 113, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:37:42,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=775674.0, ans=0.125 2023-06-20 18:37:57,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=775674.0, ans=0.0 2023-06-20 18:38:36,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775794.0, ans=0.125 2023-06-20 18:38:50,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=775854.0, ans=0.125 2023-06-20 18:39:25,162 INFO [train.py:996] (1/4) Epoch 5, batch 7350, loss[loss=0.2922, simple_loss=0.3436, pruned_loss=0.1205, over 21367.00 frames. 
], tot_loss[loss=0.2419, simple_loss=0.3049, pruned_loss=0.08945, over 4269589.67 frames. ], batch size: 176, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:39:39,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-20 18:40:08,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=776034.0, ans=0.125 2023-06-20 18:40:35,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=776154.0, ans=0.2 2023-06-20 18:40:48,632 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:41:01,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.040e+02 3.793e+02 4.434e+02 6.655e+02, threshold=7.586e+02, percent-clipped=0.0 2023-06-20 18:41:04,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=776214.0, ans=0.0 2023-06-20 18:41:09,279 INFO [train.py:996] (1/4) Epoch 5, batch 7400, loss[loss=0.2325, simple_loss=0.3074, pruned_loss=0.07879, over 21562.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3108, pruned_loss=0.09271, over 4269568.41 frames. ], batch size: 230, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:41:42,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=776334.0, ans=0.125 2023-06-20 18:42:19,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=776454.0, ans=0.125 2023-06-20 18:42:41,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=776514.0, ans=0.125 2023-06-20 18:42:58,389 INFO [train.py:996] (1/4) Epoch 5, batch 7450, loss[loss=0.2149, simple_loss=0.2811, pruned_loss=0.07432, over 21661.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3099, pruned_loss=0.09113, over 4248994.33 frames. ], batch size: 282, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:44:18,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=776754.0, ans=0.125 2023-06-20 18:44:28,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=776814.0, ans=0.125 2023-06-20 18:44:34,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 2.956e+02 3.297e+02 4.311e+02 7.109e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-20 18:44:35,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-20 18:44:48,778 INFO [train.py:996] (1/4) Epoch 5, batch 7500, loss[loss=0.3144, simple_loss=0.4019, pruned_loss=0.1135, over 21733.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3163, pruned_loss=0.0932, over 4255085.77 frames. ], batch size: 351, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:45:28,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=776934.0, ans=0.0 2023-06-20 18:45:58,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-06-20 18:46:23,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-20 18:46:33,801 INFO [train.py:996] (1/4) Epoch 5, batch 7550, loss[loss=0.2088, simple_loss=0.2866, pruned_loss=0.06552, over 21166.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3241, pruned_loss=0.09175, over 4263527.07 frames. ], batch size: 143, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:47:15,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=777294.0, ans=0.2 2023-06-20 18:47:40,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=777354.0, ans=0.125 2023-06-20 18:47:48,691 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:48:01,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.804e+02 3.228e+02 4.067e+02 8.299e+02, threshold=6.455e+02, percent-clipped=1.0 2023-06-20 18:48:14,639 INFO [train.py:996] (1/4) Epoch 5, batch 7600, loss[loss=0.2584, simple_loss=0.3188, pruned_loss=0.09898, over 21566.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3249, pruned_loss=0.09059, over 4264211.31 frames. ], batch size: 548, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:48:58,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-20 18:49:04,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=777594.0, ans=0.0 2023-06-20 18:49:33,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=777714.0, ans=0.07 2023-06-20 18:50:01,075 INFO [train.py:996] (1/4) Epoch 5, batch 7650, loss[loss=0.2313, simple_loss=0.2941, pruned_loss=0.08427, over 21583.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3223, pruned_loss=0.09195, over 4273525.36 frames. ], batch size: 212, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:50:16,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=777834.0, ans=0.2 2023-06-20 18:50:28,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=777834.0, ans=0.125 2023-06-20 18:51:36,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 3.076e+02 3.625e+02 4.419e+02 8.627e+02, threshold=7.249e+02, percent-clipped=2.0 2023-06-20 18:51:44,637 INFO [train.py:996] (1/4) Epoch 5, batch 7700, loss[loss=0.2597, simple_loss=0.3301, pruned_loss=0.0946, over 21869.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3264, pruned_loss=0.09627, over 4282986.06 frames. ], batch size: 371, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:51:45,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.08 vs. 
limit=15.0 2023-06-20 18:52:26,718 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:53:12,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=778314.0, ans=0.015 2023-06-20 18:53:35,122 INFO [train.py:996] (1/4) Epoch 5, batch 7750, loss[loss=0.2472, simple_loss=0.3194, pruned_loss=0.08752, over 21102.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.332, pruned_loss=0.09552, over 4279610.61 frames. ], batch size: 143, lr: 6.41e-03, grad_scale: 16.0 2023-06-20 18:53:35,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=778374.0, ans=0.04949747468305833 2023-06-20 18:53:45,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=778374.0, ans=0.125 2023-06-20 18:53:47,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.54 vs. limit=10.0 2023-06-20 18:54:06,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778434.0, ans=0.1 2023-06-20 18:55:04,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=778614.0, ans=0.125 2023-06-20 18:55:13,709 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 2.997e+02 3.414e+02 4.185e+02 6.372e+02, threshold=6.827e+02, percent-clipped=0.0 2023-06-20 18:55:19,846 INFO [train.py:996] (1/4) Epoch 5, batch 7800, loss[loss=0.2273, simple_loss=0.279, pruned_loss=0.08781, over 21250.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3336, pruned_loss=0.09551, over 4275772.19 frames. ], batch size: 176, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:56:00,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=778794.0, ans=0.0 2023-06-20 18:56:50,962 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:56:55,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=778914.0, ans=0.1 2023-06-20 18:57:03,281 INFO [train.py:996] (1/4) Epoch 5, batch 7850, loss[loss=0.2333, simple_loss=0.2851, pruned_loss=0.09074, over 21944.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3264, pruned_loss=0.09489, over 4280528.39 frames. ], batch size: 113, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:57:24,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.01 vs. limit=22.5 2023-06-20 18:57:36,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=779034.0, ans=0.2 2023-06-20 18:57:45,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.50 vs. 
limit=15.0 2023-06-20 18:58:04,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=779154.0, ans=0.0 2023-06-20 18:58:41,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.807e+02 3.191e+02 4.000e+02 6.084e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-20 18:58:48,746 INFO [train.py:996] (1/4) Epoch 5, batch 7900, loss[loss=0.2788, simple_loss=0.3719, pruned_loss=0.0928, over 21754.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.323, pruned_loss=0.09456, over 4273770.76 frames. ], batch size: 351, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:58:57,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779274.0, ans=0.1 2023-06-20 18:59:30,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-20 19:00:31,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-20 19:00:34,064 INFO [train.py:996] (1/4) Epoch 5, batch 7950, loss[loss=0.2305, simple_loss=0.3027, pruned_loss=0.07918, over 21435.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.324, pruned_loss=0.09328, over 4272367.92 frames. ], batch size: 211, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:01:15,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779634.0, ans=0.1 2023-06-20 19:01:56,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=779814.0, ans=0.125 2023-06-20 19:02:08,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 3.095e+02 3.551e+02 4.583e+02 8.567e+02, threshold=7.102e+02, percent-clipped=4.0 2023-06-20 19:02:14,708 INFO [train.py:996] (1/4) Epoch 5, batch 8000, loss[loss=0.264, simple_loss=0.3479, pruned_loss=0.09006, over 21691.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3285, pruned_loss=0.09627, over 4267770.15 frames. ], batch size: 351, lr: 6.40e-03, grad_scale: 32.0 2023-06-20 19:03:48,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-20 19:03:53,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=780114.0, ans=0.2 2023-06-20 19:04:02,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:04:05,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=780114.0, ans=0.125 2023-06-20 19:04:08,242 INFO [train.py:996] (1/4) Epoch 5, batch 8050, loss[loss=0.3657, simple_loss=0.4315, pruned_loss=0.15, over 21489.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3338, pruned_loss=0.09629, over 4270652.04 frames. ], batch size: 507, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:04:10,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. 
limit=15.0 2023-06-20 19:04:14,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-20 19:05:03,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=780294.0, ans=0.2 2023-06-20 19:05:19,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-20 19:05:46,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.429e+02 3.996e+02 5.275e+02 1.132e+03, threshold=7.992e+02, percent-clipped=3.0 2023-06-20 19:05:52,173 INFO [train.py:996] (1/4) Epoch 5, batch 8100, loss[loss=0.2775, simple_loss=0.3464, pruned_loss=0.1043, over 21760.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3332, pruned_loss=0.09734, over 4279919.51 frames. ], batch size: 112, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:06:06,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=780474.0, ans=0.125 2023-06-20 19:06:37,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=780594.0, ans=0.02 2023-06-20 19:07:48,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=780774.0, ans=0.125 2023-06-20 19:07:49,538 INFO [train.py:996] (1/4) Epoch 5, batch 8150, loss[loss=0.2697, simple_loss=0.363, pruned_loss=0.08822, over 21726.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3373, pruned_loss=0.09684, over 4276457.29 frames. ], batch size: 351, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:07:54,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=780774.0, ans=0.125 2023-06-20 19:08:10,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=780834.0, ans=0.0 2023-06-20 19:08:28,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=780894.0, ans=0.125 2023-06-20 19:09:27,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.029e+02 3.506e+02 4.177e+02 7.466e+02, threshold=7.011e+02, percent-clipped=0.0 2023-06-20 19:09:32,490 INFO [train.py:996] (1/4) Epoch 5, batch 8200, loss[loss=0.2375, simple_loss=0.2903, pruned_loss=0.09233, over 21266.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3293, pruned_loss=0.0936, over 4266053.09 frames. ], batch size: 551, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:10:29,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=781194.0, ans=0.125 2023-06-20 19:10:30,003 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:10:39,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-20 19:11:03,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-20 19:11:15,872 INFO [train.py:996] (1/4) Epoch 5, batch 8250, loss[loss=0.2343, simple_loss=0.3238, pruned_loss=0.07243, over 21387.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3274, pruned_loss=0.09284, over 4263219.03 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:11:19,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=781374.0, ans=0.125 2023-06-20 19:11:57,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=781434.0, ans=0.125 2023-06-20 19:12:19,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=8.0 2023-06-20 19:12:52,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.60 vs. limit=15.0 2023-06-20 19:12:56,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.737e+02 3.159e+02 4.146e+02 7.904e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-20 19:12:56,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=781614.0, ans=0.125 2023-06-20 19:12:59,855 INFO [train.py:996] (1/4) Epoch 5, batch 8300, loss[loss=0.2151, simple_loss=0.2853, pruned_loss=0.07249, over 21239.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3253, pruned_loss=0.09071, over 4266929.71 frames. ], batch size: 608, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:13:38,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=781734.0, ans=0.1 2023-06-20 19:13:40,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=781734.0, ans=0.2 2023-06-20 19:13:45,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=781794.0, ans=0.125 2023-06-20 19:14:37,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=781914.0, ans=0.125 2023-06-20 19:14:43,839 INFO [train.py:996] (1/4) Epoch 5, batch 8350, loss[loss=0.2567, simple_loss=0.3298, pruned_loss=0.09176, over 21704.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.325, pruned_loss=0.08878, over 4268022.88 frames. ], batch size: 282, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:14:46,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=781974.0, ans=0.125 2023-06-20 19:14:49,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=781974.0, ans=0.0 2023-06-20 19:14:59,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=782034.0, ans=0.2 2023-06-20 19:15:24,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=782034.0, ans=0.07 2023-06-20 19:16:18,273 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.671e+02 3.158e+02 4.298e+02 8.367e+02, threshold=6.316e+02, percent-clipped=9.0 2023-06-20 19:16:21,592 INFO [train.py:996] (1/4) Epoch 5, batch 8400, loss[loss=0.2344, simple_loss=0.2703, pruned_loss=0.09921, over 20165.00 frames. 
], tot_loss[loss=0.2484, simple_loss=0.3236, pruned_loss=0.08659, over 4255479.12 frames. ], batch size: 704, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:16:30,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=782274.0, ans=0.125 2023-06-20 19:16:42,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=782334.0, ans=0.125 2023-06-20 19:16:51,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=782334.0, ans=0.125 2023-06-20 19:17:40,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-20 19:18:06,484 INFO [train.py:996] (1/4) Epoch 5, batch 8450, loss[loss=0.2447, simple_loss=0.2979, pruned_loss=0.09573, over 21392.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.322, pruned_loss=0.0874, over 4264557.75 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:18:19,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=782574.0, ans=0.125 2023-06-20 19:18:53,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=782694.0, ans=0.125 2023-06-20 19:19:01,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-20 19:19:02,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-20 19:19:17,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=782754.0, ans=0.2 2023-06-20 19:19:23,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-20 19:19:25,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=782754.0, ans=0.125 2023-06-20 19:19:35,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=782814.0, ans=0.125 2023-06-20 19:19:44,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=782814.0, ans=15.0 2023-06-20 19:19:46,753 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.787e+02 3.267e+02 4.084e+02 6.258e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-20 19:19:49,893 INFO [train.py:996] (1/4) Epoch 5, batch 8500, loss[loss=0.2377, simple_loss=0.2889, pruned_loss=0.09327, over 21493.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3187, pruned_loss=0.08905, over 4272452.81 frames. ], batch size: 212, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:20:01,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=782874.0, ans=0.125 2023-06-20 19:20:48,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.88 vs. 
limit=10.0 2023-06-20 19:21:18,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=783114.0, ans=0.0 2023-06-20 19:21:22,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783114.0, ans=0.1 2023-06-20 19:21:33,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=783174.0, ans=0.2 2023-06-20 19:21:35,048 INFO [train.py:996] (1/4) Epoch 5, batch 8550, loss[loss=0.3199, simple_loss=0.3881, pruned_loss=0.1259, over 21734.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3255, pruned_loss=0.09321, over 4271826.17 frames. ], batch size: 351, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:21:41,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=783174.0, ans=0.125 2023-06-20 19:22:52,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783354.0, ans=0.1 2023-06-20 19:23:06,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=783414.0, ans=0.125 2023-06-20 19:23:16,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.988e+02 3.421e+02 4.230e+02 6.048e+02, threshold=6.842e+02, percent-clipped=0.0 2023-06-20 19:23:19,459 INFO [train.py:996] (1/4) Epoch 5, batch 8600, loss[loss=0.3025, simple_loss=0.3681, pruned_loss=0.1185, over 21741.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3345, pruned_loss=0.09643, over 4268664.27 frames. ], batch size: 332, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:23:36,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=783474.0, ans=0.04949747468305833 2023-06-20 19:24:57,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=783714.0, ans=0.125 2023-06-20 19:25:14,163 INFO [train.py:996] (1/4) Epoch 5, batch 8650, loss[loss=0.2573, simple_loss=0.3447, pruned_loss=0.08497, over 21532.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3384, pruned_loss=0.09639, over 4272572.17 frames. ], batch size: 471, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:25:21,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-20 19:26:08,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-20 19:26:09,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=783894.0, ans=0.2 2023-06-20 19:26:48,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.897e+02 3.428e+02 4.241e+02 7.600e+02, threshold=6.856e+02, percent-clipped=1.0 2023-06-20 19:26:51,747 INFO [train.py:996] (1/4) Epoch 5, batch 8700, loss[loss=0.2952, simple_loss=0.4095, pruned_loss=0.09052, over 20750.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3293, pruned_loss=0.09187, over 4270740.99 frames. 
], batch size: 607, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:28:31,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=784314.0, ans=0.2 2023-06-20 19:28:34,529 INFO [train.py:996] (1/4) Epoch 5, batch 8750, loss[loss=0.265, simple_loss=0.3195, pruned_loss=0.1053, over 21327.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3236, pruned_loss=0.09194, over 4281018.04 frames. ], batch size: 176, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:28:46,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=784374.0, ans=0.125 2023-06-20 19:29:00,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=784434.0, ans=0.125 2023-06-20 19:29:34,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=784554.0, ans=0.125 2023-06-20 19:29:48,825 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-20 19:30:01,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=784614.0, ans=0.125 2023-06-20 19:30:14,568 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.095e+02 3.629e+02 5.257e+02 8.550e+02, threshold=7.257e+02, percent-clipped=6.0 2023-06-20 19:30:18,134 INFO [train.py:996] (1/4) Epoch 5, batch 8800, loss[loss=0.3054, simple_loss=0.3748, pruned_loss=0.1179, over 21490.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3314, pruned_loss=0.09528, over 4282917.50 frames. ], batch size: 211, lr: 6.38e-03, grad_scale: 32.0 2023-06-20 19:31:14,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=784794.0, ans=0.2 2023-06-20 19:31:47,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=784914.0, ans=0.2 2023-06-20 19:32:01,522 INFO [train.py:996] (1/4) Epoch 5, batch 8850, loss[loss=0.2517, simple_loss=0.3139, pruned_loss=0.09473, over 21534.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3407, pruned_loss=0.09737, over 4286672.21 frames. ], batch size: 441, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:33:25,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=785214.0, ans=0.125 2023-06-20 19:33:45,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.064e+02 3.643e+02 4.711e+02 6.430e+02, threshold=7.286e+02, percent-clipped=0.0 2023-06-20 19:33:47,237 INFO [train.py:996] (1/4) Epoch 5, batch 8900, loss[loss=0.2474, simple_loss=0.3302, pruned_loss=0.08232, over 21566.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3356, pruned_loss=0.09637, over 4270839.44 frames. 
], batch size: 441, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:33:57,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=785274.0, ans=0.1 2023-06-20 19:34:01,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=785274.0, ans=10.0 2023-06-20 19:34:18,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=785334.0, ans=0.07 2023-06-20 19:34:51,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=785394.0, ans=0.2 2023-06-20 19:34:59,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=12.0 2023-06-20 19:35:37,862 INFO [train.py:996] (1/4) Epoch 5, batch 8950, loss[loss=0.2341, simple_loss=0.3012, pruned_loss=0.08352, over 21572.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3327, pruned_loss=0.09442, over 4263261.51 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:35:54,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=785634.0, ans=0.1 2023-06-20 19:35:58,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-20 19:36:19,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=785694.0, ans=0.125 2023-06-20 19:36:56,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=785754.0, ans=0.125 2023-06-20 19:37:18,308 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.139e+02 3.697e+02 4.422e+02 7.989e+02, threshold=7.395e+02, percent-clipped=1.0 2023-06-20 19:37:19,862 INFO [train.py:996] (1/4) Epoch 5, batch 9000, loss[loss=0.2195, simple_loss=0.2876, pruned_loss=0.07566, over 21247.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3272, pruned_loss=0.0936, over 4264571.65 frames. ], batch size: 159, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:37:19,863 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 19:37:36,466 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2656, simple_loss=0.3627, pruned_loss=0.0843, over 1796401.00 frames. 2023-06-20 19:37:36,467 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 19:37:40,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=785874.0, ans=0.125 2023-06-20 19:37:44,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=785874.0, ans=10.0 2023-06-20 19:38:00,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=785934.0, ans=0.125 2023-06-20 19:38:55,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=786054.0, ans=0.125 2023-06-20 19:38:56,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. 
limit=10.0 2023-06-20 19:39:20,730 INFO [train.py:996] (1/4) Epoch 5, batch 9050, loss[loss=0.2681, simple_loss=0.3376, pruned_loss=0.09925, over 21339.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3223, pruned_loss=0.08887, over 4266427.95 frames. ], batch size: 549, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:39:49,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=786234.0, ans=0.125 2023-06-20 19:39:57,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-20 19:40:10,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-20 19:40:20,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=786354.0, ans=0.125 2023-06-20 19:40:29,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=786354.0, ans=0.125 2023-06-20 19:40:37,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=786354.0, ans=0.2 2023-06-20 19:40:52,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=786414.0, ans=0.125 2023-06-20 19:40:58,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.876e+02 3.189e+02 3.875e+02 6.604e+02, threshold=6.378e+02, percent-clipped=0.0 2023-06-20 19:41:00,738 INFO [train.py:996] (1/4) Epoch 5, batch 9100, loss[loss=0.2462, simple_loss=0.3486, pruned_loss=0.07192, over 21697.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3285, pruned_loss=0.0905, over 4268391.82 frames. ], batch size: 414, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:41:21,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=786474.0, ans=0.125 2023-06-20 19:41:50,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-20 19:42:02,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=786594.0, ans=0.2 2023-06-20 19:42:20,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-20 19:42:45,934 INFO [train.py:996] (1/4) Epoch 5, batch 9150, loss[loss=0.3822, simple_loss=0.4417, pruned_loss=0.1613, over 21555.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3308, pruned_loss=0.08872, over 4272573.73 frames. 
], batch size: 508, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:42:54,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=786774.0, ans=0.2 2023-06-20 19:44:20,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=787014.0, ans=0.0 2023-06-20 19:44:35,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=787014.0, ans=0.0 2023-06-20 19:44:38,226 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.853e+02 3.445e+02 4.216e+02 8.485e+02, threshold=6.890e+02, percent-clipped=4.0 2023-06-20 19:44:39,946 INFO [train.py:996] (1/4) Epoch 5, batch 9200, loss[loss=0.2036, simple_loss=0.2924, pruned_loss=0.05736, over 21319.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3332, pruned_loss=0.08806, over 4267046.81 frames. ], batch size: 176, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:45:08,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=787134.0, ans=0.0 2023-06-20 19:45:25,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=787194.0, ans=0.2 2023-06-20 19:46:16,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=787314.0, ans=0.125 2023-06-20 19:46:23,472 INFO [train.py:996] (1/4) Epoch 5, batch 9250, loss[loss=0.2319, simple_loss=0.2949, pruned_loss=0.08443, over 21681.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3362, pruned_loss=0.09211, over 4268838.72 frames. ], batch size: 333, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:46:31,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=787374.0, ans=0.125 2023-06-20 19:46:32,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=787374.0, ans=0.125 2023-06-20 19:46:41,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=787374.0, ans=0.125 2023-06-20 19:47:19,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-20 19:47:57,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=787614.0, ans=0.0 2023-06-20 19:48:05,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.152e+02 3.844e+02 5.008e+02 7.995e+02, threshold=7.688e+02, percent-clipped=5.0 2023-06-20 19:48:12,327 INFO [train.py:996] (1/4) Epoch 5, batch 9300, loss[loss=0.247, simple_loss=0.3437, pruned_loss=0.07516, over 21576.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3308, pruned_loss=0.09236, over 4270238.26 frames. 
], batch size: 389, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:48:24,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=787674.0, ans=0.125 2023-06-20 19:48:28,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=787734.0, ans=0.0 2023-06-20 19:48:54,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-20 19:48:58,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=787794.0, ans=0.125 2023-06-20 19:49:24,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=787854.0, ans=0.0 2023-06-20 19:49:51,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.16 vs. limit=8.0 2023-06-20 19:49:54,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-20 19:49:57,390 INFO [train.py:996] (1/4) Epoch 5, batch 9350, loss[loss=0.2709, simple_loss=0.3422, pruned_loss=0.09978, over 21538.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.337, pruned_loss=0.09387, over 4271037.20 frames. ], batch size: 194, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:50:01,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=787974.0, ans=0.1 2023-06-20 19:50:12,487 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.94 vs. limit=10.0 2023-06-20 19:50:17,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-20 19:50:55,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=788094.0, ans=0.0 2023-06-20 19:51:32,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788214.0, ans=0.1 2023-06-20 19:51:41,579 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.821e+02 3.195e+02 3.693e+02 6.028e+02, threshold=6.390e+02, percent-clipped=0.0 2023-06-20 19:51:41,599 INFO [train.py:996] (1/4) Epoch 5, batch 9400, loss[loss=0.3802, simple_loss=0.4888, pruned_loss=0.1357, over 19943.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3385, pruned_loss=0.09516, over 4279369.58 frames. ], batch size: 702, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:51:49,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.14 vs. 
limit=6.0 2023-06-20 19:52:15,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=788334.0, ans=0.125 2023-06-20 19:52:15,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788334.0, ans=0.1 2023-06-20 19:52:50,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=788454.0, ans=0.015 2023-06-20 19:53:31,430 INFO [train.py:996] (1/4) Epoch 5, batch 9450, loss[loss=0.2283, simple_loss=0.2875, pruned_loss=0.08458, over 21548.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.331, pruned_loss=0.0944, over 4272150.95 frames. ], batch size: 391, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:53:59,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788634.0, ans=0.1 2023-06-20 19:54:53,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=788814.0, ans=0.1 2023-06-20 19:55:02,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=788814.0, ans=0.0 2023-06-20 19:55:04,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=788814.0, ans=0.0 2023-06-20 19:55:15,210 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.307e+02 4.000e+02 4.833e+02 8.255e+02, threshold=8.000e+02, percent-clipped=8.0 2023-06-20 19:55:15,241 INFO [train.py:996] (1/4) Epoch 5, batch 9500, loss[loss=0.2566, simple_loss=0.329, pruned_loss=0.0921, over 21735.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3248, pruned_loss=0.09171, over 4256239.19 frames. ], batch size: 332, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:56:07,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=788994.0, ans=0.125 2023-06-20 19:56:58,662 INFO [train.py:996] (1/4) Epoch 5, batch 9550, loss[loss=0.302, simple_loss=0.3786, pruned_loss=0.1126, over 21920.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3278, pruned_loss=0.09399, over 4262413.38 frames. ], batch size: 372, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:58:26,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.82 vs. limit=10.0 2023-06-20 19:58:42,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.917e+02 3.406e+02 3.917e+02 8.592e+02, threshold=6.812e+02, percent-clipped=1.0 2023-06-20 19:58:42,224 INFO [train.py:996] (1/4) Epoch 5, batch 9600, loss[loss=0.2164, simple_loss=0.291, pruned_loss=0.07088, over 21797.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3316, pruned_loss=0.09523, over 4266154.44 frames. 
], batch size: 298, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 19:58:50,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=789474.0, ans=0.2 2023-06-20 19:58:55,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=789474.0, ans=0.125 2023-06-20 19:59:00,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=789534.0, ans=0.2 2023-06-20 19:59:30,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.67 vs. limit=15.0 2023-06-20 19:59:49,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=789654.0, ans=0.125 2023-06-20 20:00:27,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=789774.0, ans=0.125 2023-06-20 20:00:28,514 INFO [train.py:996] (1/4) Epoch 5, batch 9650, loss[loss=0.2947, simple_loss=0.3705, pruned_loss=0.1094, over 21795.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3311, pruned_loss=0.0957, over 4271968.26 frames. ], batch size: 124, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:00:57,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-20 20:01:46,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=789954.0, ans=0.125 2023-06-20 20:02:07,772 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-20 20:02:13,487 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.873e+02 3.424e+02 4.153e+02 6.817e+02, threshold=6.847e+02, percent-clipped=1.0 2023-06-20 20:02:13,508 INFO [train.py:996] (1/4) Epoch 5, batch 9700, loss[loss=0.2656, simple_loss=0.3308, pruned_loss=0.1002, over 21592.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.334, pruned_loss=0.09661, over 4273697.01 frames. ], batch size: 389, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:03:07,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=790194.0, ans=0.1 2023-06-20 20:03:21,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-20 20:03:56,223 INFO [train.py:996] (1/4) Epoch 5, batch 9750, loss[loss=0.2975, simple_loss=0.3738, pruned_loss=0.1106, over 21869.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3287, pruned_loss=0.09468, over 4275164.28 frames. 
], batch size: 98, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:04:09,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=790374.0, ans=0.125 2023-06-20 20:04:46,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=790494.0, ans=10.0 2023-06-20 20:04:59,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=790554.0, ans=0.125 2023-06-20 20:05:09,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=790554.0, ans=0.125 2023-06-20 20:05:16,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790614.0, ans=0.1 2023-06-20 20:05:20,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=790614.0, ans=0.125 2023-06-20 20:05:34,144 INFO [train.py:996] (1/4) Epoch 5, batch 9800, loss[loss=0.2497, simple_loss=0.3193, pruned_loss=0.09007, over 15323.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3265, pruned_loss=0.09461, over 4277191.22 frames. ], batch size: 60, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:05:35,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.031e+02 3.637e+02 4.540e+02 9.363e+02, threshold=7.274e+02, percent-clipped=7.0 2023-06-20 20:06:25,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=790794.0, ans=0.2 2023-06-20 20:06:53,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790914.0, ans=0.1 2023-06-20 20:06:58,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-20 20:07:12,174 INFO [train.py:996] (1/4) Epoch 5, batch 9850, loss[loss=0.3104, simple_loss=0.3697, pruned_loss=0.1256, over 20666.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3236, pruned_loss=0.09466, over 4278927.11 frames. ], batch size: 607, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:07:44,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-20 20:07:54,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-20 20:08:10,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=791154.0, ans=0.025 2023-06-20 20:08:51,601 INFO [train.py:996] (1/4) Epoch 5, batch 9900, loss[loss=0.2599, simple_loss=0.3336, pruned_loss=0.09312, over 21412.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3191, pruned_loss=0.09374, over 4267521.27 frames. 
], batch size: 131, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:08:53,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.770e+02 3.187e+02 3.744e+02 7.656e+02, threshold=6.375e+02, percent-clipped=1.0 2023-06-20 20:10:04,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=791454.0, ans=0.2 2023-06-20 20:10:10,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-20 20:10:29,340 INFO [train.py:996] (1/4) Epoch 5, batch 9950, loss[loss=0.25, simple_loss=0.3118, pruned_loss=0.09409, over 21831.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3234, pruned_loss=0.09588, over 4273310.83 frames. ], batch size: 118, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:10:51,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=791634.0, ans=0.0 2023-06-20 20:11:29,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=791694.0, ans=0.0 2023-06-20 20:12:11,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=791874.0, ans=0.0 2023-06-20 20:12:13,170 INFO [train.py:996] (1/4) Epoch 5, batch 10000, loss[loss=0.2365, simple_loss=0.3074, pruned_loss=0.08284, over 21939.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3186, pruned_loss=0.09328, over 4275681.60 frames. ], batch size: 317, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:12:14,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.860e+02 3.256e+02 3.834e+02 6.756e+02, threshold=6.512e+02, percent-clipped=1.0 2023-06-20 20:12:24,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=791874.0, ans=0.05 2023-06-20 20:13:05,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-20 20:14:04,452 INFO [train.py:996] (1/4) Epoch 5, batch 10050, loss[loss=0.2441, simple_loss=0.3093, pruned_loss=0.08944, over 21613.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3205, pruned_loss=0.09313, over 4277476.85 frames. ], batch size: 391, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:14:12,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=792174.0, ans=0.2 2023-06-20 20:14:54,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-20 20:15:48,181 INFO [train.py:996] (1/4) Epoch 5, batch 10100, loss[loss=0.2918, simple_loss=0.3593, pruned_loss=0.1122, over 21882.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3177, pruned_loss=0.09063, over 4277502.75 frames. 
], batch size: 371, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:15:49,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.949e+02 3.459e+02 3.930e+02 6.580e+02, threshold=6.918e+02, percent-clipped=2.0 2023-06-20 20:16:05,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=792474.0, ans=0.0 2023-06-20 20:17:27,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792714.0, ans=0.1 2023-06-20 20:17:30,450 INFO [train.py:996] (1/4) Epoch 5, batch 10150, loss[loss=0.2619, simple_loss=0.3294, pruned_loss=0.09724, over 21498.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3254, pruned_loss=0.09446, over 4276551.51 frames. ], batch size: 389, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:17:38,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-20 20:18:04,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=792834.0, ans=0.125 2023-06-20 20:18:21,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=792894.0, ans=0.1 2023-06-20 20:19:12,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=793014.0, ans=0.0 2023-06-20 20:19:19,856 INFO [train.py:996] (1/4) Epoch 5, batch 10200, loss[loss=0.2277, simple_loss=0.3049, pruned_loss=0.0752, over 21263.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3247, pruned_loss=0.09227, over 4280108.72 frames. ], batch size: 176, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:19:21,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.868e+02 3.276e+02 4.054e+02 7.472e+02, threshold=6.552e+02, percent-clipped=1.0 2023-06-20 20:19:54,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=793134.0, ans=0.2 2023-06-20 20:20:31,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=793254.0, ans=0.125 2023-06-20 20:21:02,673 INFO [train.py:996] (1/4) Epoch 5, batch 10250, loss[loss=0.1613, simple_loss=0.234, pruned_loss=0.04425, over 21180.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3181, pruned_loss=0.08617, over 4274628.88 frames. 
], batch size: 143, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:21:27,163 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:21:51,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=793494.0, ans=0.0 2023-06-20 20:21:51,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793494.0, ans=0.1 2023-06-20 20:22:01,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=793494.0, ans=0.2 2023-06-20 20:22:16,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=793554.0, ans=0.1 2023-06-20 20:22:21,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=793554.0, ans=0.125 2023-06-20 20:22:53,284 INFO [train.py:996] (1/4) Epoch 5, batch 10300, loss[loss=0.2279, simple_loss=0.3009, pruned_loss=0.07742, over 21798.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3193, pruned_loss=0.08615, over 4270135.51 frames. ], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:22:54,762 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.621e+02 3.179e+02 4.486e+02 7.082e+02, threshold=6.359e+02, percent-clipped=5.0 2023-06-20 20:23:06,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=793674.0, ans=0.0 2023-06-20 20:23:18,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=793734.0, ans=0.125 2023-06-20 20:23:35,637 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:23:38,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=793794.0, ans=0.125 2023-06-20 20:24:25,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=793914.0, ans=0.0 2023-06-20 20:24:32,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=793914.0, ans=0.0 2023-06-20 20:24:37,238 INFO [train.py:996] (1/4) Epoch 5, batch 10350, loss[loss=0.2596, simple_loss=0.3663, pruned_loss=0.07647, over 21255.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3254, pruned_loss=0.08736, over 4269452.00 frames. ], batch size: 549, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:25:29,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=794094.0, ans=0.125 2023-06-20 20:25:30,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-20 20:26:06,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=794214.0, ans=0.125 2023-06-20 20:26:06,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.92 vs. 
limit=15.0 2023-06-20 20:26:16,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=794214.0, ans=0.0 2023-06-20 20:26:22,027 INFO [train.py:996] (1/4) Epoch 5, batch 10400, loss[loss=0.177, simple_loss=0.2274, pruned_loss=0.06326, over 21220.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3183, pruned_loss=0.0866, over 4263656.31 frames. ], batch size: 159, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:26:22,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=794274.0, ans=0.125 2023-06-20 20:26:23,725 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.112e+02 3.947e+02 5.007e+02 1.010e+03, threshold=7.895e+02, percent-clipped=9.0 2023-06-20 20:27:53,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=794514.0, ans=0.0 2023-06-20 20:28:13,681 INFO [train.py:996] (1/4) Epoch 5, batch 10450, loss[loss=0.2433, simple_loss=0.3164, pruned_loss=0.08506, over 21662.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3213, pruned_loss=0.08997, over 4260125.26 frames. ], batch size: 230, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:29:05,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=794694.0, ans=0.0 2023-06-20 20:29:47,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-20 20:29:57,610 INFO [train.py:996] (1/4) Epoch 5, batch 10500, loss[loss=0.2357, simple_loss=0.2943, pruned_loss=0.08859, over 21541.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3192, pruned_loss=0.08872, over 4259970.11 frames. ], batch size: 414, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:29:59,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.789e+02 3.342e+02 3.915e+02 9.640e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-20 20:30:29,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=794934.0, ans=0.125 2023-06-20 20:30:32,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=794934.0, ans=0.015 2023-06-20 20:30:47,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=794994.0, ans=0.0 2023-06-20 20:30:47,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=794994.0, ans=0.125 2023-06-20 20:31:42,470 INFO [train.py:996] (1/4) Epoch 5, batch 10550, loss[loss=0.1989, simple_loss=0.2637, pruned_loss=0.06706, over 21603.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3142, pruned_loss=0.08889, over 4260672.39 frames. ], batch size: 231, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:32:26,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. limit=10.0 2023-06-20 20:32:57,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-20 20:33:03,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=795414.0, ans=0.95 2023-06-20 20:33:03,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=795414.0, ans=0.025 2023-06-20 20:33:06,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=795414.0, ans=0.125 2023-06-20 20:33:28,201 INFO [train.py:996] (1/4) Epoch 5, batch 10600, loss[loss=0.1951, simple_loss=0.2896, pruned_loss=0.05027, over 21790.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3097, pruned_loss=0.08651, over 4261225.82 frames. ], batch size: 316, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:33:29,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.099e+02 3.993e+02 4.741e+02 9.586e+02, threshold=7.985e+02, percent-clipped=4.0 2023-06-20 20:33:36,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=795474.0, ans=0.5 2023-06-20 20:34:03,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-20 20:34:35,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=795654.0, ans=0.125 2023-06-20 20:35:23,415 INFO [train.py:996] (1/4) Epoch 5, batch 10650, loss[loss=0.1844, simple_loss=0.2635, pruned_loss=0.05267, over 21673.00 frames. ], tot_loss[loss=0.242, simple_loss=0.313, pruned_loss=0.08553, over 4267194.64 frames. ], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:37:06,856 INFO [train.py:996] (1/4) Epoch 5, batch 10700, loss[loss=0.3383, simple_loss=0.395, pruned_loss=0.1408, over 21803.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3109, pruned_loss=0.08545, over 4261886.73 frames. ], batch size: 118, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:37:08,778 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.769e+02 3.121e+02 4.008e+02 5.487e+02, threshold=6.241e+02, percent-clipped=0.0 2023-06-20 20:38:31,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=796314.0, ans=0.1 2023-06-20 20:38:51,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=796374.0, ans=0.5 2023-06-20 20:38:53,060 INFO [train.py:996] (1/4) Epoch 5, batch 10750, loss[loss=0.2597, simple_loss=0.3452, pruned_loss=0.08714, over 21438.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3221, pruned_loss=0.09006, over 4259437.16 frames. 
], batch size: 211, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:38:53,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=796374.0, ans=0.125 2023-06-20 20:38:58,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=796374.0, ans=0.125 2023-06-20 20:39:00,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=796374.0, ans=0.125 2023-06-20 20:39:44,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=796494.0, ans=0.0 2023-06-20 20:40:22,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=796614.0, ans=0.0 2023-06-20 20:40:30,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-20 20:40:44,423 INFO [train.py:996] (1/4) Epoch 5, batch 10800, loss[loss=0.283, simple_loss=0.36, pruned_loss=0.103, over 21360.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3262, pruned_loss=0.09126, over 4262220.95 frames. ], batch size: 131, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:40:47,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.089e+02 3.673e+02 4.196e+02 7.308e+02, threshold=7.346e+02, percent-clipped=3.0 2023-06-20 20:42:06,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=796914.0, ans=0.125 2023-06-20 20:42:29,741 INFO [train.py:996] (1/4) Epoch 5, batch 10850, loss[loss=0.2745, simple_loss=0.3306, pruned_loss=0.1092, over 21374.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3273, pruned_loss=0.09222, over 4261692.77 frames. ], batch size: 471, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:43:27,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=797154.0, ans=0.0 2023-06-20 20:44:11,614 INFO [train.py:996] (1/4) Epoch 5, batch 10900, loss[loss=0.2138, simple_loss=0.2951, pruned_loss=0.06625, over 21361.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3204, pruned_loss=0.08989, over 4256577.41 frames. ], batch size: 211, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:44:16,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 2.886e+02 3.394e+02 4.121e+02 7.095e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-20 20:44:23,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=797274.0, ans=0.125 2023-06-20 20:44:51,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797394.0, ans=0.1 2023-06-20 20:45:31,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=797514.0, ans=0.125 2023-06-20 20:45:32,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=797514.0, ans=0.125 2023-06-20 20:45:53,479 INFO [train.py:996] (1/4) Epoch 5, batch 10950, loss[loss=0.1932, simple_loss=0.2609, pruned_loss=0.06273, over 21617.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3163, pruned_loss=0.08759, over 4248003.13 frames. 
], batch size: 247, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:45:55,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=797574.0, ans=0.0 2023-06-20 20:46:07,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=797634.0, ans=0.2 2023-06-20 20:46:26,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=797634.0, ans=0.125 2023-06-20 20:46:34,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797694.0, ans=0.1 2023-06-20 20:46:37,841 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:47:35,102 INFO [train.py:996] (1/4) Epoch 5, batch 11000, loss[loss=0.2697, simple_loss=0.3323, pruned_loss=0.1035, over 21865.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3155, pruned_loss=0.08866, over 4248246.28 frames. ], batch size: 124, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:47:39,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.761e+02 3.277e+02 3.770e+02 5.855e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-20 20:48:28,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=797994.0, ans=0.125 2023-06-20 20:49:14,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=798114.0, ans=0.0 2023-06-20 20:49:17,446 INFO [train.py:996] (1/4) Epoch 5, batch 11050, loss[loss=0.2148, simple_loss=0.2728, pruned_loss=0.07839, over 21642.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3127, pruned_loss=0.0902, over 4257711.73 frames. ], batch size: 264, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:50:08,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-20 20:50:19,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=798354.0, ans=0.025 2023-06-20 20:50:59,766 INFO [train.py:996] (1/4) Epoch 5, batch 11100, loss[loss=0.236, simple_loss=0.2975, pruned_loss=0.08723, over 21993.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3113, pruned_loss=0.08996, over 4261688.27 frames. ], batch size: 103, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:51:04,509 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.972e+02 3.396e+02 4.031e+02 6.791e+02, threshold=6.791e+02, percent-clipped=1.0 2023-06-20 20:51:35,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798534.0, ans=0.1 2023-06-20 20:51:55,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=798594.0, ans=0.0 2023-06-20 20:52:43,895 INFO [train.py:996] (1/4) Epoch 5, batch 11150, loss[loss=0.2306, simple_loss=0.2985, pruned_loss=0.08139, over 21704.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3093, pruned_loss=0.08947, over 4265157.31 frames. 
], batch size: 124, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 20:52:57,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=798774.0, ans=0.125 2023-06-20 20:53:00,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=798834.0, ans=0.125 2023-06-20 20:53:27,514 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:54:06,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-20 20:54:15,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=799014.0, ans=0.125 2023-06-20 20:54:27,924 INFO [train.py:996] (1/4) Epoch 5, batch 11200, loss[loss=0.2194, simple_loss=0.2757, pruned_loss=0.08149, over 21866.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.308, pruned_loss=0.08865, over 4270102.57 frames. ], batch size: 107, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:54:33,044 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.645e+02 2.968e+02 3.525e+02 6.155e+02, threshold=5.936e+02, percent-clipped=0.0 2023-06-20 20:54:34,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=799074.0, ans=0.125 2023-06-20 20:55:57,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=799314.0, ans=0.0 2023-06-20 20:55:58,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=799314.0, ans=0.125 2023-06-20 20:56:11,022 INFO [train.py:996] (1/4) Epoch 5, batch 11250, loss[loss=0.2232, simple_loss=0.3126, pruned_loss=0.06691, over 21311.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3071, pruned_loss=0.08848, over 4269837.18 frames. ], batch size: 176, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:03,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=799494.0, ans=0.125 2023-06-20 20:57:16,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=799554.0, ans=0.1 2023-06-20 20:57:20,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-20 20:57:32,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.16 vs. limit=22.5 2023-06-20 20:57:43,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=799614.0, ans=0.125 2023-06-20 20:57:52,792 INFO [train.py:996] (1/4) Epoch 5, batch 11300, loss[loss=0.2378, simple_loss=0.2906, pruned_loss=0.09243, over 20245.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3084, pruned_loss=0.08838, over 4279740.43 frames. 
], batch size: 703, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:57,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.794e+02 3.076e+02 3.460e+02 4.900e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-20 20:58:53,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.84 vs. limit=22.5 2023-06-20 20:59:38,212 INFO [train.py:996] (1/4) Epoch 5, batch 11350, loss[loss=0.3152, simple_loss=0.3891, pruned_loss=0.1207, over 21654.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3112, pruned_loss=0.08835, over 4279606.72 frames. ], batch size: 415, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:59:39,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=22.5 2023-06-20 20:59:39,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-20 20:59:44,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=799974.0, ans=0.125 2023-06-20 21:00:32,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800094.0, ans=0.1 2023-06-20 21:00:51,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=800154.0, ans=0.0 2023-06-20 21:01:00,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=800154.0, ans=0.125 2023-06-20 21:01:06,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=800214.0, ans=0.0 2023-06-20 21:01:07,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=800214.0, ans=0.125 2023-06-20 21:01:21,796 INFO [train.py:996] (1/4) Epoch 5, batch 11400, loss[loss=0.2227, simple_loss=0.3007, pruned_loss=0.07232, over 21500.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3167, pruned_loss=0.0904, over 4266780.85 frames. ], batch size: 195, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:01:26,747 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.159e+02 3.655e+02 4.619e+02 8.867e+02, threshold=7.309e+02, percent-clipped=8.0 2023-06-20 21:02:07,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=800394.0, ans=0.125 2023-06-20 21:02:14,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=800394.0, ans=0.125 2023-06-20 21:03:10,006 INFO [train.py:996] (1/4) Epoch 5, batch 11450, loss[loss=0.2328, simple_loss=0.2999, pruned_loss=0.08287, over 20063.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3192, pruned_loss=0.08961, over 4265231.29 frames. ], batch size: 704, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:03:10,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.67 vs. 
limit=22.5 2023-06-20 21:03:13,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=800574.0, ans=0.1 2023-06-20 21:03:19,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.49 vs. limit=15.0 2023-06-20 21:04:36,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800814.0, ans=0.1 2023-06-20 21:04:50,151 INFO [train.py:996] (1/4) Epoch 5, batch 11500, loss[loss=0.2158, simple_loss=0.2993, pruned_loss=0.06617, over 21243.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3238, pruned_loss=0.09118, over 4269887.68 frames. ], batch size: 159, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:04:56,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.797e+02 3.137e+02 3.643e+02 6.427e+02, threshold=6.273e+02, percent-clipped=0.0 2023-06-20 21:05:04,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.86 vs. limit=12.0 2023-06-20 21:05:06,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=22.5 2023-06-20 21:05:50,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=800994.0, ans=0.0 2023-06-20 21:05:54,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=801054.0, ans=0.125 2023-06-20 21:06:39,669 INFO [train.py:996] (1/4) Epoch 5, batch 11550, loss[loss=0.2832, simple_loss=0.383, pruned_loss=0.09165, over 21773.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3296, pruned_loss=0.09131, over 4275288.43 frames. ], batch size: 351, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:07:11,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=801234.0, ans=0.125 2023-06-20 21:08:24,262 INFO [train.py:996] (1/4) Epoch 5, batch 11600, loss[loss=0.2486, simple_loss=0.3434, pruned_loss=0.07688, over 21446.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3443, pruned_loss=0.09371, over 4274582.26 frames. ], batch size: 211, lr: 6.31e-03, grad_scale: 32.0 2023-06-20 21:08:30,827 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.937e+02 3.426e+02 4.222e+02 6.279e+02, threshold=6.853e+02, percent-clipped=1.0 2023-06-20 21:08:52,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801534.0, ans=0.1 2023-06-20 21:09:11,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.63 vs. limit=10.0 2023-06-20 21:09:54,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=801714.0, ans=0.125 2023-06-20 21:10:08,650 INFO [train.py:996] (1/4) Epoch 5, batch 11650, loss[loss=0.2572, simple_loss=0.3369, pruned_loss=0.08875, over 21880.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3484, pruned_loss=0.09349, over 4268111.19 frames. 
], batch size: 372, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:10:24,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=801774.0, ans=0.0 2023-06-20 21:10:41,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-20 21:11:30,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=802014.0, ans=0.0 2023-06-20 21:11:37,330 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:11:51,273 INFO [train.py:996] (1/4) Epoch 5, batch 11700, loss[loss=0.3108, simple_loss=0.3284, pruned_loss=0.1466, over 21508.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3403, pruned_loss=0.09396, over 4258781.25 frames. ], batch size: 512, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:12:03,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.808e+02 3.203e+02 3.779e+02 6.898e+02, threshold=6.406e+02, percent-clipped=1.0 2023-06-20 21:13:05,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=802254.0, ans=0.125 2023-06-20 21:13:16,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=802314.0, ans=0.125 2023-06-20 21:13:28,453 INFO [train.py:996] (1/4) Epoch 5, batch 11750, loss[loss=0.2731, simple_loss=0.338, pruned_loss=0.1041, over 21675.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3306, pruned_loss=0.09381, over 4258757.16 frames. ], batch size: 298, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:13:44,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=802374.0, ans=0.125 2023-06-20 21:13:48,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-20 21:14:34,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=802554.0, ans=0.1 2023-06-20 21:14:47,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=802554.0, ans=0.0 2023-06-20 21:15:18,409 INFO [train.py:996] (1/4) Epoch 5, batch 11800, loss[loss=0.2444, simple_loss=0.3486, pruned_loss=0.07005, over 21876.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3329, pruned_loss=0.09524, over 4264745.94 frames. ], batch size: 372, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:15:26,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.905e+02 3.538e+02 4.499e+02 8.498e+02, threshold=7.075e+02, percent-clipped=3.0 2023-06-20 21:15:32,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=802674.0, ans=0.1 2023-06-20 21:16:09,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=802794.0, ans=0.0 2023-06-20 21:16:53,699 INFO [train.py:996] (1/4) Epoch 5, batch 11850, loss[loss=0.2875, simple_loss=0.3775, pruned_loss=0.09876, over 21547.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3332, pruned_loss=0.09448, over 4268913.86 frames. 
], batch size: 471, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:16:58,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=802974.0, ans=0.0 2023-06-20 21:17:03,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-20 21:17:38,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=803094.0, ans=0.2 2023-06-20 21:18:33,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=803214.0, ans=0.02 2023-06-20 21:18:40,290 INFO [train.py:996] (1/4) Epoch 5, batch 11900, loss[loss=0.2178, simple_loss=0.2801, pruned_loss=0.0778, over 21764.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3335, pruned_loss=0.09217, over 4263966.81 frames. ], batch size: 124, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:18:48,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.635e+02 2.949e+02 3.429e+02 6.903e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-20 21:19:27,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=803394.0, ans=0.07 2023-06-20 21:20:25,263 INFO [train.py:996] (1/4) Epoch 5, batch 11950, loss[loss=0.1894, simple_loss=0.2798, pruned_loss=0.04954, over 21666.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3366, pruned_loss=0.08883, over 4266108.74 frames. ], batch size: 247, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:20:28,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=803574.0, ans=0.125 2023-06-20 21:21:02,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=803634.0, ans=0.2 2023-06-20 21:21:07,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=803694.0, ans=0.2 2023-06-20 21:21:49,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=803754.0, ans=0.1 2023-06-20 21:21:53,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=803814.0, ans=0.125 2023-06-20 21:22:04,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=803814.0, ans=0.2 2023-06-20 21:22:08,843 INFO [train.py:996] (1/4) Epoch 5, batch 12000, loss[loss=0.2288, simple_loss=0.2866, pruned_loss=0.08552, over 21502.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3297, pruned_loss=0.08726, over 4265885.65 frames. ], batch size: 212, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:22:08,843 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 21:22:26,103 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2641, simple_loss=0.3594, pruned_loss=0.08443, over 1796401.00 frames. 
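
The validation entry just above (epoch 5, batch 12000) and the "Maximum memory allocated" line that follows it come from the periodic validation pass that train.py runs every valid_interval (3000) training batches; batch 12000 is such a multiple. Below is a minimal, hedged sketch of what that bookkeeping might look like. It is not the actual icefall code: the function maybe_validate, the compute_loss callback, and the per-frame normalisation are assumptions made for illustration only; the one call known to exist as written is torch.cuda.max_memory_allocated().

    import torch

    VALID_INTERVAL = 3000  # interval, in training batches, between validation passes (assumed)

    def maybe_validate(model, valid_dl, batch_idx_train, compute_loss):
        # Sketch only: run a validation pass every VALID_INTERVAL training batches.
        # compute_loss is assumed to return (loss, simple_loss, pruned_loss, num_frames)
        # for one batch; the real train.py aggregates its losses differently.
        if batch_idx_train % VALID_INTERVAL != 0:
            return
        model.eval()
        tot = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0, "frames": 0.0}
        with torch.no_grad():
            for batch in valid_dl:
                loss, simple_loss, pruned_loss, num_frames = compute_loss(model, batch)
                tot["loss"] += loss.item()
                tot["simple_loss"] += simple_loss.item()
                tot["pruned_loss"] += pruned_loss.item()
                tot["frames"] += num_frames
        model.train()
        # Normalise by the number of frames, matching the "over N frames" wording in the log.
        for k in ("loss", "simple_loss", "pruned_loss"):
            tot[k] /= max(tot["frames"], 1.0)
        print(f"validation: loss={tot['loss']:.4g}, simple_loss={tot['simple_loss']:.4g}, "
              f"pruned_loss={tot['pruned_loss']:.4g}, over {tot['frames']:.2f} frames.")
        # torch.cuda.max_memory_allocated() is a real PyTorch call and is presumably
        # what produces the "Maximum memory allocated so far" line.
        print(f"Maximum memory allocated so far is "
              f"{torch.cuda.max_memory_allocated() // (1024 * 1024)}MB")

Triggering validation on a fixed batch-count interval rather than once per epoch is what lets these validation and peak-memory entries appear interleaved with ordinary training entries in the middle of epoch 5, as they do here.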
2023-06-20 21:22:26,104 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 21:22:33,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=803874.0, ans=0.2 2023-06-20 21:22:34,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 3.012e+02 3.779e+02 4.599e+02 7.953e+02, threshold=7.557e+02, percent-clipped=8.0 2023-06-20 21:23:01,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=803934.0, ans=0.125 2023-06-20 21:23:02,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-20 21:24:10,194 INFO [train.py:996] (1/4) Epoch 5, batch 12050, loss[loss=0.2515, simple_loss=0.313, pruned_loss=0.09497, over 21904.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3255, pruned_loss=0.08978, over 4269733.83 frames. ], batch size: 316, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:24:12,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=804174.0, ans=0.125 2023-06-20 21:24:27,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=804174.0, ans=0.2 2023-06-20 21:25:07,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=804294.0, ans=0.125 2023-06-20 21:25:33,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=804354.0, ans=0.2 2023-06-20 21:25:56,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-20 21:26:00,495 INFO [train.py:996] (1/4) Epoch 5, batch 12100, loss[loss=0.2522, simple_loss=0.3496, pruned_loss=0.07736, over 19840.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3298, pruned_loss=0.09356, over 4278036.04 frames. ], batch size: 702, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:26:12,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-20 21:26:14,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.887e+02 3.241e+02 3.794e+02 5.961e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 21:26:15,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=804474.0, ans=0.125 2023-06-20 21:26:21,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=804534.0, ans=0.2 2023-06-20 21:26:27,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.88 vs. 
limit=15.0 2023-06-20 21:26:54,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=804594.0, ans=0.125 2023-06-20 21:27:07,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=804654.0, ans=0.125 2023-06-20 21:27:10,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=804654.0, ans=0.125 2023-06-20 21:27:40,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=804714.0, ans=0.0 2023-06-20 21:27:47,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=804714.0, ans=0.1 2023-06-20 21:27:54,328 INFO [train.py:996] (1/4) Epoch 5, batch 12150, loss[loss=0.3635, simple_loss=0.4408, pruned_loss=0.1431, over 21478.00 frames. ], tot_loss[loss=0.257, simple_loss=0.33, pruned_loss=0.09199, over 4268999.11 frames. ], batch size: 507, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:28:17,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-20 21:28:18,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=804834.0, ans=0.025 2023-06-20 21:28:28,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804834.0, ans=0.125 2023-06-20 21:28:46,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-20 21:29:03,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=8.0 2023-06-20 21:29:17,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-20 21:29:24,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=805014.0, ans=0.0 2023-06-20 21:29:30,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-20 21:29:38,003 INFO [train.py:996] (1/4) Epoch 5, batch 12200, loss[loss=0.2283, simple_loss=0.2806, pruned_loss=0.08794, over 21449.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3296, pruned_loss=0.09145, over 4255522.25 frames. ], batch size: 212, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:29:38,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=805074.0, ans=0.125 2023-06-20 21:29:51,299 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.953e+02 3.466e+02 4.601e+02 9.385e+02, threshold=6.933e+02, percent-clipped=9.0 2023-06-20 21:30:17,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=805134.0, ans=0.2 2023-06-20 21:30:46,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.36 vs. 
limit=10.0 2023-06-20 21:31:22,563 INFO [train.py:996] (1/4) Epoch 5, batch 12250, loss[loss=0.2102, simple_loss=0.2904, pruned_loss=0.06497, over 21760.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3219, pruned_loss=0.08854, over 4256707.40 frames. ], batch size: 371, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:32:14,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-06-20 21:32:30,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=805554.0, ans=0.2 2023-06-20 21:32:44,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=805554.0, ans=0.0 2023-06-20 21:33:06,427 INFO [train.py:996] (1/4) Epoch 5, batch 12300, loss[loss=0.1735, simple_loss=0.2385, pruned_loss=0.0542, over 16758.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.312, pruned_loss=0.08243, over 4245473.97 frames. ], batch size: 63, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:33:20,982 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.636e+02 3.177e+02 4.024e+02 7.253e+02, threshold=6.354e+02, percent-clipped=1.0 2023-06-20 21:33:54,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-20 21:34:16,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=805854.0, ans=0.0 2023-06-20 21:34:45,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-20 21:34:49,562 INFO [train.py:996] (1/4) Epoch 5, batch 12350, loss[loss=0.2993, simple_loss=0.3651, pruned_loss=0.1167, over 21587.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3149, pruned_loss=0.08224, over 4248909.38 frames. ], batch size: 471, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:35:01,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=805974.0, ans=0.2 2023-06-20 21:36:26,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=806214.0, ans=0.0 2023-06-20 21:36:34,904 INFO [train.py:996] (1/4) Epoch 5, batch 12400, loss[loss=0.2639, simple_loss=0.3251, pruned_loss=0.1013, over 20077.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3188, pruned_loss=0.0866, over 4254986.86 frames. 
], batch size: 703, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:36:49,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.789e+02 3.195e+02 3.703e+02 5.558e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-20 21:37:16,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=806334.0, ans=0.1 2023-06-20 21:38:14,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=806514.0, ans=0.125 2023-06-20 21:38:17,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=806514.0, ans=0.125 2023-06-20 21:38:25,339 INFO [train.py:996] (1/4) Epoch 5, batch 12450, loss[loss=0.2611, simple_loss=0.3772, pruned_loss=0.07249, over 20762.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3236, pruned_loss=0.09027, over 4264811.40 frames. ], batch size: 607, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:38:46,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=806634.0, ans=0.0 2023-06-20 21:40:17,342 INFO [train.py:996] (1/4) Epoch 5, batch 12500, loss[loss=0.2731, simple_loss=0.3693, pruned_loss=0.08842, over 21706.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3358, pruned_loss=0.09412, over 4264019.96 frames. ], batch size: 351, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:40:24,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-20 21:40:27,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 2.818e+02 3.353e+02 4.175e+02 5.969e+02, threshold=6.707e+02, percent-clipped=0.0 2023-06-20 21:40:47,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=806934.0, ans=0.125 2023-06-20 21:40:56,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=806934.0, ans=0.0 2023-06-20 21:41:13,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=806994.0, ans=0.0 2023-06-20 21:42:04,530 INFO [train.py:996] (1/4) Epoch 5, batch 12550, loss[loss=0.276, simple_loss=0.3553, pruned_loss=0.09831, over 21589.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.339, pruned_loss=0.09574, over 4267345.40 frames. ], batch size: 414, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:42:30,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=807234.0, ans=0.125 2023-06-20 21:42:49,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=807294.0, ans=0.0 2023-06-20 21:43:28,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=807354.0, ans=0.0 2023-06-20 21:43:47,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807474.0, ans=0.1 2023-06-20 21:43:53,029 INFO [train.py:996] (1/4) Epoch 5, batch 12600, loss[loss=0.215, simple_loss=0.303, pruned_loss=0.06349, over 21737.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3382, pruned_loss=0.09429, over 4270447.32 frames. 
], batch size: 332, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:44:08,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.858e+02 3.271e+02 3.869e+02 6.376e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-20 21:44:12,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=807474.0, ans=10.0 2023-06-20 21:44:14,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=807534.0, ans=0.125 2023-06-20 21:44:45,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=807594.0, ans=0.0 2023-06-20 21:45:17,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=807714.0, ans=0.0 2023-06-20 21:45:18,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807714.0, ans=0.1 2023-06-20 21:45:24,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=807714.0, ans=0.07 2023-06-20 21:45:26,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=807714.0, ans=0.125 2023-06-20 21:45:35,583 INFO [train.py:996] (1/4) Epoch 5, batch 12650, loss[loss=0.2585, simple_loss=0.323, pruned_loss=0.09703, over 21651.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3304, pruned_loss=0.09008, over 4267878.66 frames. ], batch size: 473, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:46:42,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=807954.0, ans=0.1 2023-06-20 21:47:20,041 INFO [train.py:996] (1/4) Epoch 5, batch 12700, loss[loss=0.2757, simple_loss=0.3252, pruned_loss=0.1131, over 21202.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3296, pruned_loss=0.09289, over 4272240.47 frames. ], batch size: 608, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:47:35,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.923e+02 3.430e+02 4.123e+02 8.274e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 21:48:30,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=808254.0, ans=0.125 2023-06-20 21:49:03,348 INFO [train.py:996] (1/4) Epoch 5, batch 12750, loss[loss=0.2317, simple_loss=0.3083, pruned_loss=0.0776, over 21754.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3312, pruned_loss=0.09357, over 4272630.29 frames. ], batch size: 124, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:49:18,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=808374.0, ans=0.125 2023-06-20 21:49:38,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. 
limit=15.0 2023-06-20 21:49:52,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=808494.0, ans=0.125 2023-06-20 21:50:15,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=808554.0, ans=0.125 2023-06-20 21:50:32,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=808614.0, ans=0.125 2023-06-20 21:50:52,076 INFO [train.py:996] (1/4) Epoch 5, batch 12800, loss[loss=0.2397, simple_loss=0.3043, pruned_loss=0.08762, over 21869.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3309, pruned_loss=0.09431, over 4276983.97 frames. ], batch size: 298, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:50:52,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=808674.0, ans=0.125 2023-06-20 21:51:03,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.761e+02 3.177e+02 3.732e+02 6.852e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-20 21:51:12,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=808734.0, ans=0.125 2023-06-20 21:51:19,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=808734.0, ans=0.2 2023-06-20 21:51:33,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=808794.0, ans=0.125 2023-06-20 21:51:50,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=808794.0, ans=0.09899494936611666 2023-06-20 21:52:08,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=808854.0, ans=0.0 2023-06-20 21:52:08,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=808854.0, ans=0.125 2023-06-20 21:52:19,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=808914.0, ans=0.2 2023-06-20 21:52:32,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=808914.0, ans=0.0 2023-06-20 21:52:37,034 INFO [train.py:996] (1/4) Epoch 5, batch 12850, loss[loss=0.232, simple_loss=0.3088, pruned_loss=0.07756, over 19945.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.334, pruned_loss=0.09531, over 4273654.06 frames. ], batch size: 703, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:52:45,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=808974.0, ans=0.0 2023-06-20 21:52:54,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.32 vs. 
limit=15.0 2023-06-20 21:53:13,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=809034.0, ans=0.025 2023-06-20 21:53:33,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809094.0, ans=0.1 2023-06-20 21:53:53,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=809154.0, ans=0.125 2023-06-20 21:54:22,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=809214.0, ans=0.0 2023-06-20 21:54:26,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-20 21:54:27,282 INFO [train.py:996] (1/4) Epoch 5, batch 12900, loss[loss=0.2218, simple_loss=0.2905, pruned_loss=0.07661, over 21202.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.331, pruned_loss=0.09197, over 4265970.80 frames. ], batch size: 159, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:54:32,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=809274.0, ans=0.0 2023-06-20 21:54:45,408 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.796e+02 3.210e+02 3.770e+02 8.746e+02, threshold=6.419e+02, percent-clipped=1.0 2023-06-20 21:55:05,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=809334.0, ans=0.125 2023-06-20 21:55:09,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=809334.0, ans=0.125 2023-06-20 21:55:19,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=809394.0, ans=0.2 2023-06-20 21:56:17,447 INFO [train.py:996] (1/4) Epoch 5, batch 12950, loss[loss=0.2545, simple_loss=0.3257, pruned_loss=0.09167, over 21718.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3298, pruned_loss=0.08951, over 4266167.16 frames. ], batch size: 332, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:57:40,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809814.0, ans=0.125 2023-06-20 21:58:00,340 INFO [train.py:996] (1/4) Epoch 5, batch 13000, loss[loss=0.2373, simple_loss=0.3154, pruned_loss=0.07957, over 21713.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3302, pruned_loss=0.09025, over 4268756.47 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:58:07,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=809874.0, ans=0.125 2023-06-20 21:58:14,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. 
limit=6.0 2023-06-20 21:58:15,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.939e+02 3.382e+02 4.149e+02 6.832e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 21:58:33,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=809934.0, ans=0.125 2023-06-20 21:58:47,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=809994.0, ans=0.125 2023-06-20 21:58:52,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=809994.0, ans=0.0 2023-06-20 21:58:52,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-20 21:59:42,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-20 21:59:45,260 INFO [train.py:996] (1/4) Epoch 5, batch 13050, loss[loss=0.2559, simple_loss=0.3246, pruned_loss=0.09363, over 21869.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3255, pruned_loss=0.08697, over 4272274.29 frames. ], batch size: 371, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:59:45,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=810174.0, ans=0.125 2023-06-20 21:59:47,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=810174.0, ans=0.2 2023-06-20 21:59:57,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=810174.0, ans=0.125 2023-06-20 22:00:00,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=810234.0, ans=0.1 2023-06-20 22:00:15,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-20 22:01:29,072 INFO [train.py:996] (1/4) Epoch 5, batch 13100, loss[loss=0.2917, simple_loss=0.3585, pruned_loss=0.1124, over 21901.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3281, pruned_loss=0.08826, over 4274264.58 frames. ], batch size: 371, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:01:49,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.828e+02 3.544e+02 4.567e+02 8.084e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-20 22:02:22,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=810594.0, ans=0.1 2023-06-20 22:02:24,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=810594.0, ans=0.125 2023-06-20 22:02:38,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. 
limit=15.0 2023-06-20 22:02:56,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=810714.0, ans=0.2 2023-06-20 22:02:57,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=810714.0, ans=0.125 2023-06-20 22:03:02,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=810714.0, ans=0.2 2023-06-20 22:03:19,519 INFO [train.py:996] (1/4) Epoch 5, batch 13150, loss[loss=0.272, simple_loss=0.3263, pruned_loss=0.1089, over 21586.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3289, pruned_loss=0.09145, over 4274796.58 frames. ], batch size: 263, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:03:19,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=810774.0, ans=0.0 2023-06-20 22:03:21,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=810774.0, ans=0.0 2023-06-20 22:03:29,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=810774.0, ans=0.1 2023-06-20 22:05:00,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=811014.0, ans=0.125 2023-06-20 22:05:03,337 INFO [train.py:996] (1/4) Epoch 5, batch 13200, loss[loss=0.2582, simple_loss=0.3224, pruned_loss=0.09702, over 21432.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3281, pruned_loss=0.0917, over 4277150.63 frames. ], batch size: 194, lr: 6.28e-03, grad_scale: 16.0 2023-06-20 22:05:17,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=811074.0, ans=0.2 2023-06-20 22:05:18,453 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.901e+02 3.344e+02 4.017e+02 7.205e+02, threshold=6.688e+02, percent-clipped=1.0 2023-06-20 22:05:43,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=811134.0, ans=0.04949747468305833 2023-06-20 22:05:56,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=811194.0, ans=0.125 2023-06-20 22:06:05,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 22:06:15,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=811254.0, ans=0.125 2023-06-20 22:06:48,140 INFO [train.py:996] (1/4) Epoch 5, batch 13250, loss[loss=0.2344, simple_loss=0.3121, pruned_loss=0.07832, over 21399.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3294, pruned_loss=0.09306, over 4278028.59 frames. 
], batch size: 211, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:07:06,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=811374.0, ans=0.125 2023-06-20 22:07:22,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=811434.0, ans=0.09899494936611666 2023-06-20 22:07:31,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=811494.0, ans=15.0 2023-06-20 22:07:43,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811494.0, ans=0.1 2023-06-20 22:08:03,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=811554.0, ans=0.0 2023-06-20 22:08:27,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=811614.0, ans=0.125 2023-06-20 22:08:39,109 INFO [train.py:996] (1/4) Epoch 5, batch 13300, loss[loss=0.3013, simple_loss=0.3616, pruned_loss=0.1205, over 21476.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3331, pruned_loss=0.09351, over 4283250.48 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:08:54,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=811674.0, ans=0.1 2023-06-20 22:08:59,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.849e+02 3.375e+02 4.056e+02 7.431e+02, threshold=6.749e+02, percent-clipped=1.0 2023-06-20 22:09:20,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=811734.0, ans=0.0 2023-06-20 22:09:43,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811854.0, ans=0.1 2023-06-20 22:09:44,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=811854.0, ans=0.95 2023-06-20 22:09:52,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-20 22:10:00,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=811914.0, ans=0.1 2023-06-20 22:10:28,578 INFO [train.py:996] (1/4) Epoch 5, batch 13350, loss[loss=0.2456, simple_loss=0.3223, pruned_loss=0.08449, over 21630.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3364, pruned_loss=0.09613, over 4283117.90 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:11:19,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=812094.0, ans=0.0 2023-06-20 22:12:08,361 INFO [train.py:996] (1/4) Epoch 5, batch 13400, loss[loss=0.2624, simple_loss=0.3272, pruned_loss=0.09879, over 21474.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3364, pruned_loss=0.09759, over 4284937.82 frames. 
], batch size: 548, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:12:22,372 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.994e+02 3.475e+02 4.105e+02 5.675e+02, threshold=6.951e+02, percent-clipped=0.0 2023-06-20 22:12:34,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=22.5 2023-06-20 22:12:58,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=812394.0, ans=0.125 2023-06-20 22:13:10,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=812454.0, ans=0.1 2023-06-20 22:13:12,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=812454.0, ans=0.125 2023-06-20 22:13:46,629 INFO [train.py:996] (1/4) Epoch 5, batch 13450, loss[loss=0.258, simple_loss=0.3153, pruned_loss=0.1004, over 21633.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3377, pruned_loss=0.09997, over 4289005.67 frames. ], batch size: 263, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:13:50,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=812574.0, ans=0.125 2023-06-20 22:14:07,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=812634.0, ans=0.2 2023-06-20 22:14:37,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=812694.0, ans=0.125 2023-06-20 22:14:53,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=812754.0, ans=0.0 2023-06-20 22:15:06,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=812754.0, ans=0.1 2023-06-20 22:15:21,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-20 22:15:24,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=812814.0, ans=0.125 2023-06-20 22:15:33,798 INFO [train.py:996] (1/4) Epoch 5, batch 13500, loss[loss=0.3577, simple_loss=0.4086, pruned_loss=0.1535, over 21432.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3286, pruned_loss=0.09656, over 4279440.14 frames. ], batch size: 471, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:15:38,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=812874.0, ans=0.04949747468305833 2023-06-20 22:15:53,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.173e+02 3.616e+02 4.505e+02 8.152e+02, threshold=7.232e+02, percent-clipped=1.0 2023-06-20 22:16:41,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=813054.0, ans=0.125 2023-06-20 22:16:50,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=813054.0, ans=0.1 2023-06-20 22:17:15,883 INFO [train.py:996] (1/4) Epoch 5, batch 13550, loss[loss=0.2264, simple_loss=0.3167, pruned_loss=0.06803, over 21430.00 frames. 
], tot_loss[loss=0.2608, simple_loss=0.3309, pruned_loss=0.09536, over 4277567.91 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 8.0 2023-06-20 22:17:25,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813174.0, ans=0.1 2023-06-20 22:17:56,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=813294.0, ans=0.0 2023-06-20 22:18:54,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-20 22:18:55,144 INFO [train.py:996] (1/4) Epoch 5, batch 13600, loss[loss=0.2795, simple_loss=0.3413, pruned_loss=0.1088, over 21728.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3335, pruned_loss=0.0965, over 4283620.82 frames. ], batch size: 389, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:19:16,168 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.895e+02 3.506e+02 4.425e+02 7.285e+02, threshold=7.012e+02, percent-clipped=2.0 2023-06-20 22:20:39,906 INFO [train.py:996] (1/4) Epoch 5, batch 13650, loss[loss=0.2376, simple_loss=0.2929, pruned_loss=0.0912, over 21996.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3286, pruned_loss=0.09306, over 4268549.24 frames. ], batch size: 103, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:20:40,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813774.0, ans=0.1 2023-06-20 22:20:46,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=813774.0, ans=0.1 2023-06-20 22:21:58,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=814014.0, ans=0.1 2023-06-20 22:22:19,138 INFO [train.py:996] (1/4) Epoch 5, batch 13700, loss[loss=0.1919, simple_loss=0.2297, pruned_loss=0.07705, over 16244.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3201, pruned_loss=0.09134, over 4260680.10 frames. ], batch size: 60, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:22:23,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=814074.0, ans=0.0 2023-06-20 22:22:41,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.806e+02 3.342e+02 4.306e+02 8.545e+02, threshold=6.684e+02, percent-clipped=2.0 2023-06-20 22:24:01,291 INFO [train.py:996] (1/4) Epoch 5, batch 13750, loss[loss=0.2801, simple_loss=0.348, pruned_loss=0.1061, over 20693.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3193, pruned_loss=0.09166, over 4262473.34 frames. ], batch size: 607, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:24:05,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=814374.0, ans=0.1 2023-06-20 22:25:24,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=814554.0, ans=0.0 2023-06-20 22:25:30,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
limit=15.0 2023-06-20 22:25:48,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=814674.0, ans=0.05 2023-06-20 22:25:49,698 INFO [train.py:996] (1/4) Epoch 5, batch 13800, loss[loss=0.3092, simple_loss=0.4072, pruned_loss=0.1056, over 21661.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3235, pruned_loss=0.09033, over 4268050.46 frames. ], batch size: 414, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:26:06,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.993e+02 3.321e+02 4.024e+02 5.976e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 22:26:08,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=814734.0, ans=0.125 2023-06-20 22:26:56,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=814854.0, ans=0.125 2023-06-20 22:27:09,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=814914.0, ans=0.125 2023-06-20 22:27:26,012 INFO [train.py:996] (1/4) Epoch 5, batch 13850, loss[loss=0.318, simple_loss=0.3876, pruned_loss=0.1243, over 21837.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3308, pruned_loss=0.09173, over 4262365.14 frames. ], batch size: 371, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:28:37,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 22:29:01,707 INFO [train.py:996] (1/4) Epoch 5, batch 13900, loss[loss=0.2469, simple_loss=0.3148, pruned_loss=0.08951, over 21817.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3355, pruned_loss=0.09551, over 4264582.16 frames. ], batch size: 351, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:29:05,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=815274.0, ans=0.1 2023-06-20 22:29:27,668 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 2.909e+02 3.378e+02 3.977e+02 7.082e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 22:29:29,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815334.0, ans=0.1 2023-06-20 22:29:29,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=815334.0, ans=0.0 2023-06-20 22:29:35,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=815334.0, ans=0.0 2023-06-20 22:29:37,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.82 vs. limit=22.5 2023-06-20 22:30:06,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=815454.0, ans=0.2 2023-06-20 22:30:41,761 INFO [train.py:996] (1/4) Epoch 5, batch 13950, loss[loss=0.2424, simple_loss=0.3132, pruned_loss=0.08583, over 21782.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3357, pruned_loss=0.09723, over 4270922.87 frames. 
], batch size: 298, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:31:07,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-20 22:31:18,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=815634.0, ans=0.09899494936611666 2023-06-20 22:31:38,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=815694.0, ans=0.07 2023-06-20 22:31:38,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=815694.0, ans=0.2 2023-06-20 22:31:43,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=815754.0, ans=0.0 2023-06-20 22:31:48,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.94 vs. limit=22.5 2023-06-20 22:31:52,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=815754.0, ans=0.125 2023-06-20 22:32:12,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815814.0, ans=0.1 2023-06-20 22:32:24,842 INFO [train.py:996] (1/4) Epoch 5, batch 14000, loss[loss=0.148, simple_loss=0.2164, pruned_loss=0.03977, over 16416.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3336, pruned_loss=0.09452, over 4261788.90 frames. ], batch size: 62, lr: 6.26e-03, grad_scale: 32.0 2023-06-20 22:32:33,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=815874.0, ans=0.0 2023-06-20 22:32:37,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.54 vs. limit=10.0 2023-06-20 22:32:40,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=815874.0, ans=0.1 2023-06-20 22:32:45,986 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.619e+02 3.139e+02 3.881e+02 6.690e+02, threshold=6.278e+02, percent-clipped=0.0 2023-06-20 22:32:47,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=815934.0, ans=0.2 2023-06-20 22:33:57,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816114.0, ans=0.1 2023-06-20 22:34:05,224 INFO [train.py:996] (1/4) Epoch 5, batch 14050, loss[loss=0.2532, simple_loss=0.3569, pruned_loss=0.07475, over 20763.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3276, pruned_loss=0.08968, over 4263194.40 frames. ], batch size: 608, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:34:20,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816174.0, ans=0.1 2023-06-20 22:34:20,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=816174.0, ans=0.0 2023-06-20 22:34:33,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. 
limit=22.5 2023-06-20 22:34:52,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=816294.0, ans=0.0 2023-06-20 22:35:21,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=816354.0, ans=0.015 2023-06-20 22:35:27,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=816414.0, ans=0.0 2023-06-20 22:35:37,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=816414.0, ans=0.125 2023-06-20 22:35:44,093 INFO [train.py:996] (1/4) Epoch 5, batch 14100, loss[loss=0.2874, simple_loss=0.338, pruned_loss=0.1184, over 21359.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3207, pruned_loss=0.08917, over 4263659.02 frames. ], batch size: 471, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:35:48,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=816474.0, ans=0.125 2023-06-20 22:36:07,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.771e+02 3.154e+02 4.028e+02 6.108e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-20 22:36:23,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=816594.0, ans=0.0 2023-06-20 22:37:12,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816714.0, ans=0.1 2023-06-20 22:37:18,620 INFO [train.py:996] (1/4) Epoch 5, batch 14150, loss[loss=0.2538, simple_loss=0.3498, pruned_loss=0.07888, over 21201.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3253, pruned_loss=0.09129, over 4255870.99 frames. ], batch size: 548, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:37:29,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=816774.0, ans=0.1 2023-06-20 22:38:57,649 INFO [train.py:996] (1/4) Epoch 5, batch 14200, loss[loss=0.248, simple_loss=0.3067, pruned_loss=0.09462, over 21672.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3229, pruned_loss=0.08916, over 4263857.42 frames. ], batch size: 230, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:39:19,845 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.484e+02 2.963e+02 3.702e+02 8.044e+02, threshold=5.927e+02, percent-clipped=3.0 2023-06-20 22:39:45,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=817194.0, ans=0.125 2023-06-20 22:40:01,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-20 22:40:22,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0 2023-06-20 22:40:36,146 INFO [train.py:996] (1/4) Epoch 5, batch 14250, loss[loss=0.2378, simple_loss=0.3063, pruned_loss=0.08463, over 21028.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3187, pruned_loss=0.08997, over 4261207.12 frames. 
], batch size: 608, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:40:57,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=817434.0, ans=0.125 2023-06-20 22:40:58,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=817434.0, ans=0.125 2023-06-20 22:42:22,701 INFO [train.py:996] (1/4) Epoch 5, batch 14300, loss[loss=0.2262, simple_loss=0.3243, pruned_loss=0.06408, over 21599.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3191, pruned_loss=0.08933, over 4248862.17 frames. ], batch size: 389, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:42:46,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 3.147e+02 3.840e+02 5.009e+02 9.347e+02, threshold=7.680e+02, percent-clipped=16.0 2023-06-20 22:44:02,687 INFO [train.py:996] (1/4) Epoch 5, batch 14350, loss[loss=0.2403, simple_loss=0.3224, pruned_loss=0.07909, over 21844.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3276, pruned_loss=0.09103, over 4254835.31 frames. ], batch size: 282, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:44:03,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=817974.0, ans=0.0 2023-06-20 22:44:40,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 22:45:40,618 INFO [train.py:996] (1/4) Epoch 5, batch 14400, loss[loss=0.2767, simple_loss=0.3796, pruned_loss=0.08697, over 19799.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3265, pruned_loss=0.09085, over 4252366.29 frames. ], batch size: 703, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:45:58,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.774e+02 3.108e+02 3.689e+02 4.790e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 22:45:58,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818334.0, ans=0.1 2023-06-20 22:46:15,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-20 22:46:35,387 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:46:40,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.46 vs. limit=15.0 2023-06-20 22:46:56,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818514.0, ans=0.1 2023-06-20 22:47:20,538 INFO [train.py:996] (1/4) Epoch 5, batch 14450, loss[loss=0.2415, simple_loss=0.3065, pruned_loss=0.08824, over 22020.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3202, pruned_loss=0.09062, over 4248050.96 frames. 
], batch size: 103, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:47:31,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=818574.0, ans=0.125 2023-06-20 22:48:05,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=818694.0, ans=0.0 2023-06-20 22:48:17,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=818754.0, ans=0.2 2023-06-20 22:48:28,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=818754.0, ans=0.95 2023-06-20 22:48:58,540 INFO [train.py:996] (1/4) Epoch 5, batch 14500, loss[loss=0.2529, simple_loss=0.3133, pruned_loss=0.09624, over 21199.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3163, pruned_loss=0.09017, over 4259012.90 frames. ], batch size: 159, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:49:16,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.793e+02 3.259e+02 3.991e+02 5.427e+02, threshold=6.518e+02, percent-clipped=0.0 2023-06-20 22:49:37,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=818994.0, ans=0.0 2023-06-20 22:49:49,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-20 22:50:07,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=819054.0, ans=0.0 2023-06-20 22:50:34,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=819114.0, ans=0.2 2023-06-20 22:50:40,030 INFO [train.py:996] (1/4) Epoch 5, batch 14550, loss[loss=0.2346, simple_loss=0.3173, pruned_loss=0.07592, over 21806.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.321, pruned_loss=0.09223, over 4260475.57 frames. ], batch size: 316, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:51:29,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=819294.0, ans=0.035 2023-06-20 22:51:34,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=819294.0, ans=0.2 2023-06-20 22:51:37,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=819354.0, ans=0.125 2023-06-20 22:51:40,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=819354.0, ans=0.0 2023-06-20 22:52:12,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=819414.0, ans=0.07 2023-06-20 22:52:20,164 INFO [train.py:996] (1/4) Epoch 5, batch 14600, loss[loss=0.2751, simple_loss=0.352, pruned_loss=0.09909, over 21281.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.33, pruned_loss=0.09764, over 4266840.43 frames. 
], batch size: 159, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:52:25,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=819474.0, ans=0.125 2023-06-20 22:52:31,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=819474.0, ans=0.125 2023-06-20 22:52:44,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.146e+02 3.577e+02 4.655e+02 8.854e+02, threshold=7.154e+02, percent-clipped=8.0 2023-06-20 22:53:09,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=819594.0, ans=0.2 2023-06-20 22:54:00,086 INFO [train.py:996] (1/4) Epoch 5, batch 14650, loss[loss=0.1808, simple_loss=0.2615, pruned_loss=0.05001, over 21276.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3317, pruned_loss=0.09623, over 4268654.64 frames. ], batch size: 159, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:54:05,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-20 22:55:01,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=819954.0, ans=0.2 2023-06-20 22:55:05,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=819954.0, ans=22.5 2023-06-20 22:55:27,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=820014.0, ans=0.125 2023-06-20 22:55:29,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-20 22:55:41,413 INFO [train.py:996] (1/4) Epoch 5, batch 14700, loss[loss=0.2385, simple_loss=0.3246, pruned_loss=0.07622, over 21735.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3232, pruned_loss=0.08944, over 4261240.90 frames. ], batch size: 247, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:56:11,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.525e+02 2.974e+02 3.979e+02 6.680e+02, threshold=5.948e+02, percent-clipped=0.0 2023-06-20 22:56:35,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=820194.0, ans=0.125 2023-06-20 22:57:29,224 INFO [train.py:996] (1/4) Epoch 5, batch 14750, loss[loss=0.3078, simple_loss=0.3845, pruned_loss=0.1156, over 21874.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3283, pruned_loss=0.09166, over 4260763.12 frames. ], batch size: 316, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:58:21,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=820494.0, ans=0.07 2023-06-20 22:58:44,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-20 22:59:11,129 INFO [train.py:996] (1/4) Epoch 5, batch 14800, loss[loss=0.2878, simple_loss=0.3345, pruned_loss=0.1206, over 21859.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3415, pruned_loss=0.09829, over 4267728.06 frames. 
], batch size: 98, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:59:18,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=820674.0, ans=0.125 2023-06-20 22:59:30,584 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.152e+02 3.633e+02 4.425e+02 1.058e+03, threshold=7.266e+02, percent-clipped=3.0 2023-06-20 22:59:34,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=820734.0, ans=0.125 2023-06-20 22:59:58,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=820794.0, ans=0.0 2023-06-20 23:00:06,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=820794.0, ans=0.1 2023-06-20 23:00:45,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=820914.0, ans=0.0 2023-06-20 23:00:54,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=820974.0, ans=0.1 2023-06-20 23:00:55,535 INFO [train.py:996] (1/4) Epoch 5, batch 14850, loss[loss=0.2633, simple_loss=0.3456, pruned_loss=0.09057, over 19925.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3355, pruned_loss=0.0975, over 4271244.05 frames. ], batch size: 702, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:02:11,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=821154.0, ans=0.125 2023-06-20 23:02:37,257 INFO [train.py:996] (1/4) Epoch 5, batch 14900, loss[loss=0.2842, simple_loss=0.3486, pruned_loss=0.11, over 21374.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3397, pruned_loss=0.1006, over 4277932.95 frames. ], batch size: 159, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:03:08,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.108e+02 3.722e+02 4.348e+02 7.688e+02, threshold=7.444e+02, percent-clipped=1.0 2023-06-20 23:03:14,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-20 23:03:23,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 23:03:31,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=22.5 2023-06-20 23:03:47,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=821454.0, ans=0.125 2023-06-20 23:03:59,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=821454.0, ans=0.2 2023-06-20 23:04:02,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=821514.0, ans=0.125 2023-06-20 23:04:29,708 INFO [train.py:996] (1/4) Epoch 5, batch 14950, loss[loss=0.2147, simple_loss=0.3059, pruned_loss=0.06176, over 21592.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3401, pruned_loss=0.099, over 4278531.85 frames. 
], batch size: 230, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:04:45,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.43 vs. limit=6.0 2023-06-20 23:05:04,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=821634.0, ans=0.125 2023-06-20 23:05:12,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=821694.0, ans=0.0 2023-06-20 23:05:33,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=821754.0, ans=0.0 2023-06-20 23:06:10,022 INFO [train.py:996] (1/4) Epoch 5, batch 15000, loss[loss=0.2531, simple_loss=0.3165, pruned_loss=0.09488, over 21483.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3413, pruned_loss=0.1005, over 4287021.86 frames. ], batch size: 194, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:06:10,023 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-20 23:06:26,229 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2595, simple_loss=0.3578, pruned_loss=0.08055, over 1796401.00 frames. 2023-06-20 23:06:26,230 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-20 23:06:28,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821874.0, ans=0.125 2023-06-20 23:07:00,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.991e+02 3.617e+02 4.837e+02 7.610e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-20 23:07:12,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=821994.0, ans=0.0 2023-06-20 23:07:34,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=822054.0, ans=0.1 2023-06-20 23:07:53,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=822114.0, ans=0.0 2023-06-20 23:07:55,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=822114.0, ans=0.2 2023-06-20 23:08:03,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=822114.0, ans=0.0 2023-06-20 23:08:12,411 INFO [train.py:996] (1/4) Epoch 5, batch 15050, loss[loss=0.2899, simple_loss=0.3669, pruned_loss=0.1064, over 21696.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3442, pruned_loss=0.1021, over 4278120.27 frames. ], batch size: 389, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:08:17,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=822174.0, ans=0.125 2023-06-20 23:09:01,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-20 23:09:14,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=822294.0, ans=15.0 2023-06-20 23:09:59,482 INFO [train.py:996] (1/4) Epoch 5, batch 15100, loss[loss=0.2547, simple_loss=0.3248, pruned_loss=0.09227, over 21415.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.345, pruned_loss=0.101, over 4279193.80 frames. 
], batch size: 159, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:10:12,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=822474.0, ans=0.125 2023-06-20 23:10:25,326 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.218e+02 4.050e+02 5.256e+02 8.500e+02, threshold=8.100e+02, percent-clipped=5.0 2023-06-20 23:10:34,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-20 23:11:03,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.26 vs. limit=10.0 2023-06-20 23:11:23,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=822714.0, ans=0.0 2023-06-20 23:11:44,941 INFO [train.py:996] (1/4) Epoch 5, batch 15150, loss[loss=0.2381, simple_loss=0.3022, pruned_loss=0.08704, over 21392.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3416, pruned_loss=0.1012, over 4276267.47 frames. ], batch size: 131, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:12:05,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=822834.0, ans=0.125 2023-06-20 23:13:09,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=823014.0, ans=0.2 2023-06-20 23:13:21,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=823014.0, ans=0.1 2023-06-20 23:13:24,116 INFO [train.py:996] (1/4) Epoch 5, batch 15200, loss[loss=0.2021, simple_loss=0.2864, pruned_loss=0.0589, over 21673.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3322, pruned_loss=0.09674, over 4277889.27 frames. ], batch size: 415, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:13:45,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.736e+02 3.206e+02 4.003e+02 7.087e+02, threshold=6.412e+02, percent-clipped=0.0 2023-06-20 23:14:11,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=823194.0, ans=0.0 2023-06-20 23:14:12,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=823194.0, ans=0.125 2023-06-20 23:15:01,154 INFO [train.py:996] (1/4) Epoch 5, batch 15250, loss[loss=0.2399, simple_loss=0.3014, pruned_loss=0.08924, over 21830.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3254, pruned_loss=0.09549, over 4282866.65 frames. ], batch size: 372, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:15:10,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.59 vs. limit=22.5 2023-06-20 23:15:44,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=823494.0, ans=0.125 2023-06-20 23:15:44,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-20 23:15:48,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=823494.0, ans=0.125 2023-06-20 23:15:57,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=823494.0, ans=0.0 2023-06-20 23:16:30,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-20 23:16:31,502 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:16:42,961 INFO [train.py:996] (1/4) Epoch 5, batch 15300, loss[loss=0.3312, simple_loss=0.4269, pruned_loss=0.1177, over 19714.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3302, pruned_loss=0.0994, over 4279595.40 frames. ], batch size: 703, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:17:02,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=823734.0, ans=0.1 2023-06-20 23:17:04,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.998e+02 3.594e+02 4.256e+02 7.669e+02, threshold=7.187e+02, percent-clipped=3.0 2023-06-20 23:17:25,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=823794.0, ans=0.035 2023-06-20 23:17:43,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=823854.0, ans=0.025 2023-06-20 23:17:44,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=823854.0, ans=0.125 2023-06-20 23:17:48,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.33 vs. limit=6.0 2023-06-20 23:18:09,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=823914.0, ans=0.04949747468305833 2023-06-20 23:18:09,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=823914.0, ans=0.0 2023-06-20 23:18:23,674 INFO [train.py:996] (1/4) Epoch 5, batch 15350, loss[loss=0.2737, simple_loss=0.3506, pruned_loss=0.09847, over 21899.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3355, pruned_loss=0.1005, over 4275476.32 frames. ], batch size: 371, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:18:26,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-20 23:18:38,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=824034.0, ans=0.2 2023-06-20 23:20:03,469 INFO [train.py:996] (1/4) Epoch 5, batch 15400, loss[loss=0.2437, simple_loss=0.3147, pruned_loss=0.08634, over 21876.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3361, pruned_loss=0.09866, over 4280440.00 frames. 
], batch size: 414, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:20:21,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=824334.0, ans=0.0 2023-06-20 23:20:25,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.899e+02 3.241e+02 4.047e+02 6.361e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 23:21:05,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.76 vs. limit=22.5 2023-06-20 23:21:39,514 INFO [train.py:996] (1/4) Epoch 5, batch 15450, loss[loss=0.2997, simple_loss=0.3568, pruned_loss=0.1214, over 20623.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3337, pruned_loss=0.09807, over 4275747.48 frames. ], batch size: 607, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:21:41,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=824574.0, ans=0.125 2023-06-20 23:22:13,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=824634.0, ans=0.125 2023-06-20 23:22:53,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=824754.0, ans=0.125 2023-06-20 23:23:19,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=824874.0, ans=0.125 2023-06-20 23:23:20,708 INFO [train.py:996] (1/4) Epoch 5, batch 15500, loss[loss=0.3098, simple_loss=0.3745, pruned_loss=0.1225, over 21557.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3347, pruned_loss=0.09768, over 4266114.47 frames. ], batch size: 414, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:23:53,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=12.0 2023-06-20 23:23:54,385 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.818e+02 3.290e+02 3.883e+02 6.635e+02, threshold=6.579e+02, percent-clipped=1.0 2023-06-20 23:24:09,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=824994.0, ans=0.125 2023-06-20 23:24:37,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=825054.0, ans=0.0 2023-06-20 23:24:44,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-06-20 23:25:02,094 INFO [train.py:996] (1/4) Epoch 5, batch 15550, loss[loss=0.1929, simple_loss=0.2545, pruned_loss=0.06564, over 21852.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3321, pruned_loss=0.09488, over 4266130.81 frames. ], batch size: 98, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:25:33,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=825234.0, ans=0.125 2023-06-20 23:26:15,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=825354.0, ans=0.05 2023-06-20 23:26:42,173 INFO [train.py:996] (1/4) Epoch 5, batch 15600, loss[loss=0.2597, simple_loss=0.3015, pruned_loss=0.1089, over 21367.00 frames. 
], tot_loss[loss=0.2562, simple_loss=0.3256, pruned_loss=0.09345, over 4266529.29 frames. ], batch size: 508, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:27:05,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-20 23:27:09,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.848e+02 3.319e+02 3.887e+02 5.745e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-20 23:27:21,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=825594.0, ans=0.125 2023-06-20 23:27:23,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=825594.0, ans=0.125 2023-06-20 23:28:03,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=825714.0, ans=0.05 2023-06-20 23:28:11,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=825714.0, ans=0.2 2023-06-20 23:28:13,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.03 vs. limit=15.0 2023-06-20 23:28:17,185 INFO [train.py:996] (1/4) Epoch 5, batch 15650, loss[loss=0.2825, simple_loss=0.3613, pruned_loss=0.1018, over 21618.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3243, pruned_loss=0.09317, over 4258530.25 frames. ], batch size: 441, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:28:30,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=825774.0, ans=0.2 2023-06-20 23:30:01,338 INFO [train.py:996] (1/4) Epoch 5, batch 15700, loss[loss=0.258, simple_loss=0.3315, pruned_loss=0.09221, over 21526.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3198, pruned_loss=0.09201, over 4258949.61 frames. ], batch size: 389, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:30:13,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=826074.0, ans=0.125 2023-06-20 23:30:15,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=826074.0, ans=0.2 2023-06-20 23:30:28,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=826134.0, ans=0.125 2023-06-20 23:30:29,893 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.764e+02 3.253e+02 4.322e+02 6.346e+02, threshold=6.507e+02, percent-clipped=0.0 2023-06-20 23:30:52,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=826194.0, ans=0.125 2023-06-20 23:31:41,389 INFO [train.py:996] (1/4) Epoch 5, batch 15750, loss[loss=0.2213, simple_loss=0.2831, pruned_loss=0.07975, over 21824.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3149, pruned_loss=0.09121, over 4261542.46 frames. 
], batch size: 372, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:31:41,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=826374.0, ans=0.125 2023-06-20 23:32:08,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=826434.0, ans=0.125 2023-06-20 23:32:13,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=826434.0, ans=0.125 2023-06-20 23:32:21,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=826494.0, ans=0.1 2023-06-20 23:32:23,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=826494.0, ans=0.5 2023-06-20 23:32:23,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=826494.0, ans=0.125 2023-06-20 23:32:30,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=826494.0, ans=0.0 2023-06-20 23:32:41,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=826494.0, ans=0.125 2023-06-20 23:33:13,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=826614.0, ans=0.0 2023-06-20 23:33:15,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=826614.0, ans=0.0 2023-06-20 23:33:15,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=826614.0, ans=0.125 2023-06-20 23:33:21,174 INFO [train.py:996] (1/4) Epoch 5, batch 15800, loss[loss=0.2304, simple_loss=0.2891, pruned_loss=0.08589, over 21645.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3112, pruned_loss=0.09065, over 4253138.44 frames. 
], batch size: 231, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:33:27,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=826674.0, ans=0.125 2023-06-20 23:33:39,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=826674.0, ans=0.1 2023-06-20 23:33:50,784 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.927e+02 3.607e+02 4.746e+02 7.598e+02, threshold=7.214e+02, percent-clipped=2.0 2023-06-20 23:33:51,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=826734.0, ans=0.125 2023-06-20 23:33:59,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=826794.0, ans=0.02 2023-06-20 23:34:18,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=826794.0, ans=0.1 2023-06-20 23:34:42,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=826914.0, ans=15.0 2023-06-20 23:34:52,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=826914.0, ans=0.2 2023-06-20 23:34:59,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=826914.0, ans=0.0 2023-06-20 23:35:01,892 INFO [train.py:996] (1/4) Epoch 5, batch 15850, loss[loss=0.2918, simple_loss=0.3415, pruned_loss=0.121, over 21369.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3169, pruned_loss=0.09409, over 4259377.87 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:35:12,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=826974.0, ans=0.0 2023-06-20 23:35:15,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=826974.0, ans=0.125 2023-06-20 23:35:49,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=827094.0, ans=0.0 2023-06-20 23:36:41,680 INFO [train.py:996] (1/4) Epoch 5, batch 15900, loss[loss=0.2208, simple_loss=0.2872, pruned_loss=0.07718, over 15815.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3158, pruned_loss=0.09424, over 4254400.59 frames. ], batch size: 60, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:37:11,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.862e+02 3.189e+02 4.240e+02 8.969e+02, threshold=6.379e+02, percent-clipped=1.0 2023-06-20 23:37:31,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=827394.0, ans=15.0 2023-06-20 23:38:04,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. 
limit=15.0 2023-06-20 23:38:07,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=827514.0, ans=15.0 2023-06-20 23:38:17,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=827514.0, ans=0.0 2023-06-20 23:38:22,868 INFO [train.py:996] (1/4) Epoch 5, batch 15950, loss[loss=0.2229, simple_loss=0.2984, pruned_loss=0.07375, over 21721.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.314, pruned_loss=0.0915, over 4252496.16 frames. ], batch size: 247, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:39:19,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=827754.0, ans=0.125 2023-06-20 23:39:57,217 INFO [train.py:996] (1/4) Epoch 5, batch 16000, loss[loss=0.2247, simple_loss=0.3114, pruned_loss=0.06902, over 21720.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3157, pruned_loss=0.08854, over 4256880.09 frames. ], batch size: 247, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:40:06,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=827874.0, ans=0.0 2023-06-20 23:40:30,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.522e+02 3.012e+02 3.700e+02 7.317e+02, threshold=6.025e+02, percent-clipped=2.0 2023-06-20 23:40:43,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=827994.0, ans=0.2 2023-06-20 23:40:54,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.28 vs. limit=15.0 2023-06-20 23:40:58,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=827994.0, ans=0.035 2023-06-20 23:41:12,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=828054.0, ans=0.125 2023-06-20 23:41:23,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=828114.0, ans=0.125 2023-06-20 23:41:28,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828114.0, ans=0.1 2023-06-20 23:41:29,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=828114.0, ans=0.0 2023-06-20 23:41:43,627 INFO [train.py:996] (1/4) Epoch 5, batch 16050, loss[loss=0.2101, simple_loss=0.282, pruned_loss=0.06911, over 21159.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3182, pruned_loss=0.08661, over 4266940.02 frames. 
], batch size: 143, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:42:23,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=828294.0, ans=0.125 2023-06-20 23:42:23,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=828294.0, ans=0.125 2023-06-20 23:42:53,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=828354.0, ans=0.125 2023-06-20 23:43:21,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=828414.0, ans=0.2 2023-06-20 23:43:23,959 INFO [train.py:996] (1/4) Epoch 5, batch 16100, loss[loss=0.2901, simple_loss=0.3415, pruned_loss=0.1193, over 21477.00 frames. ], tot_loss[loss=0.249, simple_loss=0.321, pruned_loss=0.08847, over 4271186.69 frames. ], batch size: 131, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:43:52,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.759e+02 3.248e+02 4.030e+02 6.532e+02, threshold=6.496e+02, percent-clipped=1.0 2023-06-20 23:43:56,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-20 23:44:52,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=828714.0, ans=0.2 2023-06-20 23:44:57,666 INFO [train.py:996] (1/4) Epoch 5, batch 16150, loss[loss=0.2361, simple_loss=0.3107, pruned_loss=0.08069, over 21846.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.321, pruned_loss=0.09081, over 4275980.58 frames. ], batch size: 282, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:45:08,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-20 23:45:19,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-20 23:46:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=829014.0, ans=0.0 2023-06-20 23:46:23,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=829014.0, ans=0.1 2023-06-20 23:46:34,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-20 23:46:40,194 INFO [train.py:996] (1/4) Epoch 5, batch 16200, loss[loss=0.247, simple_loss=0.3305, pruned_loss=0.08175, over 21654.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3258, pruned_loss=0.09282, over 4279376.53 frames. ], batch size: 263, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:47:09,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.854e+02 3.310e+02 3.979e+02 8.024e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-20 23:48:02,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=829314.0, ans=0.125 2023-06-20 23:48:19,430 INFO [train.py:996] (1/4) Epoch 5, batch 16250, loss[loss=0.2721, simple_loss=0.3415, pruned_loss=0.1013, over 21334.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3263, pruned_loss=0.09243, over 4271189.00 frames. 
], batch size: 548, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:49:43,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=829614.0, ans=0.0 2023-06-20 23:50:03,639 INFO [train.py:996] (1/4) Epoch 5, batch 16300, loss[loss=0.2612, simple_loss=0.3402, pruned_loss=0.09109, over 19961.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3205, pruned_loss=0.08789, over 4261115.98 frames. ], batch size: 702, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:50:10,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=829674.0, ans=0.125 2023-06-20 23:50:29,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.495e+02 2.799e+02 3.333e+02 5.849e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 23:50:35,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=829734.0, ans=0.125 2023-06-20 23:50:41,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=829794.0, ans=0.1 2023-06-20 23:51:18,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=829854.0, ans=0.0 2023-06-20 23:51:44,328 INFO [train.py:996] (1/4) Epoch 5, batch 16350, loss[loss=0.286, simple_loss=0.3697, pruned_loss=0.1011, over 21497.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3209, pruned_loss=0.08875, over 4256083.47 frames. ], batch size: 131, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:53:03,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-20 23:53:09,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=830214.0, ans=0.5 2023-06-20 23:53:23,348 INFO [train.py:996] (1/4) Epoch 5, batch 16400, loss[loss=0.2628, simple_loss=0.3282, pruned_loss=0.09869, over 21914.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3253, pruned_loss=0.09051, over 4262443.24 frames. ], batch size: 351, lr: 6.20e-03, grad_scale: 32.0 2023-06-20 23:53:33,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=830274.0, ans=0.0 2023-06-20 23:53:52,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.889e+02 3.302e+02 3.961e+02 7.962e+02, threshold=6.603e+02, percent-clipped=4.0 2023-06-20 23:54:13,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=830394.0, ans=0.0 2023-06-20 23:55:02,575 INFO [train.py:996] (1/4) Epoch 5, batch 16450, loss[loss=0.2682, simple_loss=0.3441, pruned_loss=0.09612, over 21449.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3257, pruned_loss=0.09177, over 4269264.90 frames. 
], batch size: 548, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:55:09,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=830574.0, ans=0.0 2023-06-20 23:55:23,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=830634.0, ans=0.125 2023-06-20 23:55:24,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-20 23:55:44,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=830694.0, ans=0.125 2023-06-20 23:56:41,433 INFO [train.py:996] (1/4) Epoch 5, batch 16500, loss[loss=0.2802, simple_loss=0.3513, pruned_loss=0.1045, over 21694.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3264, pruned_loss=0.0933, over 4268315.80 frames. ], batch size: 414, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:57:19,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.018e+02 3.661e+02 4.243e+02 1.006e+03, threshold=7.323e+02, percent-clipped=9.0 2023-06-20 23:57:22,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=830994.0, ans=0.125 2023-06-20 23:57:37,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=830994.0, ans=0.1 2023-06-20 23:57:50,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=831054.0, ans=0.125 2023-06-20 23:58:06,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=831114.0, ans=0.125 2023-06-20 23:58:23,176 INFO [train.py:996] (1/4) Epoch 5, batch 16550, loss[loss=0.2697, simple_loss=0.3293, pruned_loss=0.1051, over 21778.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3231, pruned_loss=0.08989, over 4267909.12 frames. ], batch size: 124, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:58:34,507 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:58:44,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=831174.0, ans=0.125 2023-06-20 23:59:33,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=831354.0, ans=0.0 2023-06-20 23:59:40,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=831354.0, ans=0.125 2023-06-21 00:00:14,131 INFO [train.py:996] (1/4) Epoch 5, batch 16600, loss[loss=0.3209, simple_loss=0.4171, pruned_loss=0.1123, over 21618.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3307, pruned_loss=0.09309, over 4277551.10 frames. 
], batch size: 389, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:00:39,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=831534.0, ans=0.1 2023-06-21 00:00:41,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=831534.0, ans=0.125 2023-06-21 00:00:42,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.262e+02 3.858e+02 4.542e+02 8.769e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-21 00:01:31,054 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:01:57,512 INFO [train.py:996] (1/4) Epoch 5, batch 16650, loss[loss=0.3022, simple_loss=0.3624, pruned_loss=0.121, over 21279.00 frames. ], tot_loss[loss=0.265, simple_loss=0.339, pruned_loss=0.09548, over 4281891.40 frames. ], batch size: 176, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:02:34,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=831834.0, ans=0.125 2023-06-21 00:03:40,740 INFO [train.py:996] (1/4) Epoch 5, batch 16700, loss[loss=0.254, simple_loss=0.3357, pruned_loss=0.08613, over 21724.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3401, pruned_loss=0.0966, over 4282978.14 frames. ], batch size: 351, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:04:18,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 2.932e+02 3.507e+02 4.315e+02 8.242e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-21 00:04:54,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=832254.0, ans=0.125 2023-06-21 00:05:29,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-21 00:05:30,375 INFO [train.py:996] (1/4) Epoch 5, batch 16750, loss[loss=0.2817, simple_loss=0.3592, pruned_loss=0.1021, over 21704.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3438, pruned_loss=0.09936, over 4283979.93 frames. ], batch size: 332, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:05:32,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=832374.0, ans=0.2 2023-06-21 00:06:08,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=832434.0, ans=0.125 2023-06-21 00:06:57,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=832614.0, ans=0.125 2023-06-21 00:07:16,823 INFO [train.py:996] (1/4) Epoch 5, batch 16800, loss[loss=0.2552, simple_loss=0.3406, pruned_loss=0.08492, over 21407.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.348, pruned_loss=0.09975, over 4282922.31 frames. 
], batch size: 548, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:07:23,513 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:07:48,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.479e+02 3.935e+02 4.857e+02 8.503e+02, threshold=7.870e+02, percent-clipped=2.0 2023-06-21 00:08:27,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=832854.0, ans=0.125 2023-06-21 00:08:30,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=832854.0, ans=0.0 2023-06-21 00:08:37,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-21 00:08:55,640 INFO [train.py:996] (1/4) Epoch 5, batch 16850, loss[loss=0.2031, simple_loss=0.2512, pruned_loss=0.07746, over 20020.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3426, pruned_loss=0.09882, over 4277826.98 frames. ], batch size: 704, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:09:00,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=832974.0, ans=0.125 2023-06-21 00:09:02,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=832974.0, ans=0.125 2023-06-21 00:09:32,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=833094.0, ans=0.1 2023-06-21 00:10:25,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-21 00:10:31,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=833214.0, ans=0.1 2023-06-21 00:10:35,967 INFO [train.py:996] (1/4) Epoch 5, batch 16900, loss[loss=0.1974, simple_loss=0.2672, pruned_loss=0.06383, over 21673.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3365, pruned_loss=0.09742, over 4276853.19 frames. ], batch size: 247, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:10:57,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-06-21 00:11:07,325 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.951e+02 3.440e+02 4.010e+02 6.855e+02, threshold=6.879e+02, percent-clipped=0.0 2023-06-21 00:11:36,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-21 00:11:42,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=833454.0, ans=0.2 2023-06-21 00:11:55,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=833514.0, ans=0.04949747468305833 2023-06-21 00:12:09,925 INFO [train.py:996] (1/4) Epoch 5, batch 16950, loss[loss=0.2415, simple_loss=0.3056, pruned_loss=0.08874, over 21696.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3295, pruned_loss=0.09485, over 4268249.67 frames. 
], batch size: 263, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:12:26,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-21 00:13:59,416 INFO [train.py:996] (1/4) Epoch 5, batch 17000, loss[loss=0.2466, simple_loss=0.3133, pruned_loss=0.08998, over 21811.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3285, pruned_loss=0.09634, over 4282106.58 frames. ], batch size: 124, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:14:09,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=833874.0, ans=0.1 2023-06-21 00:14:14,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=833934.0, ans=0.125 2023-06-21 00:14:22,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=833934.0, ans=0.125 2023-06-21 00:14:27,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 2.869e+02 3.423e+02 4.013e+02 9.065e+02, threshold=6.846e+02, percent-clipped=1.0 2023-06-21 00:14:44,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-21 00:15:36,726 INFO [train.py:996] (1/4) Epoch 5, batch 17050, loss[loss=0.2658, simple_loss=0.3423, pruned_loss=0.09463, over 21910.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3356, pruned_loss=0.09779, over 4277724.97 frames. ], batch size: 316, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:15:40,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-21 00:16:11,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=834294.0, ans=0.125 2023-06-21 00:16:18,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=834294.0, ans=0.1 2023-06-21 00:16:43,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=834354.0, ans=0.125 2023-06-21 00:16:48,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834414.0, ans=0.1 2023-06-21 00:17:14,710 INFO [train.py:996] (1/4) Epoch 5, batch 17100, loss[loss=0.2468, simple_loss=0.3178, pruned_loss=0.08786, over 21462.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3351, pruned_loss=0.0991, over 4285162.69 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:17:23,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=834474.0, ans=0.125 2023-06-21 00:17:28,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.93 vs. limit=22.5 2023-06-21 00:17:43,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.091e+02 3.634e+02 4.796e+02 1.009e+03, threshold=7.268e+02, percent-clipped=8.0 2023-06-21 00:17:50,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-06-21 00:18:06,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=834654.0, ans=0.125 2023-06-21 00:18:23,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-21 00:18:53,489 INFO [train.py:996] (1/4) Epoch 5, batch 17150, loss[loss=0.3065, simple_loss=0.4014, pruned_loss=0.1058, over 19784.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3311, pruned_loss=0.09863, over 4286496.59 frames. ], batch size: 702, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:18:58,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=834774.0, ans=0.1 2023-06-21 00:19:31,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=834894.0, ans=0.125 2023-06-21 00:19:31,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=22.5 2023-06-21 00:19:32,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=834894.0, ans=0.125 2023-06-21 00:19:53,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=834954.0, ans=0.125 2023-06-21 00:20:33,447 INFO [train.py:996] (1/4) Epoch 5, batch 17200, loss[loss=0.3245, simple_loss=0.386, pruned_loss=0.1315, over 21536.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3306, pruned_loss=0.09848, over 4288382.14 frames. ], batch size: 131, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:20:55,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=835134.0, ans=0.0 2023-06-21 00:21:12,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 2.764e+02 3.023e+02 3.387e+02 5.035e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-21 00:21:55,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-21 00:22:19,266 INFO [train.py:996] (1/4) Epoch 5, batch 17250, loss[loss=0.2412, simple_loss=0.2902, pruned_loss=0.0961, over 19988.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3349, pruned_loss=0.1004, over 4286484.85 frames. ], batch size: 702, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:22:34,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=835434.0, ans=0.2 2023-06-21 00:22:52,000 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:23:50,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:23:52,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:24:02,202 INFO [train.py:996] (1/4) Epoch 5, batch 17300, loss[loss=0.2721, simple_loss=0.3425, pruned_loss=0.1009, over 21928.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3402, pruned_loss=0.1032, over 4280890.89 frames. 
], batch size: 372, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:24:04,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=835674.0, ans=0.125 2023-06-21 00:24:37,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=835734.0, ans=0.125 2023-06-21 00:24:41,693 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.738e+02 3.630e+02 4.657e+02 6.212e+02 1.066e+03, threshold=9.314e+02, percent-clipped=26.0 2023-06-21 00:25:10,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-21 00:25:32,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=835914.0, ans=0.125 2023-06-21 00:25:48,508 INFO [train.py:996] (1/4) Epoch 5, batch 17350, loss[loss=0.2387, simple_loss=0.3301, pruned_loss=0.07364, over 21805.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3416, pruned_loss=0.103, over 4281497.19 frames. ], batch size: 371, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:26:42,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=836094.0, ans=10.0 2023-06-21 00:26:55,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=836154.0, ans=0.125 2023-06-21 00:27:29,133 INFO [train.py:996] (1/4) Epoch 5, batch 17400, loss[loss=0.3455, simple_loss=0.4115, pruned_loss=0.1398, over 21532.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3386, pruned_loss=0.09888, over 4272106.41 frames. ], batch size: 508, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:28:04,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=836334.0, ans=0.0 2023-06-21 00:28:10,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.783e+02 3.227e+02 3.615e+02 5.491e+02, threshold=6.454e+02, percent-clipped=0.0 2023-06-21 00:28:26,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=836394.0, ans=0.125 2023-06-21 00:28:34,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-21 00:28:38,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=836454.0, ans=0.125 2023-06-21 00:28:44,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=836454.0, ans=0.125 2023-06-21 00:29:00,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836514.0, ans=0.1 2023-06-21 00:29:16,077 INFO [train.py:996] (1/4) Epoch 5, batch 17450, loss[loss=0.2069, simple_loss=0.3037, pruned_loss=0.05506, over 21716.00 frames. ], tot_loss[loss=0.264, simple_loss=0.336, pruned_loss=0.09601, over 4268283.37 frames. 
], batch size: 351, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:29:51,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=836634.0, ans=0.125 2023-06-21 00:30:43,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-21 00:31:00,454 INFO [train.py:996] (1/4) Epoch 5, batch 17500, loss[loss=0.2175, simple_loss=0.2848, pruned_loss=0.07514, over 16402.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3318, pruned_loss=0.09291, over 4268748.56 frames. ], batch size: 60, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:31:00,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=836874.0, ans=0.0 2023-06-21 00:31:34,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.759e+02 3.126e+02 4.015e+02 6.726e+02, threshold=6.252e+02, percent-clipped=1.0 2023-06-21 00:31:34,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=836994.0, ans=0.125 2023-06-21 00:31:34,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=836994.0, ans=0.04949747468305833 2023-06-21 00:31:47,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836994.0, ans=0.1 2023-06-21 00:31:50,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=837054.0, ans=0.125 2023-06-21 00:32:03,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=837054.0, ans=0.0 2023-06-21 00:32:32,775 INFO [train.py:996] (1/4) Epoch 5, batch 17550, loss[loss=0.2399, simple_loss=0.3306, pruned_loss=0.07464, over 21428.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3312, pruned_loss=0.09182, over 4274514.87 frames. ], batch size: 131, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:33:42,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=837354.0, ans=0.1 2023-06-21 00:34:18,778 INFO [train.py:996] (1/4) Epoch 5, batch 17600, loss[loss=0.2383, simple_loss=0.3223, pruned_loss=0.07717, over 21335.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3319, pruned_loss=0.09067, over 4274904.17 frames. ], batch size: 176, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:34:22,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=837474.0, ans=0.05 2023-06-21 00:34:53,981 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.862e+02 3.527e+02 4.406e+02 6.176e+02, threshold=7.053e+02, percent-clipped=0.0 2023-06-21 00:35:03,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. 
limit=15.0 2023-06-21 00:35:30,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=837654.0, ans=0.125 2023-06-21 00:35:33,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=837654.0, ans=0.2 2023-06-21 00:35:36,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=837714.0, ans=0.125 2023-06-21 00:35:59,199 INFO [train.py:996] (1/4) Epoch 5, batch 17650, loss[loss=0.1747, simple_loss=0.2408, pruned_loss=0.05426, over 21241.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3297, pruned_loss=0.09114, over 4269316.41 frames. ], batch size: 176, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:36:19,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=837834.0, ans=0.0 2023-06-21 00:36:30,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=837834.0, ans=0.0 2023-06-21 00:36:38,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=837894.0, ans=0.125 2023-06-21 00:36:46,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=837894.0, ans=0.0 2023-06-21 00:37:08,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-21 00:37:21,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-21 00:37:42,000 INFO [train.py:996] (1/4) Epoch 5, batch 17700, loss[loss=0.2784, simple_loss=0.3529, pruned_loss=0.1019, over 21435.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3233, pruned_loss=0.08836, over 4264340.35 frames. ], batch size: 194, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:37:43,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=838074.0, ans=0.2 2023-06-21 00:37:45,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=838074.0, ans=0.125 2023-06-21 00:37:49,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-21 00:38:11,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-21 00:38:16,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.09 vs. 
limit=15.0 2023-06-21 00:38:17,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.950e+02 3.482e+02 4.668e+02 9.100e+02, threshold=6.963e+02, percent-clipped=4.0 2023-06-21 00:38:41,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=838194.0, ans=0.125 2023-06-21 00:38:54,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=838254.0, ans=0.2 2023-06-21 00:39:21,459 INFO [train.py:996] (1/4) Epoch 5, batch 17750, loss[loss=0.2825, simple_loss=0.3494, pruned_loss=0.1078, over 21374.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3312, pruned_loss=0.09287, over 4268877.81 frames. ], batch size: 176, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:39:24,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.51 vs. limit=15.0 2023-06-21 00:39:31,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-21 00:40:12,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=838494.0, ans=0.125 2023-06-21 00:41:07,658 INFO [train.py:996] (1/4) Epoch 5, batch 17800, loss[loss=0.2507, simple_loss=0.3261, pruned_loss=0.08764, over 21722.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3301, pruned_loss=0.091, over 4268718.90 frames. ], batch size: 298, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:41:49,220 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.927e+02 3.424e+02 3.955e+02 9.585e+02, threshold=6.848e+02, percent-clipped=3.0 2023-06-21 00:42:02,724 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:42:02,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=838794.0, ans=0.125 2023-06-21 00:42:12,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=838854.0, ans=0.0 2023-06-21 00:42:23,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=838854.0, ans=0.125 2023-06-21 00:42:40,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=838914.0, ans=0.125 2023-06-21 00:42:49,084 INFO [train.py:996] (1/4) Epoch 5, batch 17850, loss[loss=0.2843, simple_loss=0.3527, pruned_loss=0.108, over 21829.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3327, pruned_loss=0.09279, over 4264625.59 frames. ], batch size: 118, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:42:54,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=838974.0, ans=0.125 2023-06-21 00:44:25,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=839214.0, ans=0.125 2023-06-21 00:44:29,933 INFO [train.py:996] (1/4) Epoch 5, batch 17900, loss[loss=0.2773, simple_loss=0.3741, pruned_loss=0.09024, over 21618.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3382, pruned_loss=0.09477, over 4268041.85 frames. 
], batch size: 389, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:44:44,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=839274.0, ans=0.1 2023-06-21 00:44:53,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-21 00:45:07,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=839334.0, ans=0.1 2023-06-21 00:45:19,609 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.900e+02 3.378e+02 3.906e+02 6.654e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-21 00:45:48,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=839454.0, ans=0.1 2023-06-21 00:46:22,396 INFO [train.py:996] (1/4) Epoch 5, batch 17950, loss[loss=0.2459, simple_loss=0.3392, pruned_loss=0.07632, over 21591.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3371, pruned_loss=0.09063, over 4266676.90 frames. ], batch size: 441, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:46:33,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=839574.0, ans=0.125 2023-06-21 00:47:15,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=839694.0, ans=10.0 2023-06-21 00:47:31,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=839754.0, ans=0.125 2023-06-21 00:48:01,508 INFO [train.py:996] (1/4) Epoch 5, batch 18000, loss[loss=0.2392, simple_loss=0.2977, pruned_loss=0.09038, over 14853.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3296, pruned_loss=0.08841, over 4257074.55 frames. ], batch size: 62, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:48:01,508 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 00:48:17,782 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2664, simple_loss=0.3658, pruned_loss=0.08353, over 1796401.00 frames. 2023-06-21 00:48:17,783 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 00:48:26,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=839874.0, ans=0.2 2023-06-21 00:49:02,447 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.602e+02 3.109e+02 3.503e+02 6.028e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-21 00:49:05,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. 
limit=15.0 2023-06-21 00:49:13,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=839994.0, ans=0.125 2023-06-21 00:49:21,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=840054.0, ans=0.0 2023-06-21 00:49:46,255 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:49:46,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=840114.0, ans=0.2 2023-06-21 00:49:58,588 INFO [train.py:996] (1/4) Epoch 5, batch 18050, loss[loss=0.207, simple_loss=0.2643, pruned_loss=0.07483, over 21224.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3227, pruned_loss=0.08691, over 4261141.59 frames. ], batch size: 548, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:50:45,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=840294.0, ans=0.2 2023-06-21 00:50:48,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=840294.0, ans=0.125 2023-06-21 00:50:54,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=840294.0, ans=0.0 2023-06-21 00:50:59,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=840294.0, ans=0.0 2023-06-21 00:51:17,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=840354.0, ans=0.1 2023-06-21 00:51:28,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=840414.0, ans=0.0 2023-06-21 00:51:35,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=840414.0, ans=0.125 2023-06-21 00:51:39,036 INFO [train.py:996] (1/4) Epoch 5, batch 18100, loss[loss=0.2293, simple_loss=0.3353, pruned_loss=0.06168, over 20788.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3297, pruned_loss=0.09106, over 4263768.68 frames. ], batch size: 607, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:52:11,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=840534.0, ans=0.2 2023-06-21 00:52:27,298 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.902e+02 3.495e+02 4.106e+02 8.308e+02, threshold=6.990e+02, percent-clipped=1.0 2023-06-21 00:53:22,796 INFO [train.py:996] (1/4) Epoch 5, batch 18150, loss[loss=0.2609, simple_loss=0.3035, pruned_loss=0.1091, over 21078.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3306, pruned_loss=0.0913, over 4263778.51 frames. ], batch size: 143, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:53:51,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=840834.0, ans=0.125 2023-06-21 00:54:13,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.55 vs. 
limit=15.0 2023-06-21 00:54:22,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=840954.0, ans=0.0 2023-06-21 00:54:49,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=841014.0, ans=0.125 2023-06-21 00:54:54,848 INFO [train.py:996] (1/4) Epoch 5, batch 18200, loss[loss=0.235, simple_loss=0.2958, pruned_loss=0.08707, over 21745.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3237, pruned_loss=0.09065, over 4259152.77 frames. ], batch size: 283, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:55:33,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=841134.0, ans=0.0 2023-06-21 00:55:37,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.776e+02 3.291e+02 4.569e+02 1.152e+03, threshold=6.583e+02, percent-clipped=3.0 2023-06-21 00:56:08,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841254.0, ans=0.1 2023-06-21 00:56:20,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-21 00:56:32,320 INFO [train.py:996] (1/4) Epoch 5, batch 18250, loss[loss=0.2456, simple_loss=0.3102, pruned_loss=0.09051, over 21710.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3166, pruned_loss=0.0873, over 4265309.59 frames. ], batch size: 389, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:56:58,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=841434.0, ans=0.09899494936611666 2023-06-21 00:57:39,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=841554.0, ans=0.0 2023-06-21 00:57:58,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-21 00:58:11,253 INFO [train.py:996] (1/4) Epoch 5, batch 18300, loss[loss=0.2993, simple_loss=0.3934, pruned_loss=0.1026, over 21796.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3166, pruned_loss=0.08689, over 4273976.25 frames. ], batch size: 282, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:58:54,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.809e+02 3.144e+02 3.817e+02 6.593e+02, threshold=6.288e+02, percent-clipped=1.0 2023-06-21 00:59:10,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=841854.0, ans=0.0 2023-06-21 00:59:49,938 INFO [train.py:996] (1/4) Epoch 5, batch 18350, loss[loss=0.2233, simple_loss=0.291, pruned_loss=0.0778, over 21899.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3193, pruned_loss=0.08682, over 4266845.13 frames. ], batch size: 107, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 01:00:18,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-21 01:01:30,161 INFO [train.py:996] (1/4) Epoch 5, batch 18400, loss[loss=0.1806, simple_loss=0.2628, pruned_loss=0.0492, over 21681.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3169, pruned_loss=0.08464, over 4264233.72 frames. 
], batch size: 247, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:01:58,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=842334.0, ans=0.0 2023-06-21 01:02:14,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.972e+02 3.476e+02 4.424e+02 9.442e+02, threshold=6.951e+02, percent-clipped=6.0 2023-06-21 01:02:18,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=842394.0, ans=0.125 2023-06-21 01:03:09,996 INFO [train.py:996] (1/4) Epoch 5, batch 18450, loss[loss=0.1916, simple_loss=0.2784, pruned_loss=0.05238, over 21664.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3127, pruned_loss=0.08171, over 4252757.49 frames. ], batch size: 298, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:03:15,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=842574.0, ans=10.0 2023-06-21 01:03:32,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=842574.0, ans=0.0 2023-06-21 01:04:04,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=842694.0, ans=0.035 2023-06-21 01:04:26,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=842754.0, ans=0.125 2023-06-21 01:04:36,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=842814.0, ans=0.0 2023-06-21 01:04:39,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-21 01:04:47,239 INFO [train.py:996] (1/4) Epoch 5, batch 18500, loss[loss=0.2383, simple_loss=0.3034, pruned_loss=0.08662, over 21626.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3079, pruned_loss=0.08073, over 4248298.70 frames. ], batch size: 415, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:05:30,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.509e+02 2.864e+02 3.266e+02 4.867e+02, threshold=5.728e+02, percent-clipped=0.0 2023-06-21 01:05:49,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=843054.0, ans=0.125 2023-06-21 01:05:56,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=843054.0, ans=0.0 2023-06-21 01:06:26,365 INFO [train.py:996] (1/4) Epoch 5, batch 18550, loss[loss=0.2446, simple_loss=0.3097, pruned_loss=0.08972, over 21437.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3048, pruned_loss=0.08024, over 4247178.09 frames. 
], batch size: 389, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:06:57,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=843234.0, ans=0.125 2023-06-21 01:07:33,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=843354.0, ans=0.125 2023-06-21 01:07:42,945 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:08:03,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=843414.0, ans=0.1 2023-06-21 01:08:06,345 INFO [train.py:996] (1/4) Epoch 5, batch 18600, loss[loss=0.205, simple_loss=0.2772, pruned_loss=0.0664, over 21188.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3028, pruned_loss=0.08123, over 4245734.53 frames. ], batch size: 143, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:08:35,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=843534.0, ans=0.1 2023-06-21 01:08:43,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=843534.0, ans=0.125 2023-06-21 01:08:49,503 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 2.765e+02 3.271e+02 3.896e+02 6.265e+02, threshold=6.542e+02, percent-clipped=2.0 2023-06-21 01:08:52,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-21 01:09:40,817 INFO [train.py:996] (1/4) Epoch 5, batch 18650, loss[loss=0.2123, simple_loss=0.2672, pruned_loss=0.07866, over 20178.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3021, pruned_loss=0.08179, over 4249800.04 frames. ], batch size: 703, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:10:15,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=843834.0, ans=0.125 2023-06-21 01:10:17,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-21 01:10:21,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=843894.0, ans=0.0 2023-06-21 01:11:00,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.51 vs. limit=5.0 2023-06-21 01:11:10,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-21 01:11:13,511 INFO [train.py:996] (1/4) Epoch 5, batch 18700, loss[loss=0.228, simple_loss=0.2888, pruned_loss=0.08356, over 21233.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3007, pruned_loss=0.08325, over 4249376.27 frames. 
], batch size: 143, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:11:26,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=844074.0, ans=0.125 2023-06-21 01:11:56,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.719e+02 3.161e+02 4.088e+02 6.146e+02, threshold=6.321e+02, percent-clipped=0.0 2023-06-21 01:12:08,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=844194.0, ans=0.0 2023-06-21 01:12:12,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-21 01:12:52,642 INFO [train.py:996] (1/4) Epoch 5, batch 18750, loss[loss=0.2482, simple_loss=0.3047, pruned_loss=0.09581, over 21577.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3036, pruned_loss=0.08623, over 4246281.79 frames. ], batch size: 195, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:13:28,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=844434.0, ans=0.0 2023-06-21 01:13:49,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=844554.0, ans=0.09899494936611666 2023-06-21 01:14:18,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=844614.0, ans=0.125 2023-06-21 01:14:32,806 INFO [train.py:996] (1/4) Epoch 5, batch 18800, loss[loss=0.1743, simple_loss=0.2576, pruned_loss=0.04549, over 21452.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3096, pruned_loss=0.08684, over 4244280.73 frames. ], batch size: 131, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:14:39,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=844674.0, ans=0.125 2023-06-21 01:14:44,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=844674.0, ans=0.0 2023-06-21 01:15:10,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.115e+02 3.803e+02 4.953e+02 7.292e+02, threshold=7.607e+02, percent-clipped=7.0 2023-06-21 01:15:22,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=844794.0, ans=0.1 2023-06-21 01:15:32,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=844854.0, ans=0.2 2023-06-21 01:15:33,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=844854.0, ans=0.125 2023-06-21 01:15:46,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=844854.0, ans=0.0 2023-06-21 01:15:50,428 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:15:54,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-21 01:16:07,893 INFO [train.py:996] (1/4) Epoch 5, batch 18850, loss[loss=0.2333, simple_loss=0.2958, pruned_loss=0.08538, over 21728.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3081, pruned_loss=0.08334, over 4252722.90 frames. 
], batch size: 112, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:16:23,500 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.66 vs. limit=10.0 2023-06-21 01:16:45,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=845034.0, ans=0.0 2023-06-21 01:17:32,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=845214.0, ans=0.1 2023-06-21 01:17:46,565 INFO [train.py:996] (1/4) Epoch 5, batch 18900, loss[loss=0.2434, simple_loss=0.2992, pruned_loss=0.09376, over 21627.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3038, pruned_loss=0.08216, over 4253662.84 frames. ], batch size: 230, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:18:20,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.16 vs. limit=10.0 2023-06-21 01:18:31,688 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.559e+02 2.920e+02 3.727e+02 6.054e+02, threshold=5.840e+02, percent-clipped=0.0 2023-06-21 01:19:27,654 INFO [train.py:996] (1/4) Epoch 5, batch 18950, loss[loss=0.2601, simple_loss=0.3504, pruned_loss=0.08493, over 21599.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3048, pruned_loss=0.08415, over 4254119.74 frames. ], batch size: 212, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:21:08,458 INFO [train.py:996] (1/4) Epoch 5, batch 19000, loss[loss=0.2393, simple_loss=0.2973, pruned_loss=0.09069, over 21572.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.313, pruned_loss=0.08635, over 4262689.19 frames. ], batch size: 194, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:21:41,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=845934.0, ans=0.95 2023-06-21 01:21:48,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=845994.0, ans=0.125 2023-06-21 01:21:53,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.876e+02 3.456e+02 4.188e+02 7.110e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-21 01:22:24,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.80 vs. limit=6.0 2023-06-21 01:22:28,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=846054.0, ans=0.125 2023-06-21 01:22:41,212 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:22:47,088 INFO [train.py:996] (1/4) Epoch 5, batch 19050, loss[loss=0.267, simple_loss=0.3233, pruned_loss=0.1054, over 21764.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3191, pruned_loss=0.09102, over 4267223.72 frames. 
], batch size: 112, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:22:53,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=846174.0, ans=0.125 2023-06-21 01:23:21,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=846234.0, ans=0.035 2023-06-21 01:23:23,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.31 vs. limit=10.0 2023-06-21 01:23:30,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=846294.0, ans=0.2 2023-06-21 01:24:13,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=846414.0, ans=0.05 2023-06-21 01:24:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=846414.0, ans=0.0 2023-06-21 01:24:31,386 INFO [train.py:996] (1/4) Epoch 5, batch 19100, loss[loss=0.2228, simple_loss=0.2798, pruned_loss=0.08288, over 21840.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3189, pruned_loss=0.09309, over 4272899.36 frames. ], batch size: 118, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:24:43,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=846474.0, ans=22.5 2023-06-21 01:25:16,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-21 01:25:22,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.859e+02 3.382e+02 4.111e+02 6.618e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-21 01:25:53,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=846714.0, ans=0.125 2023-06-21 01:26:17,774 INFO [train.py:996] (1/4) Epoch 5, batch 19150, loss[loss=0.249, simple_loss=0.3434, pruned_loss=0.07735, over 21635.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3221, pruned_loss=0.09395, over 4269289.20 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:26:47,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-21 01:27:41,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=846954.0, ans=0.0 2023-06-21 01:28:00,623 INFO [train.py:996] (1/4) Epoch 5, batch 19200, loss[loss=0.2485, simple_loss=0.3465, pruned_loss=0.07524, over 21576.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3314, pruned_loss=0.09384, over 4274147.01 frames. 
], batch size: 230, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 01:28:31,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=847134.0, ans=0.125 2023-06-21 01:28:35,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=847134.0, ans=0.1 2023-06-21 01:28:44,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=847194.0, ans=0.2 2023-06-21 01:28:47,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.827e+02 3.204e+02 4.140e+02 7.071e+02, threshold=6.408e+02, percent-clipped=1.0 2023-06-21 01:29:01,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=847254.0, ans=0.125 2023-06-21 01:29:10,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=847254.0, ans=0.125 2023-06-21 01:29:25,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=847314.0, ans=0.125 2023-06-21 01:29:35,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=847314.0, ans=0.125 2023-06-21 01:29:37,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=847314.0, ans=0.1 2023-06-21 01:29:41,523 INFO [train.py:996] (1/4) Epoch 5, batch 19250, loss[loss=0.1527, simple_loss=0.245, pruned_loss=0.03021, over 21450.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3335, pruned_loss=0.08962, over 4267764.75 frames. ], batch size: 211, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:29:47,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. limit=15.0 2023-06-21 01:29:59,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=847374.0, ans=0.125 2023-06-21 01:30:17,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=847434.0, ans=0.125 2023-06-21 01:30:37,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=847554.0, ans=0.125 2023-06-21 01:31:14,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 01:31:20,349 INFO [train.py:996] (1/4) Epoch 5, batch 19300, loss[loss=0.2263, simple_loss=0.3185, pruned_loss=0.06706, over 21682.00 frames. ], tot_loss[loss=0.253, simple_loss=0.329, pruned_loss=0.08846, over 4273556.29 frames. ], batch size: 414, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:31:24,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=847674.0, ans=0.125 2023-06-21 01:32:07,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.704e+02 3.211e+02 3.924e+02 6.818e+02, threshold=6.422e+02, percent-clipped=2.0 2023-06-21 01:32:23,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.09 vs. 
limit=15.0 2023-06-21 01:32:36,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=847854.0, ans=0.1 2023-06-21 01:32:37,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=847854.0, ans=0.125 2023-06-21 01:33:01,212 INFO [train.py:996] (1/4) Epoch 5, batch 19350, loss[loss=0.2635, simple_loss=0.3457, pruned_loss=0.09062, over 21568.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3222, pruned_loss=0.08451, over 4270545.22 frames. ], batch size: 441, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:33:25,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=848034.0, ans=0.125 2023-06-21 01:33:44,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-21 01:34:11,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=848154.0, ans=0.125 2023-06-21 01:34:38,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=848274.0, ans=0.125 2023-06-21 01:34:39,455 INFO [train.py:996] (1/4) Epoch 5, batch 19400, loss[loss=0.2507, simple_loss=0.3277, pruned_loss=0.0869, over 21856.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3186, pruned_loss=0.08338, over 4276198.58 frames. ], batch size: 415, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:34:51,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=848274.0, ans=0.125 2023-06-21 01:35:25,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.765e+02 3.064e+02 3.598e+02 5.687e+02, threshold=6.129e+02, percent-clipped=0.0 2023-06-21 01:36:22,714 INFO [train.py:996] (1/4) Epoch 5, batch 19450, loss[loss=0.2247, simple_loss=0.2786, pruned_loss=0.08544, over 21602.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3143, pruned_loss=0.08456, over 4270526.88 frames. ], batch size: 231, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:37:45,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=848814.0, ans=0.0 2023-06-21 01:37:58,157 INFO [train.py:996] (1/4) Epoch 5, batch 19500, loss[loss=0.2332, simple_loss=0.2847, pruned_loss=0.09087, over 21507.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3109, pruned_loss=0.08597, over 4271521.47 frames. ], batch size: 441, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:38:08,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-21 01:38:26,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. 
limit=15.0 2023-06-21 01:38:30,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=848934.0, ans=0.1 2023-06-21 01:38:44,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 2.810e+02 3.330e+02 3.940e+02 7.380e+02, threshold=6.661e+02, percent-clipped=6.0 2023-06-21 01:39:16,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=849054.0, ans=0.125 2023-06-21 01:39:24,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-21 01:39:36,259 INFO [train.py:996] (1/4) Epoch 5, batch 19550, loss[loss=0.2328, simple_loss=0.2989, pruned_loss=0.0833, over 21772.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3092, pruned_loss=0.0852, over 4265787.00 frames. ], batch size: 124, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 01:39:38,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-21 01:39:47,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=849174.0, ans=0.2 2023-06-21 01:40:45,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-21 01:41:07,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=849414.0, ans=0.125 2023-06-21 01:41:10,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-21 01:41:19,064 INFO [train.py:996] (1/4) Epoch 5, batch 19600, loss[loss=0.267, simple_loss=0.3205, pruned_loss=0.1067, over 21315.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3115, pruned_loss=0.08609, over 4273256.53 frames. ], batch size: 159, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:41:52,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=849534.0, ans=0.125 2023-06-21 01:42:00,753 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.061e+02 3.495e+02 4.046e+02 6.477e+02, threshold=6.990e+02, percent-clipped=0.0 2023-06-21 01:42:02,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=849594.0, ans=0.125 2023-06-21 01:42:03,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-21 01:42:10,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=849594.0, ans=0.0 2023-06-21 01:42:11,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-21 01:42:41,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. 
limit=10.0 2023-06-21 01:42:57,938 INFO [train.py:996] (1/4) Epoch 5, batch 19650, loss[loss=0.266, simple_loss=0.323, pruned_loss=0.1045, over 21372.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3176, pruned_loss=0.09066, over 4276533.47 frames. ], batch size: 159, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:43:23,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=849834.0, ans=0.125 2023-06-21 01:43:53,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=849894.0, ans=0.2 2023-06-21 01:44:05,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=849954.0, ans=0.125 2023-06-21 01:44:16,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=849954.0, ans=0.95 2023-06-21 01:44:35,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=850014.0, ans=0.0 2023-06-21 01:44:46,866 INFO [train.py:996] (1/4) Epoch 5, batch 19700, loss[loss=0.2529, simple_loss=0.3152, pruned_loss=0.09529, over 19989.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3215, pruned_loss=0.09261, over 4269030.70 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:45:00,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=850074.0, ans=0.125 2023-06-21 01:45:34,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.950e+02 3.404e+02 4.157e+02 1.102e+03, threshold=6.808e+02, percent-clipped=4.0 2023-06-21 01:46:04,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=850254.0, ans=0.0 2023-06-21 01:46:12,667 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0 2023-06-21 01:46:27,914 INFO [train.py:996] (1/4) Epoch 5, batch 19750, loss[loss=0.2703, simple_loss=0.3529, pruned_loss=0.09385, over 21656.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3327, pruned_loss=0.09382, over 4272452.72 frames. ], batch size: 263, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:46:42,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=850374.0, ans=0.125 2023-06-21 01:47:13,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=850494.0, ans=0.125 2023-06-21 01:47:59,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850614.0, ans=0.1 2023-06-21 01:48:06,768 INFO [train.py:996] (1/4) Epoch 5, batch 19800, loss[loss=0.3236, simple_loss=0.4145, pruned_loss=0.1164, over 19849.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3331, pruned_loss=0.09524, over 4279241.81 frames. 
], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:48:54,003 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.157e+02 4.058e+02 5.975e+02 1.111e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 01:49:12,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=850794.0, ans=0.2 2023-06-21 01:49:52,582 INFO [train.py:996] (1/4) Epoch 5, batch 19850, loss[loss=0.1967, simple_loss=0.2599, pruned_loss=0.06673, over 21738.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3234, pruned_loss=0.08956, over 4280278.74 frames. ], batch size: 112, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:50:17,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851034.0, ans=0.1 2023-06-21 01:50:29,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=851094.0, ans=0.125 2023-06-21 01:51:32,064 INFO [train.py:996] (1/4) Epoch 5, batch 19900, loss[loss=0.2185, simple_loss=0.2818, pruned_loss=0.07753, over 21270.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3223, pruned_loss=0.08619, over 4269595.90 frames. ], batch size: 131, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:51:35,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=851274.0, ans=0.125 2023-06-21 01:51:50,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=851274.0, ans=0.125 2023-06-21 01:52:19,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.633e+02 2.856e+02 3.289e+02 5.435e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-21 01:52:19,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=851394.0, ans=0.1 2023-06-21 01:52:27,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=851394.0, ans=0.0 2023-06-21 01:52:28,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-21 01:52:30,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851454.0, ans=0.125 2023-06-21 01:52:41,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-21 01:52:46,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=851514.0, ans=0.125 2023-06-21 01:53:08,174 INFO [train.py:996] (1/4) Epoch 5, batch 19950, loss[loss=0.2162, simple_loss=0.29, pruned_loss=0.07114, over 21631.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3157, pruned_loss=0.08577, over 4261636.33 frames. 
], batch size: 263, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:54:20,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=851754.0, ans=0.0 2023-06-21 01:54:23,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=851754.0, ans=0.125 2023-06-21 01:54:46,770 INFO [train.py:996] (1/4) Epoch 5, batch 20000, loss[loss=0.2535, simple_loss=0.3287, pruned_loss=0.08915, over 21807.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3149, pruned_loss=0.08553, over 4264665.13 frames. ], batch size: 351, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:54:58,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=851874.0, ans=0.0 2023-06-21 01:55:07,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=851934.0, ans=0.1 2023-06-21 01:55:16,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-21 01:55:17,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=851934.0, ans=0.1 2023-06-21 01:55:38,158 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.174e+02 3.672e+02 4.869e+02 7.405e+02, threshold=7.343e+02, percent-clipped=12.0 2023-06-21 01:55:51,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852054.0, ans=0.1 2023-06-21 01:55:55,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=852054.0, ans=0.0 2023-06-21 01:56:04,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=852114.0, ans=0.125 2023-06-21 01:56:25,457 INFO [train.py:996] (1/4) Epoch 5, batch 20050, loss[loss=0.2511, simple_loss=0.3155, pruned_loss=0.09332, over 21858.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3188, pruned_loss=0.08933, over 4274840.68 frames. ], batch size: 371, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:56:45,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=852234.0, ans=0.0 2023-06-21 01:57:30,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852354.0, ans=0.0 2023-06-21 01:58:11,251 INFO [train.py:996] (1/4) Epoch 5, batch 20100, loss[loss=0.2426, simple_loss=0.3165, pruned_loss=0.08436, over 21366.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3213, pruned_loss=0.09149, over 4281805.80 frames. 
], batch size: 131, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:58,321 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.825e+02 3.167e+02 3.914e+02 6.858e+02, threshold=6.334e+02, percent-clipped=0.0 2023-06-21 01:59:17,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=852654.0, ans=0.2 2023-06-21 01:59:19,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=852654.0, ans=0.0 2023-06-21 01:59:30,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-21 01:59:51,886 INFO [train.py:996] (1/4) Epoch 5, batch 20150, loss[loss=0.3259, simple_loss=0.3896, pruned_loss=0.1311, over 21549.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3327, pruned_loss=0.09568, over 4283390.84 frames. ], batch size: 414, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:00:22,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=852834.0, ans=0.125 2023-06-21 02:00:24,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=852834.0, ans=0.125 2023-06-21 02:00:26,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=852834.0, ans=10.0 2023-06-21 02:00:27,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=852834.0, ans=0.0 2023-06-21 02:01:17,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=853014.0, ans=0.125 2023-06-21 02:01:20,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=853014.0, ans=0.5 2023-06-21 02:01:42,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=853074.0, ans=0.125 2023-06-21 02:01:44,378 INFO [train.py:996] (1/4) Epoch 5, batch 20200, loss[loss=0.3498, simple_loss=0.4289, pruned_loss=0.1354, over 21553.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3387, pruned_loss=0.09924, over 4275637.85 frames. ], batch size: 471, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:01:45,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-21 02:02:04,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=853134.0, ans=0.0 2023-06-21 02:02:05,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. 
limit=15.0 2023-06-21 02:02:24,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=853194.0, ans=0.04949747468305833 2023-06-21 02:02:28,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.103e+02 3.828e+02 4.837e+02 8.948e+02, threshold=7.656e+02, percent-clipped=6.0 2023-06-21 02:02:49,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=853254.0, ans=0.125 2023-06-21 02:03:24,892 INFO [train.py:996] (1/4) Epoch 5, batch 20250, loss[loss=0.2515, simple_loss=0.338, pruned_loss=0.08251, over 20924.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3387, pruned_loss=0.09744, over 4263256.49 frames. ], batch size: 607, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:03:50,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-21 02:05:04,406 INFO [train.py:996] (1/4) Epoch 5, batch 20300, loss[loss=0.2212, simple_loss=0.3084, pruned_loss=0.06699, over 21765.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3352, pruned_loss=0.0935, over 4264716.43 frames. ], batch size: 282, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:05:21,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=853734.0, ans=0.2 2023-06-21 02:05:51,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.734e+02 3.068e+02 3.713e+02 6.256e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-21 02:05:55,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=853794.0, ans=0.2 2023-06-21 02:06:32,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=853914.0, ans=0.125 2023-06-21 02:06:35,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=853914.0, ans=0.0 2023-06-21 02:06:41,900 INFO [train.py:996] (1/4) Epoch 5, batch 20350, loss[loss=0.2445, simple_loss=0.317, pruned_loss=0.086, over 21822.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3347, pruned_loss=0.09364, over 4267510.17 frames. ], batch size: 102, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:07:36,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=854094.0, ans=0.1 2023-06-21 02:08:12,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=854214.0, ans=0.0 2023-06-21 02:08:17,277 INFO [train.py:996] (1/4) Epoch 5, batch 20400, loss[loss=0.2579, simple_loss=0.3334, pruned_loss=0.09117, over 21765.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.338, pruned_loss=0.09632, over 4255365.23 frames. ], batch size: 298, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:09:05,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.145e+02 3.695e+02 4.616e+02 6.973e+02, threshold=7.390e+02, percent-clipped=6.0 2023-06-21 02:09:56,705 INFO [train.py:996] (1/4) Epoch 5, batch 20450, loss[loss=0.2446, simple_loss=0.3174, pruned_loss=0.08594, over 21946.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3391, pruned_loss=0.09929, over 4261325.24 frames. 
], batch size: 316, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:10:21,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-21 02:10:34,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=854694.0, ans=0.0 2023-06-21 02:10:36,094 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:11:34,418 INFO [train.py:996] (1/4) Epoch 5, batch 20500, loss[loss=0.2628, simple_loss=0.3196, pruned_loss=0.103, over 21720.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3347, pruned_loss=0.09927, over 4253389.95 frames. ], batch size: 298, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:12:17,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.857e+02 3.263e+02 3.907e+02 6.416e+02, threshold=6.525e+02, percent-clipped=0.0 2023-06-21 02:12:35,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=855054.0, ans=0.125 2023-06-21 02:12:43,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855054.0, ans=0.0 2023-06-21 02:12:57,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-21 02:13:07,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=855114.0, ans=0.2 2023-06-21 02:13:09,819 INFO [train.py:996] (1/4) Epoch 5, batch 20550, loss[loss=0.232, simple_loss=0.2995, pruned_loss=0.08226, over 21505.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3259, pruned_loss=0.0967, over 4237335.76 frames. ], batch size: 230, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:14:27,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=855354.0, ans=0.125 2023-06-21 02:14:36,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=855414.0, ans=0.125 2023-06-21 02:14:46,284 INFO [train.py:996] (1/4) Epoch 5, batch 20600, loss[loss=0.2761, simple_loss=0.3426, pruned_loss=0.1048, over 21822.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.329, pruned_loss=0.09474, over 4224016.43 frames. ], batch size: 107, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:15:35,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 2.804e+02 3.280e+02 3.757e+02 7.089e+02, threshold=6.559e+02, percent-clipped=1.0 2023-06-21 02:15:54,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=855654.0, ans=0.2 2023-06-21 02:16:12,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=855714.0, ans=0.0 2023-06-21 02:16:12,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=855714.0, ans=0.125 2023-06-21 02:16:26,169 INFO [train.py:996] (1/4) Epoch 5, batch 20650, loss[loss=0.2174, simple_loss=0.2743, pruned_loss=0.08021, over 21422.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3242, pruned_loss=0.09467, over 4228682.12 frames. 
], batch size: 195, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:16:32,920 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:16:45,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=855834.0, ans=0.0 2023-06-21 02:16:46,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-21 02:17:44,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855954.0, ans=0.125 2023-06-21 02:18:06,935 INFO [train.py:996] (1/4) Epoch 5, batch 20700, loss[loss=0.1761, simple_loss=0.2472, pruned_loss=0.05246, over 21506.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3153, pruned_loss=0.09, over 4225507.80 frames. ], batch size: 195, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:18:09,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-21 02:18:19,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=856074.0, ans=0.95 2023-06-21 02:18:57,421 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.666e+02 3.123e+02 3.714e+02 6.425e+02, threshold=6.247e+02, percent-clipped=0.0 2023-06-21 02:19:02,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=856194.0, ans=0.2 2023-06-21 02:19:32,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=856314.0, ans=0.0 2023-06-21 02:19:53,113 INFO [train.py:996] (1/4) Epoch 5, batch 20750, loss[loss=0.3234, simple_loss=0.403, pruned_loss=0.1219, over 21690.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3167, pruned_loss=0.08926, over 4236074.26 frames. ], batch size: 389, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:19:55,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=856374.0, ans=0.0 2023-06-21 02:20:23,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=856434.0, ans=10.0 2023-06-21 02:20:25,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-06-21 02:20:26,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=856434.0, ans=0.0 2023-06-21 02:20:27,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=856434.0, ans=0.125 2023-06-21 02:21:01,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=856554.0, ans=0.0 2023-06-21 02:21:21,676 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:21:29,560 INFO [train.py:996] (1/4) Epoch 5, batch 20800, loss[loss=0.2475, simple_loss=0.3042, pruned_loss=0.09536, over 21730.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3212, pruned_loss=0.09006, over 4238779.24 frames. 
], batch size: 351, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:21:50,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=856734.0, ans=0.1 2023-06-21 02:22:12,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=856794.0, ans=0.0 2023-06-21 02:22:15,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-21 02:22:25,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.108e+02 3.901e+02 5.599e+02 9.709e+02, threshold=7.803e+02, percent-clipped=19.0 2023-06-21 02:23:14,829 INFO [train.py:996] (1/4) Epoch 5, batch 20850, loss[loss=0.2644, simple_loss=0.3233, pruned_loss=0.1028, over 21787.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3132, pruned_loss=0.08801, over 4248795.78 frames. ], batch size: 391, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:24:55,581 INFO [train.py:996] (1/4) Epoch 5, batch 20900, loss[loss=0.2615, simple_loss=0.3297, pruned_loss=0.09666, over 21793.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3156, pruned_loss=0.0904, over 4261865.49 frames. ], batch size: 124, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:24:55,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=857274.0, ans=0.125 2023-06-21 02:25:10,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=857334.0, ans=0.0 2023-06-21 02:25:16,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=857334.0, ans=0.125 2023-06-21 02:25:16,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 02:25:21,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=857334.0, ans=0.125 2023-06-21 02:25:40,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=857394.0, ans=0.125 2023-06-21 02:25:44,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.849e+02 3.550e+02 4.829e+02 8.716e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 02:25:48,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-21 02:26:13,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=857514.0, ans=0.125 2023-06-21 02:26:24,221 INFO [train.py:996] (1/4) Epoch 5, batch 20950, loss[loss=0.2135, simple_loss=0.2814, pruned_loss=0.07282, over 21859.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3105, pruned_loss=0.08561, over 4245361.03 frames. 
], batch size: 98, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:26:48,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=857634.0, ans=0.015 2023-06-21 02:27:18,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=857694.0, ans=0.125 2023-06-21 02:27:19,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.06 vs. limit=15.0 2023-06-21 02:27:50,823 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-21 02:28:02,747 INFO [train.py:996] (1/4) Epoch 5, batch 21000, loss[loss=0.2315, simple_loss=0.2986, pruned_loss=0.08221, over 21202.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3102, pruned_loss=0.08635, over 4252831.27 frames. ], batch size: 608, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:28:02,747 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 02:28:23,287 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2707, simple_loss=0.3706, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-21 02:28:23,288 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 02:28:33,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=857874.0, ans=0.125 2023-06-21 02:29:08,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.446e+02 2.941e+02 3.372e+02 5.847e+02, threshold=5.881e+02, percent-clipped=0.0 2023-06-21 02:29:39,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=858114.0, ans=0.2 2023-06-21 02:29:52,530 INFO [train.py:996] (1/4) Epoch 5, batch 21050, loss[loss=0.1729, simple_loss=0.2412, pruned_loss=0.05233, over 16810.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3088, pruned_loss=0.08742, over 4254085.52 frames. ], batch size: 65, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:30:08,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=858234.0, ans=0.2 2023-06-21 02:30:35,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=858294.0, ans=0.125 2023-06-21 02:30:44,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=858294.0, ans=0.05 2023-06-21 02:30:57,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=858354.0, ans=0.0 2023-06-21 02:31:31,896 INFO [train.py:996] (1/4) Epoch 5, batch 21100, loss[loss=0.2331, simple_loss=0.2988, pruned_loss=0.08366, over 21457.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3053, pruned_loss=0.08651, over 4251457.51 frames. ], batch size: 132, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:31:50,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=15.0 2023-06-21 02:31:52,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=858534.0, ans=0.125 2023-06-21 02:32:23,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.646e+02 3.104e+02 3.741e+02 7.727e+02, threshold=6.208e+02, percent-clipped=4.0 2023-06-21 02:32:36,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=858654.0, ans=0.0 2023-06-21 02:32:50,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=858714.0, ans=0.125 2023-06-21 02:32:51,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858714.0, ans=0.1 2023-06-21 02:33:10,713 INFO [train.py:996] (1/4) Epoch 5, batch 21150, loss[loss=0.2436, simple_loss=0.298, pruned_loss=0.09457, over 21591.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3021, pruned_loss=0.08703, over 4252695.65 frames. ], batch size: 415, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:33:14,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=858774.0, ans=0.0 2023-06-21 02:34:13,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=12.0 2023-06-21 02:34:49,178 INFO [train.py:996] (1/4) Epoch 5, batch 21200, loss[loss=0.2775, simple_loss=0.3206, pruned_loss=0.1172, over 21281.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.2986, pruned_loss=0.08615, over 4253699.39 frames. ], batch size: 471, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:35:12,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=859134.0, ans=0.2 2023-06-21 02:35:13,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=859134.0, ans=0.015 2023-06-21 02:35:23,949 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-21 02:35:42,258 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.556e+02 2.983e+02 3.477e+02 7.677e+02, threshold=5.965e+02, percent-clipped=1.0 2023-06-21 02:36:03,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=859314.0, ans=0.0 2023-06-21 02:36:30,236 INFO [train.py:996] (1/4) Epoch 5, batch 21250, loss[loss=0.2202, simple_loss=0.2827, pruned_loss=0.07889, over 21367.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.2985, pruned_loss=0.08653, over 4264820.61 frames. ], batch size: 194, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:36:32,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-21 02:36:46,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=859434.0, ans=0.0 2023-06-21 02:37:21,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=859494.0, ans=0.1 2023-06-21 02:38:09,621 INFO [train.py:996] (1/4) Epoch 5, batch 21300, loss[loss=0.3037, simple_loss=0.3563, pruned_loss=0.1256, over 21695.00 frames. 
], tot_loss[loss=0.2429, simple_loss=0.307, pruned_loss=0.08941, over 4254401.22 frames. ], batch size: 473, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:38:28,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=859734.0, ans=0.0 2023-06-21 02:38:28,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-21 02:38:40,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=859794.0, ans=0.2 2023-06-21 02:39:01,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=859794.0, ans=0.125 2023-06-21 02:39:02,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 2.969e+02 3.329e+02 4.486e+02 8.975e+02, threshold=6.657e+02, percent-clipped=6.0 2023-06-21 02:39:44,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-21 02:39:50,015 INFO [train.py:996] (1/4) Epoch 5, batch 21350, loss[loss=0.2215, simple_loss=0.3221, pruned_loss=0.06043, over 20870.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3104, pruned_loss=0.08942, over 4254762.32 frames. ], batch size: 607, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:39:51,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=859974.0, ans=0.2 2023-06-21 02:40:07,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=860034.0, ans=0.125 2023-06-21 02:40:30,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=860094.0, ans=0.125 2023-06-21 02:41:26,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.21 vs. limit=10.0 2023-06-21 02:41:29,860 INFO [train.py:996] (1/4) Epoch 5, batch 21400, loss[loss=0.2637, simple_loss=0.3202, pruned_loss=0.1036, over 20103.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3155, pruned_loss=0.09015, over 4259600.46 frames. ], batch size: 703, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:41:45,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=860334.0, ans=0.0 2023-06-21 02:41:56,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-21 02:42:10,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. 
limit=15.0 2023-06-21 02:42:16,749 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:42:22,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.768e+02 3.163e+02 3.686e+02 6.049e+02, threshold=6.326e+02, percent-clipped=0.0 2023-06-21 02:42:26,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=860454.0, ans=0.1 2023-06-21 02:43:01,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=860514.0, ans=0.125 2023-06-21 02:43:01,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=860514.0, ans=0.1 2023-06-21 02:43:09,464 INFO [train.py:996] (1/4) Epoch 5, batch 21450, loss[loss=0.2492, simple_loss=0.3063, pruned_loss=0.09608, over 20125.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3189, pruned_loss=0.09206, over 4267810.03 frames. ], batch size: 703, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:43:25,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=860634.0, ans=0.125 2023-06-21 02:44:05,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-21 02:44:48,126 INFO [train.py:996] (1/4) Epoch 5, batch 21500, loss[loss=0.235, simple_loss=0.2999, pruned_loss=0.08508, over 15316.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3173, pruned_loss=0.09351, over 4266860.95 frames. ], batch size: 60, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:45:08,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=860934.0, ans=0.1 2023-06-21 02:45:12,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=860934.0, ans=0.07 2023-06-21 02:45:40,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 3.006e+02 3.483e+02 4.242e+02 6.315e+02, threshold=6.966e+02, percent-clipped=0.0 2023-06-21 02:45:40,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=860994.0, ans=0.125 2023-06-21 02:46:12,263 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:46:26,520 INFO [train.py:996] (1/4) Epoch 5, batch 21550, loss[loss=0.2038, simple_loss=0.2739, pruned_loss=0.06689, over 21631.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3096, pruned_loss=0.08936, over 4252491.39 frames. 
], batch size: 391, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:46:42,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=861174.0, ans=0.0 2023-06-21 02:47:07,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861294.0, ans=0.1 2023-06-21 02:47:43,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=861354.0, ans=0.125 2023-06-21 02:47:59,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=861414.0, ans=0.0 2023-06-21 02:48:03,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=861414.0, ans=0.125 2023-06-21 02:48:06,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=861474.0, ans=0.125 2023-06-21 02:48:07,445 INFO [train.py:996] (1/4) Epoch 5, batch 21600, loss[loss=0.2067, simple_loss=0.2949, pruned_loss=0.0592, over 21676.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3082, pruned_loss=0.0891, over 4253950.63 frames. ], batch size: 247, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:48:24,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=861474.0, ans=0.2 2023-06-21 02:48:47,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-21 02:48:51,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=861594.0, ans=0.2 2023-06-21 02:48:51,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=861594.0, ans=0.125 2023-06-21 02:49:01,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=861594.0, ans=0.125 2023-06-21 02:49:05,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.906e+02 3.367e+02 4.103e+02 7.141e+02, threshold=6.734e+02, percent-clipped=1.0 2023-06-21 02:49:08,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-21 02:49:16,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-21 02:49:46,656 INFO [train.py:996] (1/4) Epoch 5, batch 21650, loss[loss=0.2918, simple_loss=0.3774, pruned_loss=0.1031, over 21849.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3126, pruned_loss=0.08656, over 4256285.09 frames. ], batch size: 371, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:49:58,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=861774.0, ans=0.125 2023-06-21 02:50:21,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=22.5 2023-06-21 02:50:38,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=861894.0, ans=0.0 2023-06-21 02:50:45,866 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:50:57,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=861954.0, ans=0.2 2023-06-21 02:51:03,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=861954.0, ans=0.1 2023-06-21 02:51:04,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-21 02:51:26,175 INFO [train.py:996] (1/4) Epoch 5, batch 21700, loss[loss=0.2222, simple_loss=0.2817, pruned_loss=0.0814, over 21461.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3103, pruned_loss=0.08448, over 4255028.67 frames. ], batch size: 195, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:51:51,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=862134.0, ans=0.125 2023-06-21 02:51:56,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862134.0, ans=0.125 2023-06-21 02:51:56,917 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 02:51:59,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=862134.0, ans=0.125 2023-06-21 02:52:09,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=862194.0, ans=0.125 2023-06-21 02:52:20,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-21 02:52:22,494 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.607e+02 2.964e+02 3.424e+02 5.516e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-21 02:52:36,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862254.0, ans=0.1 2023-06-21 02:53:00,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-21 02:53:10,464 INFO [train.py:996] (1/4) Epoch 5, batch 21750, loss[loss=0.2121, simple_loss=0.2591, pruned_loss=0.08255, over 20712.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3061, pruned_loss=0.08449, over 4252019.56 frames. 
], batch size: 608, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:53:37,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=862434.0, ans=0.0 2023-06-21 02:53:48,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=862494.0, ans=0.0 2023-06-21 02:53:56,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862494.0, ans=0.1 2023-06-21 02:54:23,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-21 02:54:43,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=862674.0, ans=0.02 2023-06-21 02:54:49,588 INFO [train.py:996] (1/4) Epoch 5, batch 21800, loss[loss=0.268, simple_loss=0.3214, pruned_loss=0.1073, over 21357.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3067, pruned_loss=0.08567, over 4248988.09 frames. ], batch size: 473, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:55:27,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=862734.0, ans=0.0 2023-06-21 02:55:43,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.757e+02 3.112e+02 3.604e+02 5.308e+02, threshold=6.224e+02, percent-clipped=0.0 2023-06-21 02:55:45,710 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:56:10,666 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:56:16,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=862914.0, ans=0.5 2023-06-21 02:56:29,357 INFO [train.py:996] (1/4) Epoch 5, batch 21850, loss[loss=0.2149, simple_loss=0.2777, pruned_loss=0.07603, over 21429.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.311, pruned_loss=0.08543, over 4251234.66 frames. ], batch size: 212, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:56:30,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-21 02:56:32,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862974.0, ans=0.1 2023-06-21 02:57:05,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=863034.0, ans=0.0 2023-06-21 02:57:16,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=863094.0, ans=0.125 2023-06-21 02:57:53,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=863214.0, ans=0.2 2023-06-21 02:58:08,164 INFO [train.py:996] (1/4) Epoch 5, batch 21900, loss[loss=0.2561, simple_loss=0.3107, pruned_loss=0.1007, over 21671.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3115, pruned_loss=0.08743, over 4252740.24 frames. 
], batch size: 414, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:58:13,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=863274.0, ans=0.125 2023-06-21 02:58:56,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-21 02:59:06,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.823e+02 3.229e+02 3.710e+02 5.018e+02, threshold=6.457e+02, percent-clipped=0.0 2023-06-21 02:59:15,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=863454.0, ans=0.1 2023-06-21 02:59:22,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=863454.0, ans=0.0 2023-06-21 02:59:27,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=863514.0, ans=0.125 2023-06-21 02:59:46,630 INFO [train.py:996] (1/4) Epoch 5, batch 21950, loss[loss=0.2268, simple_loss=0.2923, pruned_loss=0.08066, over 21768.00 frames. ], tot_loss[loss=0.238, simple_loss=0.305, pruned_loss=0.08547, over 4256084.67 frames. ], batch size: 351, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:00:33,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.50 vs. limit=6.0 2023-06-21 03:00:36,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=863694.0, ans=0.125 2023-06-21 03:01:22,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-21 03:01:27,626 INFO [train.py:996] (1/4) Epoch 5, batch 22000, loss[loss=0.1876, simple_loss=0.2686, pruned_loss=0.05327, over 21710.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3002, pruned_loss=0.08329, over 4256626.90 frames. ], batch size: 333, lr: 6.08e-03, grad_scale: 32.0 2023-06-21 03:01:28,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=863874.0, ans=0.0 2023-06-21 03:01:31,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=863874.0, ans=0.125 2023-06-21 03:02:24,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=863994.0, ans=0.1 2023-06-21 03:02:29,957 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.512e+02 2.928e+02 3.420e+02 5.826e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-21 03:02:46,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.24 vs. limit=10.0 2023-06-21 03:02:50,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-21 03:03:08,611 INFO [train.py:996] (1/4) Epoch 5, batch 22050, loss[loss=0.1604, simple_loss=0.2393, pruned_loss=0.04075, over 21494.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3057, pruned_loss=0.08511, over 4259706.20 frames. 
], batch size: 212, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:04:05,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-21 03:04:09,807 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:04:37,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-21 03:04:52,756 INFO [train.py:996] (1/4) Epoch 5, batch 22100, loss[loss=0.2798, simple_loss=0.3481, pruned_loss=0.1057, over 21744.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3162, pruned_loss=0.09019, over 4259688.54 frames. ], batch size: 414, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:05:21,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=864534.0, ans=0.125 2023-06-21 03:05:27,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=864594.0, ans=0.0 2023-06-21 03:05:37,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=864594.0, ans=0.125 2023-06-21 03:05:42,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=864594.0, ans=0.2 2023-06-21 03:05:48,621 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.305e+02 3.693e+02 4.258e+02 6.395e+02, threshold=7.386e+02, percent-clipped=3.0 2023-06-21 03:06:31,556 INFO [train.py:996] (1/4) Epoch 5, batch 22150, loss[loss=0.2778, simple_loss=0.3432, pruned_loss=0.1062, over 21384.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3201, pruned_loss=0.09212, over 4265800.43 frames. ], batch size: 548, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:06:46,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-21 03:06:58,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=864834.0, ans=0.125 2023-06-21 03:07:05,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=864834.0, ans=0.0 2023-06-21 03:07:07,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=864894.0, ans=0.0 2023-06-21 03:07:27,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-21 03:07:33,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=864954.0, ans=0.125 2023-06-21 03:08:10,572 INFO [train.py:996] (1/4) Epoch 5, batch 22200, loss[loss=0.2343, simple_loss=0.3118, pruned_loss=0.07843, over 21190.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3211, pruned_loss=0.09285, over 4274941.98 frames. 
], batch size: 143, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:08:41,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=865134.0, ans=0.05 2023-06-21 03:09:09,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 3.019e+02 3.347e+02 3.956e+02 6.093e+02, threshold=6.693e+02, percent-clipped=0.0 2023-06-21 03:09:14,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865254.0, ans=0.1 2023-06-21 03:09:31,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=865314.0, ans=0.125 2023-06-21 03:09:42,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=865314.0, ans=0.125 2023-06-21 03:09:57,875 INFO [train.py:996] (1/4) Epoch 5, batch 22250, loss[loss=0.2574, simple_loss=0.3304, pruned_loss=0.09224, over 21783.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3279, pruned_loss=0.09558, over 4283682.56 frames. ], batch size: 247, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:10:15,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=865434.0, ans=0.2 2023-06-21 03:10:18,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=865434.0, ans=0.05 2023-06-21 03:10:40,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=865494.0, ans=0.0 2023-06-21 03:10:48,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=865494.0, ans=0.035 2023-06-21 03:11:19,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=865614.0, ans=0.0 2023-06-21 03:11:32,222 INFO [train.py:996] (1/4) Epoch 5, batch 22300, loss[loss=0.248, simple_loss=0.3422, pruned_loss=0.07692, over 21702.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3305, pruned_loss=0.09813, over 4286773.09 frames. 
], batch size: 263, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:11:38,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=865674.0, ans=0.0 2023-06-21 03:11:39,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=865674.0, ans=0.95 2023-06-21 03:11:41,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=865674.0, ans=0.125 2023-06-21 03:11:52,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=865734.0, ans=0.125 2023-06-21 03:11:58,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=865734.0, ans=0.125 2023-06-21 03:12:01,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=865734.0, ans=0.0 2023-06-21 03:12:03,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865734.0, ans=0.1 2023-06-21 03:12:27,217 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 3.190e+02 3.753e+02 5.122e+02 1.002e+03, threshold=7.506e+02, percent-clipped=11.0 2023-06-21 03:12:33,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=865854.0, ans=0.2 2023-06-21 03:12:38,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=865854.0, ans=0.0 2023-06-21 03:12:51,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=865914.0, ans=0.2 2023-06-21 03:13:05,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=865914.0, ans=0.5 2023-06-21 03:13:14,443 INFO [train.py:996] (1/4) Epoch 5, batch 22350, loss[loss=0.2671, simple_loss=0.3266, pruned_loss=0.1038, over 21863.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3283, pruned_loss=0.09853, over 4294327.97 frames. ], batch size: 298, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:13:21,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=865974.0, ans=0.0 2023-06-21 03:13:23,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=865974.0, ans=10.0 2023-06-21 03:13:30,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=866034.0, ans=0.0 2023-06-21 03:13:54,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=866094.0, ans=0.125 2023-06-21 03:14:18,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=866154.0, ans=0.125 2023-06-21 03:14:53,820 INFO [train.py:996] (1/4) Epoch 5, batch 22400, loss[loss=0.2347, simple_loss=0.3434, pruned_loss=0.06293, over 20807.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3239, pruned_loss=0.09479, over 4295246.89 frames. 
], batch size: 607, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:15:15,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-21 03:15:22,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.85 vs. limit=10.0 2023-06-21 03:15:35,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=866394.0, ans=0.125 2023-06-21 03:15:44,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-06-21 03:15:45,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.706e+02 3.089e+02 3.768e+02 7.797e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-21 03:15:47,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-21 03:16:32,963 INFO [train.py:996] (1/4) Epoch 5, batch 22450, loss[loss=0.2275, simple_loss=0.2833, pruned_loss=0.08587, over 21243.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3191, pruned_loss=0.09445, over 4291626.63 frames. ], batch size: 144, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:17:14,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=866694.0, ans=0.125 2023-06-21 03:18:14,411 INFO [train.py:996] (1/4) Epoch 5, batch 22500, loss[loss=0.2094, simple_loss=0.2741, pruned_loss=0.07236, over 21601.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3139, pruned_loss=0.09393, over 4292304.78 frames. ], batch size: 298, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:18:37,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-21 03:18:45,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-21 03:19:04,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-21 03:19:05,429 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.870e+02 3.254e+02 4.012e+02 8.224e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-21 03:19:06,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-21 03:19:12,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=867054.0, ans=0.125 2023-06-21 03:19:53,744 INFO [train.py:996] (1/4) Epoch 5, batch 22550, loss[loss=0.2844, simple_loss=0.3592, pruned_loss=0.1048, over 22046.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3177, pruned_loss=0.09339, over 4296044.43 frames. 
], batch size: 113, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:19:54,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=867174.0, ans=0.05 2023-06-21 03:20:00,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=867174.0, ans=0.1 2023-06-21 03:20:31,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=867294.0, ans=0.2 2023-06-21 03:20:32,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=867294.0, ans=0.125 2023-06-21 03:21:19,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=867414.0, ans=0.07 2023-06-21 03:21:22,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=867414.0, ans=0.07 2023-06-21 03:21:26,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=867414.0, ans=0.2 2023-06-21 03:21:30,661 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:21:40,199 INFO [train.py:996] (1/4) Epoch 5, batch 22600, loss[loss=0.32, simple_loss=0.3953, pruned_loss=0.1223, over 21493.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.321, pruned_loss=0.09387, over 4288665.55 frames. ], batch size: 471, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:21:55,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=867534.0, ans=0.0 2023-06-21 03:22:40,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.062e+02 3.535e+02 4.633e+02 8.415e+02, threshold=7.070e+02, percent-clipped=6.0 2023-06-21 03:23:18,517 INFO [train.py:996] (1/4) Epoch 5, batch 22650, loss[loss=0.2622, simple_loss=0.3109, pruned_loss=0.1067, over 21778.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3187, pruned_loss=0.09389, over 4286470.99 frames. ], batch size: 118, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:23:43,914 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:23:56,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-21 03:24:57,499 INFO [train.py:996] (1/4) Epoch 5, batch 22700, loss[loss=0.2185, simple_loss=0.2716, pruned_loss=0.08272, over 21764.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3117, pruned_loss=0.0929, over 4281653.09 frames. ], batch size: 318, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:25:11,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. 
limit=12.0 2023-06-21 03:25:26,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=868134.0, ans=0.07 2023-06-21 03:25:59,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.710e+02 3.113e+02 3.866e+02 5.786e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-21 03:26:08,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-21 03:26:18,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868314.0, ans=0.125 2023-06-21 03:26:23,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=868314.0, ans=0.2 2023-06-21 03:26:30,974 INFO [train.py:996] (1/4) Epoch 5, batch 22750, loss[loss=0.2875, simple_loss=0.3542, pruned_loss=0.1103, over 21500.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3122, pruned_loss=0.09425, over 4285813.24 frames. ], batch size: 131, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:26:38,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=15.0 2023-06-21 03:27:10,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.85 vs. limit=10.0 2023-06-21 03:27:13,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=22.5 2023-06-21 03:27:40,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=868554.0, ans=0.125 2023-06-21 03:27:52,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.61 vs. limit=10.0 2023-06-21 03:28:15,786 INFO [train.py:996] (1/4) Epoch 5, batch 22800, loss[loss=0.2448, simple_loss=0.3006, pruned_loss=0.09448, over 21643.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3162, pruned_loss=0.09659, over 4286024.77 frames. ], batch size: 263, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:28:16,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=868674.0, ans=0.2 2023-06-21 03:28:44,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=868734.0, ans=0.125 2023-06-21 03:29:17,220 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 2.823e+02 3.345e+02 3.974e+02 6.068e+02, threshold=6.691e+02, percent-clipped=0.0 2023-06-21 03:29:27,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=868854.0, ans=0.125 2023-06-21 03:29:49,099 INFO [train.py:996] (1/4) Epoch 5, batch 22850, loss[loss=0.286, simple_loss=0.3241, pruned_loss=0.124, over 21224.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3135, pruned_loss=0.09574, over 4275687.68 frames. 
], batch size: 471, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:30:02,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=868974.0, ans=0.2 2023-06-21 03:30:04,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=868974.0, ans=0.2 2023-06-21 03:30:08,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=869034.0, ans=0.125 2023-06-21 03:30:52,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=869094.0, ans=10.0 2023-06-21 03:31:13,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=869214.0, ans=0.1 2023-06-21 03:31:16,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=869214.0, ans=0.0 2023-06-21 03:31:35,936 INFO [train.py:996] (1/4) Epoch 5, batch 22900, loss[loss=0.1603, simple_loss=0.2167, pruned_loss=0.05196, over 16138.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3177, pruned_loss=0.09516, over 4269763.65 frames. ], batch size: 60, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:31:49,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.63 vs. limit=6.0 2023-06-21 03:32:02,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=869334.0, ans=0.125 2023-06-21 03:32:39,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.436e+02 3.293e+02 3.917e+02 5.124e+02 7.831e+02, threshold=7.834e+02, percent-clipped=10.0 2023-06-21 03:32:52,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=869454.0, ans=0.2 2023-06-21 03:33:10,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-21 03:33:15,481 INFO [train.py:996] (1/4) Epoch 5, batch 22950, loss[loss=0.2439, simple_loss=0.3758, pruned_loss=0.05599, over 20823.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3332, pruned_loss=0.09335, over 4269316.29 frames. ], batch size: 607, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:33:49,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=869634.0, ans=0.05 2023-06-21 03:34:52,755 INFO [train.py:996] (1/4) Epoch 5, batch 23000, loss[loss=0.2567, simple_loss=0.319, pruned_loss=0.09714, over 21262.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3314, pruned_loss=0.09077, over 4275099.50 frames. ], batch size: 159, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:34:56,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=869874.0, ans=0.125 2023-06-21 03:34:57,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.04 vs. 
limit=22.5 2023-06-21 03:35:04,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=869874.0, ans=0.04949747468305833 2023-06-21 03:35:37,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=869994.0, ans=0.0 2023-06-21 03:35:46,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=869994.0, ans=0.125 2023-06-21 03:35:56,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.793e+02 3.379e+02 3.965e+02 7.564e+02, threshold=6.759e+02, percent-clipped=0.0 2023-06-21 03:36:11,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=870054.0, ans=0.0 2023-06-21 03:36:43,308 INFO [train.py:996] (1/4) Epoch 5, batch 23050, loss[loss=0.283, simple_loss=0.3472, pruned_loss=0.1094, over 21955.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3326, pruned_loss=0.09326, over 4276407.68 frames. ], batch size: 372, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:37:05,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-21 03:37:55,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=870414.0, ans=0.125 2023-06-21 03:37:57,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=870414.0, ans=0.125 2023-06-21 03:38:22,812 INFO [train.py:996] (1/4) Epoch 5, batch 23100, loss[loss=0.2334, simple_loss=0.2891, pruned_loss=0.08888, over 20159.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3275, pruned_loss=0.09356, over 4271157.78 frames. ], batch size: 703, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:38:23,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=870474.0, ans=0.125 2023-06-21 03:38:39,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.92 vs. limit=10.0 2023-06-21 03:38:53,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=870534.0, ans=0.2 2023-06-21 03:39:07,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=870594.0, ans=0.0 2023-06-21 03:39:20,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.864e+02 3.390e+02 4.261e+02 7.523e+02, threshold=6.780e+02, percent-clipped=3.0 2023-06-21 03:39:39,430 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:40:00,802 INFO [train.py:996] (1/4) Epoch 5, batch 23150, loss[loss=0.2986, simple_loss=0.3447, pruned_loss=0.1263, over 21810.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.32, pruned_loss=0.0921, over 4265564.87 frames. 
], batch size: 441, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:40:21,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=870834.0, ans=10.0 2023-06-21 03:41:18,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=871014.0, ans=0.0 2023-06-21 03:41:20,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=871014.0, ans=0.125 2023-06-21 03:41:28,439 INFO [train.py:996] (1/4) Epoch 5, batch 23200, loss[loss=0.2436, simple_loss=0.3126, pruned_loss=0.08731, over 21907.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3189, pruned_loss=0.09311, over 4267785.37 frames. ], batch size: 371, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:41:40,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=871074.0, ans=0.125 2023-06-21 03:41:50,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=871074.0, ans=0.05 2023-06-21 03:42:01,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=871134.0, ans=0.0 2023-06-21 03:42:20,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.37 vs. limit=15.0 2023-06-21 03:42:26,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. limit=22.5 2023-06-21 03:42:29,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.864e+02 3.235e+02 3.730e+02 5.431e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-21 03:42:58,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=22.5 2023-06-21 03:43:11,483 INFO [train.py:996] (1/4) Epoch 5, batch 23250, loss[loss=0.3094, simple_loss=0.353, pruned_loss=0.1329, over 21685.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.319, pruned_loss=0.09497, over 4281898.19 frames. ], batch size: 507, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:44:43,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=871614.0, ans=0.125 2023-06-21 03:44:57,859 INFO [train.py:996] (1/4) Epoch 5, batch 23300, loss[loss=0.2861, simple_loss=0.3802, pruned_loss=0.09601, over 21741.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3285, pruned_loss=0.09756, over 4280293.10 frames. ], batch size: 247, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:45:02,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=871674.0, ans=0.0 2023-06-21 03:45:42,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-21 03:45:52,628 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.979e+02 3.447e+02 3.938e+02 6.103e+02, threshold=6.894e+02, percent-clipped=0.0 2023-06-21 03:46:33,338 INFO [train.py:996] (1/4) Epoch 5, batch 23350, loss[loss=0.2656, simple_loss=0.3453, pruned_loss=0.09294, over 21715.00 frames. 
], tot_loss[loss=0.2617, simple_loss=0.3321, pruned_loss=0.09562, over 4285538.67 frames. ], batch size: 351, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:46:40,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=871974.0, ans=0.2 2023-06-21 03:46:48,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.34 vs. limit=15.0 2023-06-21 03:46:49,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=872034.0, ans=0.125 2023-06-21 03:48:11,182 INFO [train.py:996] (1/4) Epoch 5, batch 23400, loss[loss=0.2715, simple_loss=0.3317, pruned_loss=0.1057, over 21777.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3245, pruned_loss=0.09102, over 4282846.38 frames. ], batch size: 441, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:48:36,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-21 03:48:48,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.27 vs. limit=15.0 2023-06-21 03:49:17,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.664e+02 3.181e+02 4.182e+02 6.937e+02, threshold=6.362e+02, percent-clipped=1.0 2023-06-21 03:49:30,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=872454.0, ans=0.125 2023-06-21 03:49:52,796 INFO [train.py:996] (1/4) Epoch 5, batch 23450, loss[loss=0.3004, simple_loss=0.3555, pruned_loss=0.1226, over 21793.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3258, pruned_loss=0.09413, over 4281416.76 frames. ], batch size: 332, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:51:11,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=872754.0, ans=0.0 2023-06-21 03:51:14,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=872814.0, ans=0.125 2023-06-21 03:51:26,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=872814.0, ans=0.0 2023-06-21 03:51:30,829 INFO [train.py:996] (1/4) Epoch 5, batch 23500, loss[loss=0.2643, simple_loss=0.3294, pruned_loss=0.09961, over 21858.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3255, pruned_loss=0.0956, over 4287700.47 frames. ], batch size: 107, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:51:57,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=872934.0, ans=0.125 2023-06-21 03:52:38,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 3.046e+02 3.693e+02 4.776e+02 9.117e+02, threshold=7.385e+02, percent-clipped=5.0 2023-06-21 03:53:05,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-21 03:53:08,201 INFO [train.py:996] (1/4) Epoch 5, batch 23550, loss[loss=0.227, simple_loss=0.2785, pruned_loss=0.08777, over 21189.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3217, pruned_loss=0.095, over 4277752.04 frames. 
], batch size: 159, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:53:42,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-21 03:54:08,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-21 03:54:18,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. limit=15.0 2023-06-21 03:54:46,888 INFO [train.py:996] (1/4) Epoch 5, batch 23600, loss[loss=0.2492, simple_loss=0.2891, pruned_loss=0.1046, over 19916.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3198, pruned_loss=0.09428, over 4284251.32 frames. ], batch size: 702, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:55:02,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=873474.0, ans=0.1 2023-06-21 03:55:08,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=873534.0, ans=0.125 2023-06-21 03:55:12,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=873534.0, ans=0.125 2023-06-21 03:55:55,533 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.629e+02 3.088e+02 3.713e+02 7.100e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 03:56:00,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873654.0, ans=0.1 2023-06-21 03:56:22,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5 2023-06-21 03:56:23,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=873714.0, ans=0.125 2023-06-21 03:56:31,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-21 03:56:32,161 INFO [train.py:996] (1/4) Epoch 5, batch 23650, loss[loss=0.241, simple_loss=0.3172, pruned_loss=0.08238, over 21427.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3179, pruned_loss=0.09116, over 4278160.04 frames. ], batch size: 194, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:57:38,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-21 03:58:05,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=874014.0, ans=0.125 2023-06-21 03:58:10,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=874014.0, ans=0.125 2023-06-21 03:58:11,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-21 03:58:13,464 INFO [train.py:996] (1/4) Epoch 5, batch 23700, loss[loss=0.244, simple_loss=0.3138, pruned_loss=0.08707, over 21313.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3223, pruned_loss=0.09134, over 4274433.61 frames. 
], batch size: 176, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:58:30,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=874074.0, ans=0.125 2023-06-21 03:58:30,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874074.0, ans=0.1 2023-06-21 03:58:44,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=874134.0, ans=0.1 2023-06-21 03:59:17,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.032e+02 3.536e+02 4.190e+02 7.050e+02, threshold=7.071e+02, percent-clipped=3.0 2023-06-21 03:59:38,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=874314.0, ans=0.0 2023-06-21 03:59:53,568 INFO [train.py:996] (1/4) Epoch 5, batch 23750, loss[loss=0.2366, simple_loss=0.3315, pruned_loss=0.07088, over 21265.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3252, pruned_loss=0.09254, over 4276081.74 frames. ], batch size: 549, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 04:00:10,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=874374.0, ans=0.0 2023-06-21 04:00:36,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=874434.0, ans=0.2 2023-06-21 04:01:08,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-21 04:01:26,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=874614.0, ans=0.0 2023-06-21 04:01:38,905 INFO [train.py:996] (1/4) Epoch 5, batch 23800, loss[loss=0.274, simple_loss=0.3871, pruned_loss=0.08042, over 20792.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3223, pruned_loss=0.08974, over 4268268.17 frames. ], batch size: 607, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:02:26,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=874794.0, ans=0.04949747468305833 2023-06-21 04:02:39,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=874794.0, ans=0.125 2023-06-21 04:02:43,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.704e+02 3.208e+02 4.045e+02 9.409e+02, threshold=6.416e+02, percent-clipped=3.0 2023-06-21 04:02:49,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=874854.0, ans=0.2 2023-06-21 04:02:54,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=874854.0, ans=0.125 2023-06-21 04:03:29,767 INFO [train.py:996] (1/4) Epoch 5, batch 23850, loss[loss=0.3453, simple_loss=0.4, pruned_loss=0.1453, over 21482.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3325, pruned_loss=0.09276, over 4274978.19 frames. 
], batch size: 471, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:03:36,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=874974.0, ans=0.1 2023-06-21 04:04:20,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=875094.0, ans=0.0 2023-06-21 04:05:04,015 INFO [train.py:996] (1/4) Epoch 5, batch 23900, loss[loss=0.2546, simple_loss=0.3382, pruned_loss=0.08545, over 21468.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3411, pruned_loss=0.09662, over 4275887.50 frames. ], batch size: 211, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:05:35,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875334.0, ans=0.1 2023-06-21 04:05:40,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=875394.0, ans=0.0 2023-06-21 04:05:53,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=22.5 2023-06-21 04:06:02,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.113e+02 3.559e+02 4.462e+02 8.067e+02, threshold=7.118e+02, percent-clipped=8.0 2023-06-21 04:06:18,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=875454.0, ans=0.0 2023-06-21 04:06:34,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=875514.0, ans=0.125 2023-06-21 04:06:41,970 INFO [train.py:996] (1/4) Epoch 5, batch 23950, loss[loss=0.2537, simple_loss=0.3151, pruned_loss=0.09615, over 21568.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3354, pruned_loss=0.09673, over 4256022.94 frames. ], batch size: 263, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:07:17,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875694.0, ans=0.1 2023-06-21 04:07:41,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=875694.0, ans=0.125 2023-06-21 04:08:18,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=875814.0, ans=0.1 2023-06-21 04:08:21,269 INFO [train.py:996] (1/4) Epoch 5, batch 24000, loss[loss=0.3101, simple_loss=0.3695, pruned_loss=0.1253, over 21403.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3345, pruned_loss=0.09825, over 4261388.85 frames. ], batch size: 471, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:08:21,270 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 04:08:38,091 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2683, simple_loss=0.3693, pruned_loss=0.08367, over 1796401.00 frames. 
2023-06-21 04:08:38,092 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 04:08:40,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=875874.0, ans=0.2 2023-06-21 04:09:34,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=875994.0, ans=0.125 2023-06-21 04:09:43,017 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.178e+02 3.721e+02 4.593e+02 6.442e+02, threshold=7.441e+02, percent-clipped=0.0 2023-06-21 04:10:12,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876174.0, ans=0.1 2023-06-21 04:10:18,500 INFO [train.py:996] (1/4) Epoch 5, batch 24050, loss[loss=0.3185, simple_loss=0.3842, pruned_loss=0.1264, over 21256.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3358, pruned_loss=0.09861, over 4262246.29 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:10:33,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-21 04:10:55,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=876234.0, ans=0.125 2023-06-21 04:10:55,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-21 04:10:58,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=876294.0, ans=0.125 2023-06-21 04:11:21,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-21 04:11:38,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=12.0 2023-06-21 04:11:57,818 INFO [train.py:996] (1/4) Epoch 5, batch 24100, loss[loss=0.3124, simple_loss=0.3884, pruned_loss=0.1182, over 21615.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3365, pruned_loss=0.09766, over 4267411.03 frames. ], batch size: 414, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:12:10,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=876474.0, ans=0.125 2023-06-21 04:12:15,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 04:12:57,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.915e+02 3.291e+02 4.001e+02 6.877e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 04:12:57,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=876654.0, ans=0.0 2023-06-21 04:13:02,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=876654.0, ans=0.125 2023-06-21 04:13:31,042 INFO [train.py:996] (1/4) Epoch 5, batch 24150, loss[loss=0.2625, simple_loss=0.3182, pruned_loss=0.1034, over 21665.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3359, pruned_loss=0.09955, over 4274081.51 frames. 
], batch size: 263, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:14:16,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=876894.0, ans=0.1 2023-06-21 04:14:40,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=876954.0, ans=0.125 2023-06-21 04:14:51,794 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:15:00,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=877014.0, ans=0.125 2023-06-21 04:15:01,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-06-21 04:15:11,004 INFO [train.py:996] (1/4) Epoch 5, batch 24200, loss[loss=0.2648, simple_loss=0.353, pruned_loss=0.08833, over 21744.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3371, pruned_loss=0.1, over 4273476.77 frames. ], batch size: 332, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:16:17,765 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.992e+02 3.434e+02 4.148e+02 5.774e+02, threshold=6.868e+02, percent-clipped=0.0 2023-06-21 04:16:58,519 INFO [train.py:996] (1/4) Epoch 5, batch 24250, loss[loss=0.1963, simple_loss=0.3004, pruned_loss=0.04606, over 21771.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3336, pruned_loss=0.09271, over 4275779.39 frames. ], batch size: 332, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:17:02,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=877374.0, ans=0.0 2023-06-21 04:17:54,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=877554.0, ans=0.125 2023-06-21 04:18:37,939 INFO [train.py:996] (1/4) Epoch 5, batch 24300, loss[loss=0.2202, simple_loss=0.2982, pruned_loss=0.07113, over 21797.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3265, pruned_loss=0.08606, over 4274665.56 frames. ], batch size: 351, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:18:54,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.40 vs. limit=15.0 2023-06-21 04:19:42,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.429e+02 3.041e+02 4.140e+02 6.830e+02, threshold=6.081e+02, percent-clipped=0.0 2023-06-21 04:20:02,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=877914.0, ans=10.0 2023-06-21 04:20:07,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-21 04:20:09,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-21 04:20:20,904 INFO [train.py:996] (1/4) Epoch 5, batch 24350, loss[loss=0.2855, simple_loss=0.3562, pruned_loss=0.1073, over 21852.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3233, pruned_loss=0.08672, over 4273416.86 frames. 
], batch size: 371, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:20:29,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877974.0, ans=0.1 2023-06-21 04:20:29,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=877974.0, ans=0.2 2023-06-21 04:20:59,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=878094.0, ans=0.125 2023-06-21 04:22:04,665 INFO [train.py:996] (1/4) Epoch 5, batch 24400, loss[loss=0.2546, simple_loss=0.3266, pruned_loss=0.09132, over 21683.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3277, pruned_loss=0.09047, over 4279940.05 frames. ], batch size: 247, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:22:20,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=878334.0, ans=0.0 2023-06-21 04:22:22,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-21 04:22:24,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=878334.0, ans=0.0 2023-06-21 04:23:06,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.330e+02 3.732e+02 4.584e+02 7.697e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-21 04:23:09,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=878454.0, ans=0.0 2023-06-21 04:23:30,544 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:23:33,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878514.0, ans=0.1 2023-06-21 04:23:44,670 INFO [train.py:996] (1/4) Epoch 5, batch 24450, loss[loss=0.2402, simple_loss=0.329, pruned_loss=0.07566, over 21598.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3309, pruned_loss=0.09205, over 4275897.62 frames. 
], batch size: 263, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:23:46,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=878574.0, ans=0.125 2023-06-21 04:23:49,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=878574.0, ans=0.07 2023-06-21 04:23:59,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878634.0, ans=0.1 2023-06-21 04:23:59,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=878634.0, ans=0.125 2023-06-21 04:24:04,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=878634.0, ans=0.125 2023-06-21 04:24:31,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=878694.0, ans=0.1 2023-06-21 04:24:48,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=878754.0, ans=0.1 2023-06-21 04:24:49,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=22.5 2023-06-21 04:25:16,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.96 vs. limit=12.0 2023-06-21 04:25:23,336 INFO [train.py:996] (1/4) Epoch 5, batch 24500, loss[loss=0.2313, simple_loss=0.2963, pruned_loss=0.08318, over 21263.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3313, pruned_loss=0.09266, over 4281032.91 frames. ], batch size: 608, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:25:31,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-21 04:26:31,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.806e+02 3.370e+02 4.048e+02 6.223e+02, threshold=6.740e+02, percent-clipped=0.0 2023-06-21 04:27:02,326 INFO [train.py:996] (1/4) Epoch 5, batch 24550, loss[loss=0.2729, simple_loss=0.3426, pruned_loss=0.1016, over 21633.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3328, pruned_loss=0.09426, over 4286936.68 frames. ], batch size: 389, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:28:26,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=879414.0, ans=0.125 2023-06-21 04:28:29,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=879414.0, ans=0.0 2023-06-21 04:28:39,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-21 04:28:42,273 INFO [train.py:996] (1/4) Epoch 5, batch 24600, loss[loss=0.2006, simple_loss=0.2632, pruned_loss=0.06902, over 21799.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3268, pruned_loss=0.09421, over 4287064.24 frames. 
], batch size: 118, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:28:54,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=879474.0, ans=0.125 2023-06-21 04:28:56,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-21 04:29:02,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=879534.0, ans=0.0 2023-06-21 04:29:28,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=879594.0, ans=0.1 2023-06-21 04:29:53,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.125e+02 3.627e+02 4.480e+02 7.581e+02, threshold=7.254e+02, percent-clipped=2.0 2023-06-21 04:30:21,426 INFO [train.py:996] (1/4) Epoch 5, batch 24650, loss[loss=0.2139, simple_loss=0.2766, pruned_loss=0.07566, over 21702.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.32, pruned_loss=0.09304, over 4280327.87 frames. ], batch size: 282, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:30:41,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=879834.0, ans=0.05 2023-06-21 04:31:11,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=879894.0, ans=0.125 2023-06-21 04:31:11,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-21 04:31:17,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=879894.0, ans=0.125 2023-06-21 04:31:34,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=879954.0, ans=0.1 2023-06-21 04:31:37,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=879954.0, ans=0.1 2023-06-21 04:31:39,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-21 04:32:07,112 INFO [train.py:996] (1/4) Epoch 5, batch 24700, loss[loss=0.2308, simple_loss=0.3002, pruned_loss=0.0807, over 21582.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3168, pruned_loss=0.09068, over 4280848.35 frames. ], batch size: 263, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:32:10,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=880074.0, ans=0.1 2023-06-21 04:32:35,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. 
limit=15.0 2023-06-21 04:32:56,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=880194.0, ans=0.0 2023-06-21 04:33:02,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=880194.0, ans=0.035 2023-06-21 04:33:08,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=880254.0, ans=0.125 2023-06-21 04:33:13,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.830e+02 3.084e+02 3.762e+02 5.962e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-21 04:33:24,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=880314.0, ans=10.0 2023-06-21 04:33:39,757 INFO [train.py:996] (1/4) Epoch 5, batch 24750, loss[loss=0.2113, simple_loss=0.2856, pruned_loss=0.06851, over 20760.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3114, pruned_loss=0.08783, over 4266003.04 frames. ], batch size: 607, lr: 6.02e-03, grad_scale: 8.0 2023-06-21 04:34:16,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=880434.0, ans=0.2 2023-06-21 04:35:18,237 INFO [train.py:996] (1/4) Epoch 5, batch 24800, loss[loss=0.2757, simple_loss=0.3346, pruned_loss=0.1084, over 21845.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3064, pruned_loss=0.08798, over 4265203.89 frames. ], batch size: 112, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:35:55,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=880734.0, ans=0.035 2023-06-21 04:36:31,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.672e+02 2.950e+02 3.460e+02 6.225e+02, threshold=5.900e+02, percent-clipped=1.0 2023-06-21 04:36:54,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-21 04:36:57,239 INFO [train.py:996] (1/4) Epoch 5, batch 24850, loss[loss=0.1894, simple_loss=0.2479, pruned_loss=0.06546, over 21207.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3067, pruned_loss=0.08921, over 4265955.30 frames. ], batch size: 143, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:36:57,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=880974.0, ans=0.1 2023-06-21 04:37:04,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=880974.0, ans=0.0 2023-06-21 04:38:01,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=881094.0, ans=0.2 2023-06-21 04:38:27,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=881214.0, ans=0.1 2023-06-21 04:38:36,662 INFO [train.py:996] (1/4) Epoch 5, batch 24900, loss[loss=0.2651, simple_loss=0.3309, pruned_loss=0.09964, over 21295.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3087, pruned_loss=0.08957, over 4265651.44 frames. 
], batch size: 176, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:39:17,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=881394.0, ans=0.04949747468305833 2023-06-21 04:39:51,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.066e+02 3.454e+02 4.012e+02 6.143e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-21 04:40:06,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=881514.0, ans=0.0 2023-06-21 04:40:22,241 INFO [train.py:996] (1/4) Epoch 5, batch 24950, loss[loss=0.2955, simple_loss=0.3524, pruned_loss=0.1193, over 21342.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3188, pruned_loss=0.09504, over 4272021.98 frames. ], batch size: 176, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:40:40,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=881574.0, ans=0.2 2023-06-21 04:42:02,077 INFO [train.py:996] (1/4) Epoch 5, batch 25000, loss[loss=0.277, simple_loss=0.3446, pruned_loss=0.1047, over 21738.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3233, pruned_loss=0.09597, over 4263896.80 frames. ], batch size: 351, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:42:15,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 04:42:44,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881934.0, ans=0.0 2023-06-21 04:42:47,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-21 04:43:07,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=882054.0, ans=0.0 2023-06-21 04:43:10,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.862e+02 3.406e+02 4.060e+02 6.504e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 04:43:46,034 INFO [train.py:996] (1/4) Epoch 5, batch 25050, loss[loss=0.2227, simple_loss=0.2877, pruned_loss=0.07886, over 21800.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3163, pruned_loss=0.09393, over 4266799.12 frames. ], batch size: 352, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:43:52,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=882174.0, ans=0.125 2023-06-21 04:44:45,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=882354.0, ans=0.125 2023-06-21 04:45:02,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=882354.0, ans=0.125 2023-06-21 04:45:20,794 INFO [train.py:996] (1/4) Epoch 5, batch 25100, loss[loss=0.2445, simple_loss=0.3314, pruned_loss=0.07877, over 21648.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3095, pruned_loss=0.092, over 4266567.22 frames. 
], batch size: 391, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:45:27,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=882474.0, ans=0.0 2023-06-21 04:46:12,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=882594.0, ans=0.0 2023-06-21 04:46:29,028 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.734e+02 3.137e+02 3.918e+02 6.199e+02, threshold=6.274e+02, percent-clipped=0.0 2023-06-21 04:46:51,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=882714.0, ans=0.2 2023-06-21 04:46:52,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-21 04:46:59,073 INFO [train.py:996] (1/4) Epoch 5, batch 25150, loss[loss=0.2131, simple_loss=0.3066, pruned_loss=0.05985, over 21896.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3151, pruned_loss=0.09048, over 4276946.58 frames. ], batch size: 316, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:48:01,398 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:48:37,286 INFO [train.py:996] (1/4) Epoch 5, batch 25200, loss[loss=0.2478, simple_loss=0.3416, pruned_loss=0.07701, over 19725.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3149, pruned_loss=0.08788, over 4278095.86 frames. ], batch size: 703, lr: 6.02e-03, grad_scale: 32.0 2023-06-21 04:48:43,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883074.0, ans=0.1 2023-06-21 04:48:50,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-21 04:49:20,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-21 04:49:46,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.669e+02 3.257e+02 4.012e+02 7.318e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-21 04:49:52,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-21 04:50:17,407 INFO [train.py:996] (1/4) Epoch 5, batch 25250, loss[loss=0.238, simple_loss=0.2985, pruned_loss=0.0888, over 21690.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3129, pruned_loss=0.08584, over 4270827.28 frames. ], batch size: 282, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:51:18,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=883554.0, ans=0.2 2023-06-21 04:51:57,235 INFO [train.py:996] (1/4) Epoch 5, batch 25300, loss[loss=0.2532, simple_loss=0.3325, pruned_loss=0.08694, over 21686.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3108, pruned_loss=0.0854, over 4262963.29 frames. 
], batch size: 441, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:51:57,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=883674.0, ans=0.0 2023-06-21 04:52:59,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=883854.0, ans=0.0 2023-06-21 04:53:01,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=883854.0, ans=0.125 2023-06-21 04:53:01,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-21 04:53:02,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.770e+02 3.140e+02 3.813e+02 4.907e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-21 04:53:21,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=883914.0, ans=0.2 2023-06-21 04:53:28,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=883914.0, ans=0.0 2023-06-21 04:53:33,694 INFO [train.py:996] (1/4) Epoch 5, batch 25350, loss[loss=0.2232, simple_loss=0.305, pruned_loss=0.0707, over 21640.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3128, pruned_loss=0.08551, over 4255981.03 frames. ], batch size: 414, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:53:42,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=883974.0, ans=0.0 2023-06-21 04:54:11,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884034.0, ans=0.1 2023-06-21 04:54:15,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=884094.0, ans=0.125 2023-06-21 04:55:07,840 INFO [train.py:996] (1/4) Epoch 5, batch 25400, loss[loss=0.2431, simple_loss=0.3092, pruned_loss=0.08847, over 21801.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3089, pruned_loss=0.08526, over 4257548.25 frames. ], batch size: 351, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:56:02,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.48 vs. limit=15.0 2023-06-21 04:56:11,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. 
limit=15.0 2023-06-21 04:56:15,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.763e+02 3.058e+02 3.669e+02 6.374e+02, threshold=6.116e+02, percent-clipped=1.0 2023-06-21 04:56:29,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884514.0, ans=0.1 2023-06-21 04:56:33,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=884514.0, ans=0.1 2023-06-21 04:56:40,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=884514.0, ans=0.125 2023-06-21 04:56:46,796 INFO [train.py:996] (1/4) Epoch 5, batch 25450, loss[loss=0.235, simple_loss=0.33, pruned_loss=0.07003, over 21843.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.31, pruned_loss=0.08697, over 4264035.97 frames. ], batch size: 351, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:58:09,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=884814.0, ans=0.125 2023-06-21 04:58:31,840 INFO [train.py:996] (1/4) Epoch 5, batch 25500, loss[loss=0.3328, simple_loss=0.4041, pruned_loss=0.1308, over 21446.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3104, pruned_loss=0.08371, over 4262866.51 frames. ], batch size: 507, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 04:58:37,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=884874.0, ans=0.125 2023-06-21 04:59:44,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.815e+02 3.207e+02 3.771e+02 6.756e+02, threshold=6.413e+02, percent-clipped=1.0 2023-06-21 04:59:51,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=885114.0, ans=0.2 2023-06-21 04:59:51,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=885114.0, ans=0.125 2023-06-21 05:00:13,093 INFO [train.py:996] (1/4) Epoch 5, batch 25550, loss[loss=0.2515, simple_loss=0.3507, pruned_loss=0.0761, over 21884.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3177, pruned_loss=0.08417, over 4258797.67 frames. ], batch size: 371, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 05:00:33,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-21 05:01:03,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.58 vs. limit=10.0 2023-06-21 05:01:15,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-21 05:01:15,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-21 05:01:18,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-21 05:02:02,398 INFO [train.py:996] (1/4) Epoch 5, batch 25600, loss[loss=0.3495, simple_loss=0.3994, pruned_loss=0.1498, over 21325.00 frames. 
], tot_loss[loss=0.2457, simple_loss=0.3211, pruned_loss=0.08515, over 4254233.43 frames. ], batch size: 507, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:02:45,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=885594.0, ans=0.125 2023-06-21 05:02:52,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=885594.0, ans=0.0 2023-06-21 05:03:03,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.775e+02 3.286e+02 3.783e+02 5.833e+02, threshold=6.573e+02, percent-clipped=0.0 2023-06-21 05:03:41,863 INFO [train.py:996] (1/4) Epoch 5, batch 25650, loss[loss=0.2475, simple_loss=0.3074, pruned_loss=0.09385, over 20759.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3219, pruned_loss=0.08834, over 4255086.59 frames. ], batch size: 607, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:03:42,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-21 05:03:45,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=12.0 2023-06-21 05:03:54,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=885774.0, ans=0.125 2023-06-21 05:03:54,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=885774.0, ans=0.0 2023-06-21 05:04:20,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=885894.0, ans=0.1 2023-06-21 05:04:32,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=885894.0, ans=0.0 2023-06-21 05:05:21,297 INFO [train.py:996] (1/4) Epoch 5, batch 25700, loss[loss=0.2043, simple_loss=0.2825, pruned_loss=0.06305, over 21618.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3195, pruned_loss=0.08963, over 4259123.22 frames. ], batch size: 263, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:05:26,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=886074.0, ans=0.0 2023-06-21 05:05:38,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-21 05:05:43,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=886134.0, ans=0.125 2023-06-21 05:05:52,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=886134.0, ans=0.2 2023-06-21 05:06:23,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.865e+02 3.376e+02 4.055e+02 7.604e+02, threshold=6.752e+02, percent-clipped=2.0 2023-06-21 05:06:49,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886314.0, ans=0.1 2023-06-21 05:06:54,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=886314.0, ans=0.125 2023-06-21 05:06:58,882 INFO [train.py:996] (1/4) Epoch 5, batch 25750, loss[loss=0.2626, simple_loss=0.3284, pruned_loss=0.09835, over 20054.00 frames. 
], tot_loss[loss=0.2553, simple_loss=0.3251, pruned_loss=0.0927, over 4262285.49 frames. ], batch size: 702, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:07:00,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-21 05:07:09,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=886374.0, ans=0.09899494936611666 2023-06-21 05:07:10,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=886374.0, ans=0.0 2023-06-21 05:07:28,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=886434.0, ans=0.125 2023-06-21 05:08:17,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=886554.0, ans=0.125 2023-06-21 05:08:32,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=886614.0, ans=0.2 2023-06-21 05:08:45,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=886674.0, ans=0.125 2023-06-21 05:08:46,833 INFO [train.py:996] (1/4) Epoch 5, batch 25800, loss[loss=0.3254, simple_loss=0.4091, pruned_loss=0.1209, over 19915.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3395, pruned_loss=0.09779, over 4263936.49 frames. ], batch size: 702, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:09:59,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.944e+02 3.581e+02 4.306e+02 8.254e+02, threshold=7.162e+02, percent-clipped=3.0 2023-06-21 05:10:04,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=886914.0, ans=0.125 2023-06-21 05:10:26,641 INFO [train.py:996] (1/4) Epoch 5, batch 25850, loss[loss=0.2691, simple_loss=0.3323, pruned_loss=0.1029, over 21382.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3407, pruned_loss=0.09807, over 4268587.18 frames. ], batch size: 144, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:10:27,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.95 vs. limit=5.0 2023-06-21 05:10:33,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=886974.0, ans=0.2 2023-06-21 05:10:59,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=887034.0, ans=0.1 2023-06-21 05:11:14,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=887094.0, ans=0.125 2023-06-21 05:11:49,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-21 05:11:54,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=887214.0, ans=0.1 2023-06-21 05:12:07,861 INFO [train.py:996] (1/4) Epoch 5, batch 25900, loss[loss=0.2836, simple_loss=0.3701, pruned_loss=0.09855, over 21722.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3407, pruned_loss=0.09796, over 4273983.86 frames. 
], batch size: 247, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:13:26,829 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.088e+02 3.549e+02 4.240e+02 5.933e+02, threshold=7.098e+02, percent-clipped=0.0 2023-06-21 05:13:37,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=887514.0, ans=0.0 2023-06-21 05:13:58,656 INFO [train.py:996] (1/4) Epoch 5, batch 25950, loss[loss=0.2735, simple_loss=0.3474, pruned_loss=0.09978, over 21759.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3465, pruned_loss=0.1008, over 4273043.75 frames. ], batch size: 113, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:14:19,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=887634.0, ans=0.0 2023-06-21 05:14:57,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=887754.0, ans=0.2 2023-06-21 05:15:11,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=887754.0, ans=0.0 2023-06-21 05:15:16,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=887814.0, ans=0.0 2023-06-21 05:15:29,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=887814.0, ans=0.95 2023-06-21 05:15:40,348 INFO [train.py:996] (1/4) Epoch 5, batch 26000, loss[loss=0.3592, simple_loss=0.4068, pruned_loss=0.1559, over 21357.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.347, pruned_loss=0.1, over 4278008.88 frames. ], batch size: 507, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:15:52,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=887874.0, ans=0.025 2023-06-21 05:15:56,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=887874.0, ans=0.2 2023-06-21 05:15:58,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=887874.0, ans=0.125 2023-06-21 05:16:52,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.994e+02 3.502e+02 4.127e+02 6.076e+02, threshold=7.004e+02, percent-clipped=0.0 2023-06-21 05:16:52,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=888054.0, ans=0.2 2023-06-21 05:16:58,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.48 vs. limit=10.0 2023-06-21 05:17:14,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=888114.0, ans=0.125 2023-06-21 05:17:19,601 INFO [train.py:996] (1/4) Epoch 5, batch 26050, loss[loss=0.3013, simple_loss=0.355, pruned_loss=0.1238, over 21810.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.346, pruned_loss=0.1012, over 4280921.80 frames. 
], batch size: 441, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:17:58,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=888234.0, ans=0.1 2023-06-21 05:18:00,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-21 05:18:13,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=888294.0, ans=0.1 2023-06-21 05:18:25,450 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:18:33,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-21 05:18:58,123 INFO [train.py:996] (1/4) Epoch 5, batch 26100, loss[loss=0.2262, simple_loss=0.2929, pruned_loss=0.07971, over 21672.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3405, pruned_loss=0.1003, over 4281657.03 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:19:15,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=888474.0, ans=0.2 2023-06-21 05:19:25,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=888534.0, ans=0.0 2023-06-21 05:19:33,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-21 05:20:05,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 2.983e+02 3.615e+02 4.836e+02 1.225e+03, threshold=7.230e+02, percent-clipped=7.0 2023-06-21 05:20:24,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 05:20:39,118 INFO [train.py:996] (1/4) Epoch 5, batch 26150, loss[loss=0.2746, simple_loss=0.3436, pruned_loss=0.1028, over 21791.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3368, pruned_loss=0.09955, over 4281755.18 frames. ], batch size: 441, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:20:53,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=888834.0, ans=0.0 2023-06-21 05:20:55,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=888834.0, ans=0.0 2023-06-21 05:21:14,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=888834.0, ans=0.1 2023-06-21 05:21:43,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888954.0, ans=0.1 2023-06-21 05:22:19,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-21 05:22:20,380 INFO [train.py:996] (1/4) Epoch 5, batch 26200, loss[loss=0.2825, simple_loss=0.3742, pruned_loss=0.09542, over 21755.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3368, pruned_loss=0.09705, over 4278466.14 frames. 
], batch size: 332, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:23:27,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=889254.0, ans=0.0 2023-06-21 05:23:33,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.909e+02 3.359e+02 4.257e+02 6.778e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-21 05:23:46,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889314.0, ans=0.125 2023-06-21 05:23:56,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889314.0, ans=0.125 2023-06-21 05:24:01,201 INFO [train.py:996] (1/4) Epoch 5, batch 26250, loss[loss=0.2559, simple_loss=0.3197, pruned_loss=0.09604, over 21313.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.34, pruned_loss=0.09556, over 4281488.49 frames. ], batch size: 176, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:24:19,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=889374.0, ans=0.125 2023-06-21 05:25:26,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889614.0, ans=0.1 2023-06-21 05:25:27,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 05:25:39,635 INFO [train.py:996] (1/4) Epoch 5, batch 26300, loss[loss=0.2318, simple_loss=0.3045, pruned_loss=0.07955, over 21878.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.337, pruned_loss=0.09616, over 4289375.41 frames. ], batch size: 351, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:25:58,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=889734.0, ans=0.125 2023-06-21 05:26:16,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=889794.0, ans=10.0 2023-06-21 05:26:58,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.888e+02 3.226e+02 3.870e+02 6.035e+02, threshold=6.451e+02, percent-clipped=0.0 2023-06-21 05:26:59,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889854.0, ans=0.125 2023-06-21 05:27:25,019 INFO [train.py:996] (1/4) Epoch 5, batch 26350, loss[loss=0.2758, simple_loss=0.3443, pruned_loss=0.1036, over 21290.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.335, pruned_loss=0.09657, over 4295957.55 frames. ], batch size: 548, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:28:44,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=890214.0, ans=0.125 2023-06-21 05:28:59,008 INFO [train.py:996] (1/4) Epoch 5, batch 26400, loss[loss=0.2522, simple_loss=0.2947, pruned_loss=0.1049, over 21611.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3294, pruned_loss=0.09717, over 4297201.99 frames. 
], batch size: 247, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:30:08,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=890454.0, ans=0.09899494936611666 2023-06-21 05:30:10,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.951e+02 3.748e+02 4.421e+02 1.228e+03, threshold=7.496e+02, percent-clipped=6.0 2023-06-21 05:30:13,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=890454.0, ans=0.125 2023-06-21 05:30:15,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-21 05:30:25,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=890514.0, ans=0.1 2023-06-21 05:30:37,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=890574.0, ans=0.0 2023-06-21 05:30:37,998 INFO [train.py:996] (1/4) Epoch 5, batch 26450, loss[loss=0.2847, simple_loss=0.3825, pruned_loss=0.09347, over 21839.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3301, pruned_loss=0.09694, over 4285597.41 frames. ], batch size: 372, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:31:02,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-21 05:31:35,162 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:31:53,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-21 05:32:05,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-21 05:32:17,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=890874.0, ans=0.125 2023-06-21 05:32:19,172 INFO [train.py:996] (1/4) Epoch 5, batch 26500, loss[loss=0.2681, simple_loss=0.3695, pruned_loss=0.08329, over 20814.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3325, pruned_loss=0.09545, over 4272363.69 frames. 
], batch size: 607, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:32:32,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=890874.0, ans=0.1 2023-06-21 05:32:36,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=890934.0, ans=0.125 2023-06-21 05:32:59,951 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:33:26,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=891054.0, ans=0.125 2023-06-21 05:33:42,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.071e+02 3.801e+02 4.543e+02 1.004e+03, threshold=7.603e+02, percent-clipped=5.0 2023-06-21 05:33:49,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=891114.0, ans=0.07 2023-06-21 05:34:01,413 INFO [train.py:996] (1/4) Epoch 5, batch 26550, loss[loss=0.231, simple_loss=0.3296, pruned_loss=0.06621, over 21634.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3286, pruned_loss=0.09175, over 4273011.50 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:34:23,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891234.0, ans=0.1 2023-06-21 05:34:23,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=891234.0, ans=0.0 2023-06-21 05:34:26,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=891234.0, ans=0.0 2023-06-21 05:34:46,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-21 05:34:49,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-21 05:35:34,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=891414.0, ans=0.2 2023-06-21 05:35:40,869 INFO [train.py:996] (1/4) Epoch 5, batch 26600, loss[loss=0.2422, simple_loss=0.3029, pruned_loss=0.09076, over 21754.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3279, pruned_loss=0.08883, over 4268604.19 frames. 
], batch size: 124, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:35:51,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=891474.0, ans=0.07 2023-06-21 05:35:57,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=891474.0, ans=0.125 2023-06-21 05:36:07,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=891534.0, ans=0.125 2023-06-21 05:36:41,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=891654.0, ans=0.04949747468305833 2023-06-21 05:36:59,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.044e+02 3.559e+02 4.505e+02 6.702e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 05:37:01,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=891714.0, ans=0.0 2023-06-21 05:37:13,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-21 05:37:23,629 INFO [train.py:996] (1/4) Epoch 5, batch 26650, loss[loss=0.2473, simple_loss=0.3215, pruned_loss=0.08657, over 21504.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3213, pruned_loss=0.08805, over 4263574.54 frames. ], batch size: 473, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:38:29,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=891954.0, ans=0.2 2023-06-21 05:38:35,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891954.0, ans=0.1 2023-06-21 05:38:47,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=892014.0, ans=0.125 2023-06-21 05:39:01,910 INFO [train.py:996] (1/4) Epoch 5, batch 26700, loss[loss=0.2244, simple_loss=0.2853, pruned_loss=0.08175, over 21251.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3129, pruned_loss=0.08359, over 4264502.46 frames. ], batch size: 159, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:40:18,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.509e+02 2.895e+02 3.334e+02 4.980e+02, threshold=5.790e+02, percent-clipped=0.0 2023-06-21 05:40:47,742 INFO [train.py:996] (1/4) Epoch 5, batch 26750, loss[loss=0.2884, simple_loss=0.3696, pruned_loss=0.1036, over 21776.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3133, pruned_loss=0.08281, over 4267560.05 frames. ], batch size: 118, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:41:13,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.25 vs. 
limit=22.5 2023-06-21 05:41:37,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=892494.0, ans=0.125 2023-06-21 05:41:44,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=892554.0, ans=0.125 2023-06-21 05:41:45,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=892554.0, ans=0.125 2023-06-21 05:42:03,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=892614.0, ans=0.0 2023-06-21 05:42:23,650 INFO [train.py:996] (1/4) Epoch 5, batch 26800, loss[loss=0.2439, simple_loss=0.3193, pruned_loss=0.08427, over 20819.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3225, pruned_loss=0.08891, over 4271731.35 frames. ], batch size: 611, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:42:24,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-21 05:42:34,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=892674.0, ans=0.125 2023-06-21 05:43:18,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-21 05:43:27,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=892854.0, ans=0.125 2023-06-21 05:43:45,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 3.040e+02 3.566e+02 4.548e+02 6.934e+02, threshold=7.132e+02, percent-clipped=4.0 2023-06-21 05:44:03,855 INFO [train.py:996] (1/4) Epoch 5, batch 26850, loss[loss=0.2141, simple_loss=0.2721, pruned_loss=0.07808, over 21301.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.322, pruned_loss=0.09074, over 4277636.34 frames. ], batch size: 144, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:44:29,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=893034.0, ans=0.0 2023-06-21 05:45:09,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=893154.0, ans=0.125 2023-06-21 05:45:33,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-21 05:45:43,429 INFO [train.py:996] (1/4) Epoch 5, batch 26900, loss[loss=0.2235, simple_loss=0.2735, pruned_loss=0.08678, over 21676.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3135, pruned_loss=0.08981, over 4277727.73 frames. 
], batch size: 282, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:45:50,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=893274.0, ans=0.125 2023-06-21 05:46:34,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=893394.0, ans=0.1 2023-06-21 05:46:40,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=893454.0, ans=0.125 2023-06-21 05:47:04,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.809e+02 3.300e+02 3.699e+02 7.956e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-21 05:47:16,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=893514.0, ans=0.1 2023-06-21 05:47:22,062 INFO [train.py:996] (1/4) Epoch 5, batch 26950, loss[loss=0.2534, simple_loss=0.329, pruned_loss=0.08888, over 21193.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3141, pruned_loss=0.09057, over 4270205.51 frames. ], batch size: 143, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:47:52,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=893634.0, ans=0.0 2023-06-21 05:48:07,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=893694.0, ans=0.125 2023-06-21 05:48:07,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=893694.0, ans=0.05 2023-06-21 05:48:08,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.25 vs. limit=22.5 2023-06-21 05:48:22,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=893694.0, ans=0.125 2023-06-21 05:48:44,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 05:49:02,222 INFO [train.py:996] (1/4) Epoch 5, batch 27000, loss[loss=0.2187, simple_loss=0.3018, pruned_loss=0.06783, over 21656.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3141, pruned_loss=0.08789, over 4268815.44 frames. ], batch size: 247, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:49:02,223 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 05:49:20,484 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2444, simple_loss=0.3449, pruned_loss=0.07195, over 1796401.00 frames. 
2023-06-21 05:49:20,485 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 05:49:57,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=893934.0, ans=0.0 2023-06-21 05:50:12,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=893994.0, ans=0.125 2023-06-21 05:50:30,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=894054.0, ans=0.0 2023-06-21 05:50:38,948 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.497e+02 2.990e+02 3.496e+02 4.876e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 05:50:51,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-21 05:50:59,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=894174.0, ans=0.0 2023-06-21 05:51:01,165 INFO [train.py:996] (1/4) Epoch 5, batch 27050, loss[loss=0.2529, simple_loss=0.3302, pruned_loss=0.08783, over 21889.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3182, pruned_loss=0.08588, over 4276682.81 frames. ], batch size: 371, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:51:06,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=894174.0, ans=0.0 2023-06-21 05:51:45,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-21 05:51:54,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=894294.0, ans=0.0 2023-06-21 05:52:42,200 INFO [train.py:996] (1/4) Epoch 5, batch 27100, loss[loss=0.2743, simple_loss=0.3623, pruned_loss=0.0931, over 21733.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3206, pruned_loss=0.08712, over 4275700.68 frames. ], batch size: 414, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:53:10,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=894534.0, ans=0.0 2023-06-21 05:53:11,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-21 05:54:00,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.188e+02 3.922e+02 5.852e+02 9.183e+02, threshold=7.845e+02, percent-clipped=23.0 2023-06-21 05:54:18,421 INFO [train.py:996] (1/4) Epoch 5, batch 27150, loss[loss=0.2513, simple_loss=0.3361, pruned_loss=0.08324, over 21428.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3331, pruned_loss=0.09166, over 4283290.55 frames. ], batch size: 211, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:55:37,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=894954.0, ans=0.035 2023-06-21 05:55:42,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=895014.0, ans=0.125 2023-06-21 05:55:58,210 INFO [train.py:996] (1/4) Epoch 5, batch 27200, loss[loss=0.3025, simple_loss=0.3653, pruned_loss=0.1198, over 21362.00 frames. 
], tot_loss[loss=0.2666, simple_loss=0.3424, pruned_loss=0.09537, over 4277989.03 frames. ], batch size: 548, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:56:23,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=895074.0, ans=0.125 2023-06-21 05:56:50,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.83 vs. limit=10.0 2023-06-21 05:57:14,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=895254.0, ans=0.125 2023-06-21 05:57:14,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=895254.0, ans=0.125 2023-06-21 05:57:23,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.260e+02 3.703e+02 4.553e+02 9.386e+02, threshold=7.407e+02, percent-clipped=2.0 2023-06-21 05:57:29,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=895314.0, ans=0.125 2023-06-21 05:57:52,265 INFO [train.py:996] (1/4) Epoch 5, batch 27250, loss[loss=0.3149, simple_loss=0.3719, pruned_loss=0.1289, over 21427.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3445, pruned_loss=0.09919, over 4276255.36 frames. ], batch size: 471, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 05:59:16,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=895614.0, ans=0.2 2023-06-21 05:59:17,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=895614.0, ans=0.02 2023-06-21 05:59:33,979 INFO [train.py:996] (1/4) Epoch 5, batch 27300, loss[loss=0.2908, simple_loss=0.3553, pruned_loss=0.1132, over 21271.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3471, pruned_loss=0.1009, over 4279972.76 frames. ], batch size: 143, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:00:01,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=895734.0, ans=0.125 2023-06-21 06:00:02,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=895734.0, ans=0.125 2023-06-21 06:00:04,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=895734.0, ans=0.125 2023-06-21 06:00:27,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=895794.0, ans=0.125 2023-06-21 06:00:57,904 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.999e+02 3.424e+02 4.068e+02 6.879e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-21 06:01:07,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=895914.0, ans=0.2 2023-06-21 06:01:15,079 INFO [train.py:996] (1/4) Epoch 5, batch 27350, loss[loss=0.2572, simple_loss=0.3309, pruned_loss=0.09178, over 21615.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.349, pruned_loss=0.1002, over 4277788.44 frames. 
], batch size: 230, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:01:36,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=896034.0, ans=0.125 2023-06-21 06:01:46,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-21 06:02:31,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.98 vs. limit=10.0 2023-06-21 06:02:41,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=896214.0, ans=0.0 2023-06-21 06:02:54,637 INFO [train.py:996] (1/4) Epoch 5, batch 27400, loss[loss=0.2603, simple_loss=0.3241, pruned_loss=0.09828, over 21880.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3433, pruned_loss=0.09917, over 4280997.38 frames. ], batch size: 371, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:03:10,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896334.0, ans=0.1 2023-06-21 06:03:23,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=896334.0, ans=0.0 2023-06-21 06:04:02,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=896454.0, ans=0.1 2023-06-21 06:04:08,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.765e+02 3.152e+02 3.980e+02 5.730e+02, threshold=6.304e+02, percent-clipped=0.0 2023-06-21 06:04:34,752 INFO [train.py:996] (1/4) Epoch 5, batch 27450, loss[loss=0.2372, simple_loss=0.3268, pruned_loss=0.07384, over 20004.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3361, pruned_loss=0.09634, over 4280790.83 frames. ], batch size: 702, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:05:05,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-21 06:05:31,985 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:06:14,861 INFO [train.py:996] (1/4) Epoch 5, batch 27500, loss[loss=0.2431, simple_loss=0.3069, pruned_loss=0.08965, over 21924.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3331, pruned_loss=0.09645, over 4282230.85 frames. ], batch size: 316, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:06:44,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-21 06:07:29,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.898e+02 3.228e+02 3.815e+02 7.854e+02, threshold=6.456e+02, percent-clipped=2.0 2023-06-21 06:07:32,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=897114.0, ans=0.0 2023-06-21 06:07:54,329 INFO [train.py:996] (1/4) Epoch 5, batch 27550, loss[loss=0.2257, simple_loss=0.2788, pruned_loss=0.0863, over 20772.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3275, pruned_loss=0.09368, over 4283539.45 frames. 
], batch size: 607, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:07:56,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=897174.0, ans=0.2 2023-06-21 06:08:20,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=897234.0, ans=0.2 2023-06-21 06:08:23,974 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:08:53,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=897354.0, ans=0.125 2023-06-21 06:08:59,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=897354.0, ans=0.0 2023-06-21 06:09:29,798 INFO [train.py:996] (1/4) Epoch 5, batch 27600, loss[loss=0.2321, simple_loss=0.2884, pruned_loss=0.08788, over 21386.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3196, pruned_loss=0.0923, over 4275919.28 frames. ], batch size: 160, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:10:08,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=897594.0, ans=0.0 2023-06-21 06:10:09,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-21 06:10:39,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=897654.0, ans=0.0 2023-06-21 06:10:43,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.760e+02 3.130e+02 3.904e+02 5.692e+02, threshold=6.260e+02, percent-clipped=0.0 2023-06-21 06:10:51,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=897714.0, ans=0.5 2023-06-21 06:11:08,627 INFO [train.py:996] (1/4) Epoch 5, batch 27650, loss[loss=0.2512, simple_loss=0.3266, pruned_loss=0.08788, over 21398.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3145, pruned_loss=0.09123, over 4269601.14 frames. ], batch size: 176, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:11:15,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=897774.0, ans=0.125 2023-06-21 06:11:39,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-21 06:11:42,774 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-21 06:11:52,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=897894.0, ans=10.0 2023-06-21 06:11:53,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=897894.0, ans=0.0 2023-06-21 06:12:11,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. 
limit=10.0 2023-06-21 06:12:20,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=897954.0, ans=0.2 2023-06-21 06:12:25,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=898014.0, ans=0.2 2023-06-21 06:12:49,023 INFO [train.py:996] (1/4) Epoch 5, batch 27700, loss[loss=0.3215, simple_loss=0.393, pruned_loss=0.125, over 21643.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.315, pruned_loss=0.09003, over 4275194.58 frames. ], batch size: 441, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:13:02,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=898074.0, ans=0.0 2023-06-21 06:13:36,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=898194.0, ans=0.2 2023-06-21 06:14:07,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.066e+02 3.761e+02 4.326e+02 8.310e+02, threshold=7.523e+02, percent-clipped=4.0 2023-06-21 06:14:27,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=898374.0, ans=0.0 2023-06-21 06:14:28,396 INFO [train.py:996] (1/4) Epoch 5, batch 27750, loss[loss=0.2089, simple_loss=0.2816, pruned_loss=0.06807, over 21745.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3189, pruned_loss=0.0895, over 4279608.59 frames. ], batch size: 124, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:14:44,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=898434.0, ans=0.125 2023-06-21 06:14:51,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=898434.0, ans=0.2 2023-06-21 06:16:06,798 INFO [train.py:996] (1/4) Epoch 5, batch 27800, loss[loss=0.2449, simple_loss=0.3078, pruned_loss=0.091, over 21692.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3164, pruned_loss=0.08921, over 4279883.49 frames. ], batch size: 263, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:16:59,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=12.0 2023-06-21 06:17:17,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=898854.0, ans=0.125 2023-06-21 06:17:26,840 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.753e+02 3.252e+02 3.951e+02 6.290e+02, threshold=6.504e+02, percent-clipped=0.0 2023-06-21 06:17:48,405 INFO [train.py:996] (1/4) Epoch 5, batch 27850, loss[loss=0.2444, simple_loss=0.3362, pruned_loss=0.07635, over 21320.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3166, pruned_loss=0.09058, over 4286915.38 frames. ], batch size: 176, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:18:57,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.01 vs. 
limit=15.0 2023-06-21 06:19:16,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899214.0, ans=0.1 2023-06-21 06:19:30,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=899274.0, ans=0.0 2023-06-21 06:19:31,892 INFO [train.py:996] (1/4) Epoch 5, batch 27900, loss[loss=0.2246, simple_loss=0.3088, pruned_loss=0.07023, over 21444.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3258, pruned_loss=0.09149, over 4283897.56 frames. ], batch size: 211, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:19:34,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-21 06:19:55,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=899334.0, ans=0.0 2023-06-21 06:20:04,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-21 06:20:57,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.911e+02 3.342e+02 3.967e+02 6.742e+02, threshold=6.683e+02, percent-clipped=1.0 2023-06-21 06:21:19,104 INFO [train.py:996] (1/4) Epoch 5, batch 27950, loss[loss=0.2265, simple_loss=0.3141, pruned_loss=0.06942, over 21706.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3244, pruned_loss=0.08737, over 4280376.62 frames. ], batch size: 298, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:21:20,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5 2023-06-21 06:21:23,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-21 06:21:27,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=899574.0, ans=0.07 2023-06-21 06:21:37,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=899634.0, ans=0.2 2023-06-21 06:21:44,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=899634.0, ans=0.1 2023-06-21 06:22:32,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-21 06:22:52,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=899814.0, ans=0.125 2023-06-21 06:22:59,473 INFO [train.py:996] (1/4) Epoch 5, batch 28000, loss[loss=0.2645, simple_loss=0.3211, pruned_loss=0.1039, over 21319.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.321, pruned_loss=0.08478, over 4281821.57 frames. 
], batch size: 176, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:23:01,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=899874.0, ans=0.125 2023-06-21 06:23:25,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=899934.0, ans=0.2 2023-06-21 06:23:41,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-21 06:24:09,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=900054.0, ans=0.0 2023-06-21 06:24:19,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.872e+02 3.186e+02 3.800e+02 5.572e+02, threshold=6.373e+02, percent-clipped=0.0 2023-06-21 06:24:40,546 INFO [train.py:996] (1/4) Epoch 5, batch 28050, loss[loss=0.1897, simple_loss=0.2565, pruned_loss=0.06147, over 21417.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3181, pruned_loss=0.08608, over 4274117.56 frames. ], batch size: 211, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:24:45,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=900174.0, ans=0.125 2023-06-21 06:24:47,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=900174.0, ans=0.125 2023-06-21 06:24:59,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2023-06-21 06:25:23,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=900294.0, ans=0.125 2023-06-21 06:25:35,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-21 06:25:48,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=900354.0, ans=0.0 2023-06-21 06:26:20,959 INFO [train.py:996] (1/4) Epoch 5, batch 28100, loss[loss=0.266, simple_loss=0.3156, pruned_loss=0.1082, over 21782.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3163, pruned_loss=0.08635, over 4268077.77 frames. 
], batch size: 351, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:26:32,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=900474.0, ans=0.125 2023-06-21 06:27:17,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=900594.0, ans=0.125 2023-06-21 06:27:47,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.042e+02 3.636e+02 4.421e+02 1.163e+03, threshold=7.272e+02, percent-clipped=7.0 2023-06-21 06:27:49,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=900714.0, ans=0.0 2023-06-21 06:27:52,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=900714.0, ans=0.1 2023-06-21 06:27:57,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=900714.0, ans=0.125 2023-06-21 06:28:07,005 INFO [train.py:996] (1/4) Epoch 5, batch 28150, loss[loss=0.1941, simple_loss=0.2557, pruned_loss=0.06627, over 21621.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3115, pruned_loss=0.08677, over 4264610.86 frames. ], batch size: 247, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:28:23,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=900774.0, ans=0.0 2023-06-21 06:28:37,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-21 06:28:44,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=900834.0, ans=0.0 2023-06-21 06:29:20,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-06-21 06:29:25,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-21 06:29:48,140 INFO [train.py:996] (1/4) Epoch 5, batch 28200, loss[loss=0.2473, simple_loss=0.3106, pruned_loss=0.09197, over 21368.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3089, pruned_loss=0.08839, over 4259219.59 frames. ], batch size: 549, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:30:12,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=901134.0, ans=0.125 2023-06-21 06:30:18,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=901134.0, ans=0.2 2023-06-21 06:30:28,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=901194.0, ans=0.1 2023-06-21 06:31:02,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-21 06:31:14,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.109e+02 3.691e+02 4.482e+02 7.045e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-21 06:31:33,973 INFO [train.py:996] (1/4) Epoch 5, batch 28250, loss[loss=0.249, simple_loss=0.3056, pruned_loss=0.09623, over 21747.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3135, pruned_loss=0.09269, over 4263993.86 frames. 
], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:31:46,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-21 06:31:54,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-21 06:32:54,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=901614.0, ans=0.125 2023-06-21 06:33:15,362 INFO [train.py:996] (1/4) Epoch 5, batch 28300, loss[loss=0.2161, simple_loss=0.2757, pruned_loss=0.07825, over 21450.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3121, pruned_loss=0.09012, over 4267077.85 frames. ], batch size: 212, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:33:31,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-21 06:34:05,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=901794.0, ans=0.2 2023-06-21 06:34:41,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.744e+02 3.366e+02 4.135e+02 8.525e+02, threshold=6.731e+02, percent-clipped=3.0 2023-06-21 06:34:53,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=901914.0, ans=0.125 2023-06-21 06:34:56,281 INFO [train.py:996] (1/4) Epoch 5, batch 28350, loss[loss=0.3102, simple_loss=0.3639, pruned_loss=0.1283, over 21342.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.31, pruned_loss=0.08453, over 4267688.04 frames. ], batch size: 507, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:35:26,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902034.0, ans=0.1 2023-06-21 06:35:28,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=902034.0, ans=0.0 2023-06-21 06:35:29,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=902034.0, ans=0.125 2023-06-21 06:35:59,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=902154.0, ans=0.125 2023-06-21 06:36:20,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902214.0, ans=0.1 2023-06-21 06:36:24,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.32 vs. limit=15.0 2023-06-21 06:36:25,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=902214.0, ans=0.125 2023-06-21 06:36:40,509 INFO [train.py:996] (1/4) Epoch 5, batch 28400, loss[loss=0.2381, simple_loss=0.2938, pruned_loss=0.0912, over 21275.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3055, pruned_loss=0.08408, over 4262666.49 frames. 
], batch size: 176, lr: 5.95e-03, grad_scale: 32.0 2023-06-21 06:36:52,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=902274.0, ans=0.125 2023-06-21 06:37:00,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=902334.0, ans=0.0 2023-06-21 06:37:50,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-21 06:37:57,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=902454.0, ans=0.125 2023-06-21 06:38:03,529 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.069e+02 3.636e+02 4.494e+02 7.236e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-21 06:38:19,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902574.0, ans=0.1 2023-06-21 06:38:20,750 INFO [train.py:996] (1/4) Epoch 5, batch 28450, loss[loss=0.2767, simple_loss=0.3355, pruned_loss=0.109, over 21831.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3124, pruned_loss=0.08826, over 4272022.71 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:38:44,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=902634.0, ans=0.125 2023-06-21 06:39:14,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-21 06:39:31,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=902754.0, ans=0.0 2023-06-21 06:39:34,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-21 06:39:59,730 INFO [train.py:996] (1/4) Epoch 5, batch 28500, loss[loss=0.3306, simple_loss=0.3786, pruned_loss=0.1413, over 21491.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3163, pruned_loss=0.09177, over 4276862.81 frames. ], batch size: 507, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:40:00,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-21 06:40:36,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=902994.0, ans=0.1 2023-06-21 06:41:05,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-21 06:41:14,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=903054.0, ans=0.1 2023-06-21 06:41:28,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.871e+02 3.406e+02 3.870e+02 6.038e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 06:41:41,920 INFO [train.py:996] (1/4) Epoch 5, batch 28550, loss[loss=0.2153, simple_loss=0.3362, pruned_loss=0.04715, over 19908.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3249, pruned_loss=0.09506, over 4275058.21 frames. 
], batch size: 702, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:41:50,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=903174.0, ans=0.05 2023-06-21 06:42:06,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-21 06:43:07,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=903414.0, ans=0.0 2023-06-21 06:43:22,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-21 06:43:24,811 INFO [train.py:996] (1/4) Epoch 5, batch 28600, loss[loss=0.2792, simple_loss=0.3493, pruned_loss=0.1045, over 21691.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3317, pruned_loss=0.09673, over 4278833.97 frames. ], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:43:25,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=903474.0, ans=0.2 2023-06-21 06:44:23,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=903594.0, ans=0.1 2023-06-21 06:44:34,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=903654.0, ans=0.125 2023-06-21 06:44:54,243 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 2.937e+02 3.367e+02 4.039e+02 6.744e+02, threshold=6.734e+02, percent-clipped=0.0 2023-06-21 06:45:12,135 INFO [train.py:996] (1/4) Epoch 5, batch 28650, loss[loss=0.2253, simple_loss=0.2809, pruned_loss=0.08489, over 21242.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3267, pruned_loss=0.0963, over 4265725.63 frames. ], batch size: 549, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:45:41,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-21 06:45:54,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-21 06:46:06,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=903894.0, ans=0.025 2023-06-21 06:46:56,187 INFO [train.py:996] (1/4) Epoch 5, batch 28700, loss[loss=0.2671, simple_loss=0.3314, pruned_loss=0.1014, over 21774.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.326, pruned_loss=0.09767, over 4272683.77 frames. ], batch size: 441, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:47:03,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-21 06:48:13,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.955e+02 3.205e+02 3.884e+02 6.833e+02, threshold=6.409e+02, percent-clipped=1.0 2023-06-21 06:48:37,248 INFO [train.py:996] (1/4) Epoch 5, batch 28750, loss[loss=0.2112, simple_loss=0.2907, pruned_loss=0.0659, over 21118.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3257, pruned_loss=0.09749, over 4274843.21 frames. 
], batch size: 607, lr: 5.94e-03, grad_scale: 16.0 2023-06-21 06:48:47,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904374.0, ans=0.1 2023-06-21 06:48:57,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=904434.0, ans=0.2 2023-06-21 06:49:11,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=904434.0, ans=0.04949747468305833 2023-06-21 06:49:13,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904434.0, ans=0.1 2023-06-21 06:49:29,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=904494.0, ans=0.125 2023-06-21 06:49:34,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=904554.0, ans=0.125 2023-06-21 06:49:47,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=904554.0, ans=0.125 2023-06-21 06:50:08,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=904614.0, ans=0.125 2023-06-21 06:50:17,670 INFO [train.py:996] (1/4) Epoch 5, batch 28800, loss[loss=0.2449, simple_loss=0.3218, pruned_loss=0.08402, over 20668.00 frames. ], tot_loss[loss=0.263, simple_loss=0.33, pruned_loss=0.09802, over 4276134.42 frames. ], batch size: 607, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:50:26,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=904674.0, ans=0.0 2023-06-21 06:51:07,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-21 06:51:23,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=904854.0, ans=0.125 2023-06-21 06:51:23,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904854.0, ans=0.1 2023-06-21 06:51:33,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=904854.0, ans=0.1 2023-06-21 06:51:39,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=904914.0, ans=0.125 2023-06-21 06:51:45,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.918e+02 3.315e+02 4.122e+02 9.599e+02, threshold=6.630e+02, percent-clipped=10.0 2023-06-21 06:52:08,212 INFO [train.py:996] (1/4) Epoch 5, batch 28850, loss[loss=0.2439, simple_loss=0.3151, pruned_loss=0.08634, over 21801.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3319, pruned_loss=0.09913, over 4277465.62 frames. 
], batch size: 441, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:52:13,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=904974.0, ans=0.125 2023-06-21 06:53:08,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-21 06:53:09,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=905154.0, ans=0.125 2023-06-21 06:53:20,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=905154.0, ans=0.0 2023-06-21 06:53:48,803 INFO [train.py:996] (1/4) Epoch 5, batch 28900, loss[loss=0.2442, simple_loss=0.3075, pruned_loss=0.09046, over 21679.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3333, pruned_loss=0.09982, over 4278492.04 frames. ], batch size: 230, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:53:49,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-21 06:54:12,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=905334.0, ans=0.0 2023-06-21 06:54:28,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2023-06-21 06:54:43,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=905454.0, ans=0.125 2023-06-21 06:54:52,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=905454.0, ans=0.2 2023-06-21 06:55:17,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.129e+02 3.502e+02 4.010e+02 6.253e+02, threshold=7.003e+02, percent-clipped=0.0 2023-06-21 06:55:31,399 INFO [train.py:996] (1/4) Epoch 5, batch 28950, loss[loss=0.2362, simple_loss=0.3335, pruned_loss=0.0694, over 21785.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3364, pruned_loss=0.09927, over 4276977.88 frames. ], batch size: 351, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:55:43,731 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-21 06:56:09,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=905694.0, ans=0.125 2023-06-21 06:56:11,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=905694.0, ans=0.015 2023-06-21 06:56:32,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=905694.0, ans=0.0 2023-06-21 06:56:35,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=905754.0, ans=0.0 2023-06-21 06:57:12,946 INFO [train.py:996] (1/4) Epoch 5, batch 29000, loss[loss=0.2614, simple_loss=0.3341, pruned_loss=0.09435, over 22014.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3384, pruned_loss=0.0979, over 4279758.09 frames. 
], batch size: 317, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:57:24,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=905874.0, ans=0.1 2023-06-21 06:58:26,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906054.0, ans=0.1 2023-06-21 06:58:39,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.974e+02 3.473e+02 4.065e+02 5.758e+02, threshold=6.947e+02, percent-clipped=0.0 2023-06-21 06:58:42,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=906114.0, ans=0.0 2023-06-21 06:58:52,018 INFO [train.py:996] (1/4) Epoch 5, batch 29050, loss[loss=0.2282, simple_loss=0.3, pruned_loss=0.07823, over 21940.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3366, pruned_loss=0.09905, over 4281428.90 frames. ], batch size: 113, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:00:05,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-06-21 07:00:10,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=906354.0, ans=0.2 2023-06-21 07:00:26,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=906414.0, ans=0.0 2023-06-21 07:00:32,306 INFO [train.py:996] (1/4) Epoch 5, batch 29100, loss[loss=0.2324, simple_loss=0.2901, pruned_loss=0.08733, over 21746.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3269, pruned_loss=0.09605, over 4281525.90 frames. ], batch size: 351, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:01:26,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=906594.0, ans=0.2 2023-06-21 07:02:00,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.771e+02 3.124e+02 3.774e+02 6.095e+02, threshold=6.248e+02, percent-clipped=0.0 2023-06-21 07:02:10,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=906714.0, ans=0.125 2023-06-21 07:02:13,076 INFO [train.py:996] (1/4) Epoch 5, batch 29150, loss[loss=0.2473, simple_loss=0.3115, pruned_loss=0.0915, over 21314.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3252, pruned_loss=0.09385, over 4280932.60 frames. ], batch size: 211, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:03:32,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=906954.0, ans=0.09899494936611666 2023-06-21 07:03:32,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=906954.0, ans=0.0 2023-06-21 07:03:53,177 INFO [train.py:996] (1/4) Epoch 5, batch 29200, loss[loss=0.2095, simple_loss=0.2735, pruned_loss=0.07275, over 21328.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3192, pruned_loss=0.0922, over 4285055.22 frames. 
], batch size: 131, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:03:56,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=907074.0, ans=0.125 2023-06-21 07:04:50,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=907194.0, ans=0.125 2023-06-21 07:04:56,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=907194.0, ans=0.1 2023-06-21 07:05:14,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907254.0, ans=0.125 2023-06-21 07:05:22,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.832e+02 3.192e+02 3.760e+02 6.246e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-21 07:05:22,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=907314.0, ans=0.125 2023-06-21 07:05:37,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907314.0, ans=0.1 2023-06-21 07:05:40,657 INFO [train.py:996] (1/4) Epoch 5, batch 29250, loss[loss=0.2051, simple_loss=0.289, pruned_loss=0.06061, over 21555.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.316, pruned_loss=0.08887, over 4271568.03 frames. ], batch size: 230, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:06:38,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=907494.0, ans=0.0 2023-06-21 07:06:54,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=907554.0, ans=0.125 2023-06-21 07:07:15,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=907614.0, ans=0.0 2023-06-21 07:07:21,316 INFO [train.py:996] (1/4) Epoch 5, batch 29300, loss[loss=0.245, simple_loss=0.3169, pruned_loss=0.08651, over 19920.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3195, pruned_loss=0.08884, over 4273234.43 frames. ], batch size: 703, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:08:07,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=907734.0, ans=0.125 2023-06-21 07:08:08,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=907794.0, ans=0.2 2023-06-21 07:08:46,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.808e+02 3.408e+02 4.024e+02 6.878e+02, threshold=6.816e+02, percent-clipped=1.0 2023-06-21 07:09:05,125 INFO [train.py:996] (1/4) Epoch 5, batch 29350, loss[loss=0.2373, simple_loss=0.3104, pruned_loss=0.08212, over 21326.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.316, pruned_loss=0.08833, over 4273893.48 frames. 
], batch size: 160, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:09:21,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=907974.0, ans=0.015 2023-06-21 07:10:45,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=908214.0, ans=0.0 2023-06-21 07:10:45,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-21 07:10:53,127 INFO [train.py:996] (1/4) Epoch 5, batch 29400, loss[loss=0.2025, simple_loss=0.2519, pruned_loss=0.07659, over 20209.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3138, pruned_loss=0.08584, over 4271105.55 frames. ], batch size: 704, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:11:08,161 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:11:24,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=908334.0, ans=0.0 2023-06-21 07:12:05,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=908454.0, ans=0.2 2023-06-21 07:12:05,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=908454.0, ans=0.0 2023-06-21 07:12:26,259 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.910e+02 3.307e+02 3.988e+02 6.309e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-21 07:12:34,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=908514.0, ans=0.0 2023-06-21 07:12:41,818 INFO [train.py:996] (1/4) Epoch 5, batch 29450, loss[loss=0.2458, simple_loss=0.3203, pruned_loss=0.0856, over 21776.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3106, pruned_loss=0.0836, over 4264252.87 frames. ], batch size: 332, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:13:41,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=908754.0, ans=0.0 2023-06-21 07:14:21,516 INFO [train.py:996] (1/4) Epoch 5, batch 29500, loss[loss=0.2395, simple_loss=0.3057, pruned_loss=0.08667, over 21866.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3155, pruned_loss=0.08736, over 4274839.52 frames. ], batch size: 351, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:15:33,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-21 07:15:45,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 2.991e+02 3.477e+02 4.425e+02 6.921e+02, threshold=6.954e+02, percent-clipped=2.0 2023-06-21 07:15:46,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=909114.0, ans=0.0 2023-06-21 07:15:57,588 INFO [train.py:996] (1/4) Epoch 5, batch 29550, loss[loss=0.2663, simple_loss=0.3161, pruned_loss=0.1082, over 21408.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3163, pruned_loss=0.08989, over 4284127.02 frames. 
], batch size: 194, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:16:25,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=909234.0, ans=0.2 2023-06-21 07:16:34,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-21 07:17:30,235 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:17:42,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=909474.0, ans=0.5 2023-06-21 07:17:43,933 INFO [train.py:996] (1/4) Epoch 5, batch 29600, loss[loss=0.2767, simple_loss=0.3609, pruned_loss=0.09628, over 21719.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3237, pruned_loss=0.09341, over 4292379.03 frames. ], batch size: 298, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:18:13,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=8.0 2023-06-21 07:18:17,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=909534.0, ans=0.0 2023-06-21 07:18:56,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=909654.0, ans=0.0 2023-06-21 07:19:01,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=909654.0, ans=0.125 2023-06-21 07:19:08,632 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.750e+02 3.191e+02 3.922e+02 6.335e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-21 07:19:28,006 INFO [train.py:996] (1/4) Epoch 5, batch 29650, loss[loss=0.2118, simple_loss=0.2854, pruned_loss=0.06911, over 21104.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3213, pruned_loss=0.09009, over 4281780.77 frames. ], batch size: 608, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:19:36,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909774.0, ans=0.1 2023-06-21 07:19:37,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=909774.0, ans=0.035 2023-06-21 07:19:54,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=909834.0, ans=0.125 2023-06-21 07:19:57,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=909834.0, ans=0.0 2023-06-21 07:21:09,368 INFO [train.py:996] (1/4) Epoch 5, batch 29700, loss[loss=0.2646, simple_loss=0.3698, pruned_loss=0.07968, over 21813.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3226, pruned_loss=0.09006, over 4281914.96 frames. ], batch size: 282, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:21:24,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=910134.0, ans=0.0 2023-06-21 07:21:31,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-21 07:21:31,811 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
limit=22.5 2023-06-21 07:22:18,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=910254.0, ans=0.125 2023-06-21 07:22:29,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.944e+02 3.350e+02 4.483e+02 9.156e+02, threshold=6.700e+02, percent-clipped=7.0 2023-06-21 07:22:45,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-21 07:22:50,069 INFO [train.py:996] (1/4) Epoch 5, batch 29750, loss[loss=0.2386, simple_loss=0.3224, pruned_loss=0.0774, over 21371.00 frames. ], tot_loss[loss=0.253, simple_loss=0.327, pruned_loss=0.08944, over 4280749.21 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:22:51,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=910374.0, ans=0.125 2023-06-21 07:23:06,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-21 07:23:15,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-21 07:23:54,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=910554.0, ans=0.0 2023-06-21 07:23:58,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-21 07:24:31,244 INFO [train.py:996] (1/4) Epoch 5, batch 29800, loss[loss=0.2811, simple_loss=0.3501, pruned_loss=0.1061, over 21501.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3281, pruned_loss=0.09007, over 4279853.83 frames. ], batch size: 131, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:25:21,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=910794.0, ans=0.04949747468305833 2023-06-21 07:25:51,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=910914.0, ans=0.125 2023-06-21 07:25:55,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=910914.0, ans=0.1 2023-06-21 07:25:56,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.642e+02 3.044e+02 3.598e+02 6.041e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 07:26:11,928 INFO [train.py:996] (1/4) Epoch 5, batch 29850, loss[loss=0.2449, simple_loss=0.3152, pruned_loss=0.08727, over 21818.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3238, pruned_loss=0.08806, over 4275518.95 frames. ], batch size: 118, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:26:14,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=910974.0, ans=0.125 2023-06-21 07:26:31,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=911034.0, ans=0.04949747468305833 2023-06-21 07:26:50,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. 
limit=15.0 2023-06-21 07:27:04,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911154.0, ans=0.1 2023-06-21 07:27:18,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=911154.0, ans=0.125 2023-06-21 07:27:18,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911154.0, ans=0.1 2023-06-21 07:27:39,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=911214.0, ans=0.0 2023-06-21 07:27:53,654 INFO [train.py:996] (1/4) Epoch 5, batch 29900, loss[loss=0.2937, simple_loss=0.3494, pruned_loss=0.119, over 21328.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.322, pruned_loss=0.08919, over 4270053.50 frames. ], batch size: 159, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:28:05,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=911274.0, ans=0.1 2023-06-21 07:28:23,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=911334.0, ans=0.125 2023-06-21 07:29:22,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.918e+02 3.287e+02 3.972e+02 7.501e+02, threshold=6.574e+02, percent-clipped=2.0 2023-06-21 07:29:34,453 INFO [train.py:996] (1/4) Epoch 5, batch 29950, loss[loss=0.2799, simple_loss=0.348, pruned_loss=0.1059, over 21243.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3262, pruned_loss=0.09366, over 4275004.13 frames. ], batch size: 143, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:29:48,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=911574.0, ans=0.125 2023-06-21 07:30:42,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=911754.0, ans=0.125 2023-06-21 07:30:45,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=911754.0, ans=0.2 2023-06-21 07:31:03,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=911814.0, ans=0.125 2023-06-21 07:31:16,736 INFO [train.py:996] (1/4) Epoch 5, batch 30000, loss[loss=0.2581, simple_loss=0.3501, pruned_loss=0.08307, over 21818.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3284, pruned_loss=0.09499, over 4275667.80 frames. ], batch size: 371, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:31:16,737 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 07:31:38,130 INFO [train.py:1028] (1/4) Epoch 5, validation: loss=0.2485, simple_loss=0.3493, pruned_loss=0.0739, over 1796401.00 frames. 2023-06-21 07:31:38,132 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 07:32:34,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.40 vs. 
limit=15.0 2023-06-21 07:32:36,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=911994.0, ans=0.0 2023-06-21 07:33:14,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.033e+02 3.680e+02 4.795e+02 8.556e+02, threshold=7.360e+02, percent-clipped=8.0 2023-06-21 07:33:36,630 INFO [train.py:996] (1/4) Epoch 5, batch 30050, loss[loss=0.2642, simple_loss=0.3522, pruned_loss=0.08813, over 21653.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3316, pruned_loss=0.09078, over 4264144.52 frames. ], batch size: 263, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:33:53,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-21 07:34:44,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=912354.0, ans=0.1 2023-06-21 07:34:48,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-21 07:34:51,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-21 07:35:16,533 INFO [train.py:996] (1/4) Epoch 5, batch 30100, loss[loss=0.2303, simple_loss=0.2874, pruned_loss=0.08663, over 21528.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3303, pruned_loss=0.09079, over 4256091.89 frames. ], batch size: 414, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:36:00,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=912594.0, ans=0.1 2023-06-21 07:36:06,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=912594.0, ans=0.1 2023-06-21 07:36:30,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=912714.0, ans=0.02 2023-06-21 07:36:31,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=912714.0, ans=0.1 2023-06-21 07:36:40,813 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.174e+02 3.797e+02 4.451e+02 9.370e+02, threshold=7.593e+02, percent-clipped=1.0 2023-06-21 07:36:49,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=912714.0, ans=0.125 2023-06-21 07:37:02,693 INFO [train.py:996] (1/4) Epoch 5, batch 30150, loss[loss=0.2703, simple_loss=0.3351, pruned_loss=0.1027, over 21668.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3287, pruned_loss=0.09272, over 4258131.87 frames. ], batch size: 351, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:37:25,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=912834.0, ans=0.125 2023-06-21 07:38:45,290 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.20 vs. limit=15.0 2023-06-21 07:38:45,869 INFO [train.py:996] (1/4) Epoch 5, batch 30200, loss[loss=0.2467, simple_loss=0.3466, pruned_loss=0.07337, over 21596.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3304, pruned_loss=0.09152, over 4251445.65 frames. 
], batch size: 414, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:39:39,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=913194.0, ans=0.0 2023-06-21 07:39:51,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=913254.0, ans=0.0 2023-06-21 07:40:17,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.970e+02 3.553e+02 4.438e+02 6.781e+02, threshold=7.107e+02, percent-clipped=0.0 2023-06-21 07:40:28,568 INFO [train.py:996] (1/4) Epoch 5, batch 30250, loss[loss=0.2685, simple_loss=0.36, pruned_loss=0.08848, over 21260.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3392, pruned_loss=0.0944, over 4253073.54 frames. ], batch size: 176, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:40:39,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=913374.0, ans=0.0 2023-06-21 07:40:40,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=913374.0, ans=0.0 2023-06-21 07:40:42,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=913374.0, ans=0.125 2023-06-21 07:40:43,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=913374.0, ans=0.2 2023-06-21 07:41:11,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=913494.0, ans=0.125 2023-06-21 07:41:48,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=913554.0, ans=0.125 2023-06-21 07:42:08,261 INFO [train.py:996] (1/4) Epoch 5, batch 30300, loss[loss=0.2036, simple_loss=0.2696, pruned_loss=0.0688, over 21608.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3368, pruned_loss=0.09462, over 4251450.69 frames. ], batch size: 264, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:42:49,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=913794.0, ans=0.125 2023-06-21 07:42:59,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=913794.0, ans=0.125 2023-06-21 07:43:14,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=913854.0, ans=0.0 2023-06-21 07:43:34,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.375e+02 4.066e+02 5.117e+02 7.478e+02, threshold=8.132e+02, percent-clipped=2.0 2023-06-21 07:43:51,171 INFO [train.py:996] (1/4) Epoch 5, batch 30350, loss[loss=0.2484, simple_loss=0.3286, pruned_loss=0.08413, over 21784.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3348, pruned_loss=0.09492, over 4244280.66 frames. 
], batch size: 333, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:43:51,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=913974.0, ans=0.0 2023-06-21 07:44:15,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=914034.0, ans=0.125 2023-06-21 07:44:27,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=914094.0, ans=0.0 2023-06-21 07:44:56,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-21 07:45:12,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-21 07:45:20,100 INFO [train.py:996] (1/4) Epoch 5, batch 30400, loss[loss=0.2397, simple_loss=0.292, pruned_loss=0.09372, over 20230.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3273, pruned_loss=0.09273, over 4226611.86 frames. ], batch size: 703, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:45:42,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=914334.0, ans=0.0 2023-06-21 07:45:52,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-21 07:46:09,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=914454.0, ans=0.0 2023-06-21 07:46:16,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=914454.0, ans=0.0 2023-06-21 07:46:35,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.783e+02 4.866e+02 6.156e+02 1.756e+03, threshold=9.731e+02, percent-clipped=9.0 2023-06-21 07:46:46,022 INFO [train.py:996] (1/4) Epoch 5, batch 30450, loss[loss=0.3278, simple_loss=0.4465, pruned_loss=0.1046, over 19741.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3297, pruned_loss=0.09262, over 4174535.13 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:46:56,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=914574.0, ans=0.0 2023-06-21 07:46:58,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=914574.0, ans=0.2 2023-06-21 07:47:41,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914754.0, ans=0.1 2023-06-21 07:47:51,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=914814.0, ans=0.2 2023-06-21 07:49:37,060 INFO [train.py:996] (1/4) Epoch 6, batch 0, loss[loss=0.2497, simple_loss=0.3094, pruned_loss=0.095, over 21863.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3094, pruned_loss=0.095, over 21863.00 frames. ], batch size: 373, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:49:37,060 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 07:49:52,703 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2457, simple_loss=0.3531, pruned_loss=0.06922, over 1796401.00 frames. 
2023-06-21 07:49:52,704 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 07:50:12,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-21 07:50:15,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-21 07:50:22,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=914898.0, ans=0.125 2023-06-21 07:50:31,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=914898.0, ans=0.125 2023-06-21 07:51:04,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=915018.0, ans=0.125 2023-06-21 07:51:27,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.626e+02 5.784e+02 9.951e+02 2.861e+03, threshold=1.157e+03, percent-clipped=26.0 2023-06-21 07:51:28,594 INFO [train.py:996] (1/4) Epoch 6, batch 50, loss[loss=0.2865, simple_loss=0.3606, pruned_loss=0.1062, over 21869.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3346, pruned_loss=0.09276, over 961591.80 frames. ], batch size: 124, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:52:57,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=915378.0, ans=0.0 2023-06-21 07:52:57,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=915378.0, ans=0.0 2023-06-21 07:52:59,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=915378.0, ans=0.0 2023-06-21 07:53:05,782 INFO [train.py:996] (1/4) Epoch 6, batch 100, loss[loss=0.27, simple_loss=0.3596, pruned_loss=0.09023, over 21565.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3488, pruned_loss=0.09475, over 1691452.74 frames. ], batch size: 230, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:53:12,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=915438.0, ans=0.125 2023-06-21 07:53:14,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=915438.0, ans=0.0 2023-06-21 07:53:35,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-21 07:53:59,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=915558.0, ans=0.1 2023-06-21 07:54:10,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=915618.0, ans=0.2 2023-06-21 07:54:41,352 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.736e+02 3.116e+02 3.564e+02 7.052e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-21 07:54:42,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-21 07:54:42,879 INFO [train.py:996] (1/4) Epoch 6, batch 150, loss[loss=0.2413, simple_loss=0.3311, pruned_loss=0.07576, over 19823.00 frames. 
], tot_loss[loss=0.2718, simple_loss=0.3518, pruned_loss=0.0959, over 2257533.85 frames. ], batch size: 702, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:55:02,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-21 07:55:36,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=915858.0, ans=0.125 2023-06-21 07:55:55,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=915918.0, ans=0.0 2023-06-21 07:56:16,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-21 07:56:22,164 INFO [train.py:996] (1/4) Epoch 6, batch 200, loss[loss=0.2123, simple_loss=0.2872, pruned_loss=0.06873, over 21251.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3477, pruned_loss=0.09582, over 2707304.73 frames. ], batch size: 159, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:56:30,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-21 07:56:39,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=916038.0, ans=0.125 2023-06-21 07:57:00,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-21 07:58:01,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.028e+02 3.538e+02 4.112e+02 1.174e+03, threshold=7.076e+02, percent-clipped=8.0 2023-06-21 07:58:01,377 INFO [train.py:996] (1/4) Epoch 6, batch 250, loss[loss=0.2456, simple_loss=0.3252, pruned_loss=0.08302, over 20699.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3409, pruned_loss=0.09377, over 3055560.44 frames. ], batch size: 607, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:58:41,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=916458.0, ans=0.125 2023-06-21 07:59:02,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=916518.0, ans=0.05 2023-06-21 07:59:39,954 INFO [train.py:996] (1/4) Epoch 6, batch 300, loss[loss=0.2709, simple_loss=0.3423, pruned_loss=0.09981, over 21807.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3353, pruned_loss=0.09399, over 3325909.12 frames. ], batch size: 298, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:00:02,908 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:00:17,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=916698.0, ans=0.125 2023-06-21 08:00:20,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=916758.0, ans=0.125 2023-06-21 08:00:39,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=916758.0, ans=0.2 2023-06-21 08:00:51,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. 
limit=12.0 2023-06-21 08:01:06,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-21 08:01:20,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.010e+02 3.563e+02 4.495e+02 6.815e+02, threshold=7.126e+02, percent-clipped=0.0 2023-06-21 08:01:20,533 INFO [train.py:996] (1/4) Epoch 6, batch 350, loss[loss=0.2129, simple_loss=0.2799, pruned_loss=0.07292, over 21446.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3296, pruned_loss=0.09199, over 3543455.34 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:01:21,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=916938.0, ans=0.125 2023-06-21 08:01:37,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-21 08:01:41,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=916998.0, ans=0.0 2023-06-21 08:01:43,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 08:01:52,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916998.0, ans=0.1 2023-06-21 08:02:15,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-21 08:02:18,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=917058.0, ans=0.0 2023-06-21 08:02:20,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-21 08:02:20,555 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-21 08:02:37,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=917118.0, ans=0.2 2023-06-21 08:02:58,464 INFO [train.py:996] (1/4) Epoch 6, batch 400, loss[loss=0.2676, simple_loss=0.3595, pruned_loss=0.08788, over 21381.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3244, pruned_loss=0.0905, over 3709746.82 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:03:37,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=917298.0, ans=0.0 2023-06-21 08:03:45,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=917358.0, ans=0.0 2023-06-21 08:04:36,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.827e+02 3.421e+02 4.074e+02 6.754e+02, threshold=6.843e+02, percent-clipped=0.0 2023-06-21 08:04:36,511 INFO [train.py:996] (1/4) Epoch 6, batch 450, loss[loss=0.2995, simple_loss=0.3308, pruned_loss=0.1341, over 21397.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3202, pruned_loss=0.08872, over 3835949.48 frames. 
], batch size: 509, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:04:43,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=917538.0, ans=10.0 2023-06-21 08:05:15,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=917598.0, ans=0.125 2023-06-21 08:05:26,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=917658.0, ans=0.125 2023-06-21 08:05:37,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=917658.0, ans=0.125 2023-06-21 08:06:07,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=917778.0, ans=0.125 2023-06-21 08:06:18,058 INFO [train.py:996] (1/4) Epoch 6, batch 500, loss[loss=0.2145, simple_loss=0.2852, pruned_loss=0.07191, over 20791.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.321, pruned_loss=0.08841, over 3934692.25 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:06:18,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=917838.0, ans=0.125 2023-06-21 08:07:07,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-21 08:07:24,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=918018.0, ans=0.0 2023-06-21 08:07:30,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=918018.0, ans=0.0 2023-06-21 08:07:48,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=918078.0, ans=0.0 2023-06-21 08:07:50,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918138.0, ans=0.125 2023-06-21 08:07:51,176 INFO [train.py:996] (1/4) Epoch 6, batch 550, loss[loss=0.2093, simple_loss=0.3094, pruned_loss=0.05464, over 21345.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3203, pruned_loss=0.0866, over 4014236.11 frames. ], batch size: 211, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:07:57,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.995e+02 3.563e+02 4.699e+02 8.861e+02, threshold=7.125e+02, percent-clipped=10.0 2023-06-21 08:08:22,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918198.0, ans=0.1 2023-06-21 08:08:35,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=918258.0, ans=0.0 2023-06-21 08:09:13,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-21 08:09:31,310 INFO [train.py:996] (1/4) Epoch 6, batch 600, loss[loss=0.2506, simple_loss=0.3423, pruned_loss=0.07945, over 21641.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3269, pruned_loss=0.08681, over 4078045.42 frames. 
], batch size: 263, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:10:02,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-21 08:10:05,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=918498.0, ans=0.2 2023-06-21 08:10:31,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918558.0, ans=0.1 2023-06-21 08:10:43,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-21 08:10:48,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=918678.0, ans=0.125 2023-06-21 08:10:51,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=918678.0, ans=0.125 2023-06-21 08:11:06,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=918678.0, ans=0.125 2023-06-21 08:11:08,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=918738.0, ans=0.0 2023-06-21 08:11:09,689 INFO [train.py:996] (1/4) Epoch 6, batch 650, loss[loss=0.2485, simple_loss=0.3117, pruned_loss=0.09267, over 21836.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3277, pruned_loss=0.08778, over 4112866.37 frames. ], batch size: 351, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:11:11,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.881e+02 3.396e+02 3.907e+02 7.469e+02, threshold=6.792e+02, percent-clipped=1.0 2023-06-21 08:11:17,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-21 08:11:38,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=918798.0, ans=0.0 2023-06-21 08:11:41,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=918798.0, ans=0.0 2023-06-21 08:11:47,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=918798.0, ans=0.125 2023-06-21 08:12:42,307 INFO [train.py:996] (1/4) Epoch 6, batch 700, loss[loss=0.254, simple_loss=0.3414, pruned_loss=0.08324, over 21724.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3272, pruned_loss=0.08812, over 4152181.50 frames. ], batch size: 332, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:12:50,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=919038.0, ans=0.2 2023-06-21 08:12:56,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=919038.0, ans=0.125 2023-06-21 08:14:20,546 INFO [train.py:996] (1/4) Epoch 6, batch 750, loss[loss=0.245, simple_loss=0.308, pruned_loss=0.09096, over 21872.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3255, pruned_loss=0.08887, over 4186038.41 frames. 
], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:14:26,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.276e+02 4.088e+02 4.962e+02 1.159e+03, threshold=8.176e+02, percent-clipped=5.0 2023-06-21 08:14:44,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.31 vs. limit=22.5 2023-06-21 08:14:47,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919338.0, ans=0.1 2023-06-21 08:15:29,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-21 08:15:55,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=919578.0, ans=0.125 2023-06-21 08:15:58,056 INFO [train.py:996] (1/4) Epoch 6, batch 800, loss[loss=0.2261, simple_loss=0.2953, pruned_loss=0.0785, over 21804.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3238, pruned_loss=0.09002, over 4203544.65 frames. ], batch size: 118, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:16:01,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=919638.0, ans=0.2 2023-06-21 08:16:29,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=919698.0, ans=0.125 2023-06-21 08:16:35,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=919698.0, ans=0.125 2023-06-21 08:16:37,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=919698.0, ans=0.125 2023-06-21 08:16:40,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=919698.0, ans=0.125 2023-06-21 08:17:35,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=919878.0, ans=0.125 2023-06-21 08:17:38,688 INFO [train.py:996] (1/4) Epoch 6, batch 850, loss[loss=0.2289, simple_loss=0.2893, pruned_loss=0.0842, over 22002.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3204, pruned_loss=0.08971, over 4226402.31 frames. ], batch size: 103, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:17:39,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=919938.0, ans=0.125 2023-06-21 08:17:40,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 2.947e+02 3.491e+02 3.933e+02 7.622e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-21 08:18:47,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=920118.0, ans=0.1 2023-06-21 08:19:00,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=920178.0, ans=0.2 2023-06-21 08:19:21,793 INFO [train.py:996] (1/4) Epoch 6, batch 900, loss[loss=0.2081, simple_loss=0.2812, pruned_loss=0.06748, over 21824.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.317, pruned_loss=0.08904, over 4241843.64 frames. 
], batch size: 124, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:19:41,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=920238.0, ans=0.125 2023-06-21 08:19:43,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=920238.0, ans=0.0 2023-06-21 08:19:50,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-21 08:20:40,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-06-21 08:21:05,281 INFO [train.py:996] (1/4) Epoch 6, batch 950, loss[loss=0.2364, simple_loss=0.296, pruned_loss=0.08844, over 21354.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3158, pruned_loss=0.08899, over 4259347.09 frames. ], batch size: 159, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:21:06,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.884e+02 3.289e+02 4.152e+02 6.570e+02, threshold=6.579e+02, percent-clipped=0.0 2023-06-21 08:21:23,106 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:21:24,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920598.0, ans=0.1 2023-06-21 08:21:25,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-21 08:21:37,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=920658.0, ans=0.125 2023-06-21 08:22:30,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=920778.0, ans=0.07 2023-06-21 08:22:38,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=920838.0, ans=0.125 2023-06-21 08:22:39,420 INFO [train.py:996] (1/4) Epoch 6, batch 1000, loss[loss=0.2597, simple_loss=0.3331, pruned_loss=0.09313, over 21737.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3168, pruned_loss=0.0897, over 4269237.76 frames. ], batch size: 389, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:23:03,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-21 08:23:44,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-21 08:23:45,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=921018.0, ans=0.0 2023-06-21 08:24:04,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=921078.0, ans=0.125 2023-06-21 08:24:12,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=921138.0, ans=0.0 2023-06-21 08:24:13,760 INFO [train.py:996] (1/4) Epoch 6, batch 1050, loss[loss=0.2569, simple_loss=0.3198, pruned_loss=0.097, over 21521.00 frames. 
], tot_loss[loss=0.2509, simple_loss=0.3204, pruned_loss=0.09073, over 4275208.05 frames. ], batch size: 212, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:24:15,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.022e+02 3.396e+02 3.710e+02 5.985e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-21 08:24:19,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=921138.0, ans=0.125 2023-06-21 08:24:31,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=921198.0, ans=0.125 2023-06-21 08:25:13,185 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:25:21,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=921378.0, ans=0.0 2023-06-21 08:25:41,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=921378.0, ans=0.125 2023-06-21 08:25:48,845 INFO [train.py:996] (1/4) Epoch 6, batch 1100, loss[loss=0.2252, simple_loss=0.303, pruned_loss=0.07369, over 21518.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3197, pruned_loss=0.08881, over 4280856.53 frames. ], batch size: 548, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:26:19,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-21 08:27:21,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=921678.0, ans=0.125 2023-06-21 08:27:25,422 INFO [train.py:996] (1/4) Epoch 6, batch 1150, loss[loss=0.2408, simple_loss=0.3233, pruned_loss=0.07916, over 21738.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3191, pruned_loss=0.08833, over 4285343.70 frames. ], batch size: 282, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:27:28,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.136e+02 3.809e+02 5.209e+02 8.344e+02, threshold=7.619e+02, percent-clipped=5.0 2023-06-21 08:27:49,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-21 08:27:54,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=921798.0, ans=0.125 2023-06-21 08:28:07,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=921858.0, ans=0.125 2023-06-21 08:28:45,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=921918.0, ans=0.035 2023-06-21 08:28:48,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.46 vs. 
limit=15.0 2023-06-21 08:28:50,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=921978.0, ans=0.05 2023-06-21 08:28:51,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=921978.0, ans=0.1 2023-06-21 08:29:05,560 INFO [train.py:996] (1/4) Epoch 6, batch 1200, loss[loss=0.2136, simple_loss=0.2963, pruned_loss=0.06543, over 21834.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3216, pruned_loss=0.09024, over 4280909.10 frames. ], batch size: 298, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:29:05,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=922038.0, ans=0.125 2023-06-21 08:29:30,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=922098.0, ans=0.125 2023-06-21 08:29:37,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-21 08:30:22,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=922218.0, ans=0.09899494936611666 2023-06-21 08:30:30,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=922278.0, ans=0.05 2023-06-21 08:30:43,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=922338.0, ans=0.125 2023-06-21 08:30:44,982 INFO [train.py:996] (1/4) Epoch 6, batch 1250, loss[loss=0.2583, simple_loss=0.3304, pruned_loss=0.0931, over 21898.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3252, pruned_loss=0.09243, over 4282763.61 frames. ], batch size: 118, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:30:47,955 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.807e+02 3.082e+02 3.703e+02 6.160e+02, threshold=6.164e+02, percent-clipped=0.0 2023-06-21 08:31:08,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=922398.0, ans=0.2 2023-06-21 08:31:16,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=922398.0, ans=0.125 2023-06-21 08:31:28,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-21 08:31:58,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=922518.0, ans=0.0 2023-06-21 08:32:20,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=922578.0, ans=0.1 2023-06-21 08:32:21,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-21 08:32:25,622 INFO [train.py:996] (1/4) Epoch 6, batch 1300, loss[loss=0.2469, simple_loss=0.3324, pruned_loss=0.08066, over 21788.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3258, pruned_loss=0.09216, over 4290747.59 frames. 
], batch size: 282, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:32:25,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=922638.0, ans=0.125 2023-06-21 08:32:52,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=922698.0, ans=0.125 2023-06-21 08:33:33,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=922818.0, ans=0.125 2023-06-21 08:33:35,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-21 08:33:49,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-21 08:33:52,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=922878.0, ans=0.0 2023-06-21 08:34:11,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=922938.0, ans=10.0 2023-06-21 08:34:12,153 INFO [train.py:996] (1/4) Epoch 6, batch 1350, loss[loss=0.2113, simple_loss=0.2749, pruned_loss=0.07391, over 21717.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3252, pruned_loss=0.09119, over 4288463.62 frames. ], batch size: 247, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:34:15,480 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.950e+02 3.402e+02 4.327e+02 7.422e+02, threshold=6.804e+02, percent-clipped=3.0 2023-06-21 08:34:26,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=922998.0, ans=0.05 2023-06-21 08:34:38,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=922998.0, ans=0.125 2023-06-21 08:34:52,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923058.0, ans=0.1 2023-06-21 08:34:54,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=923058.0, ans=0.07 2023-06-21 08:35:24,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=923118.0, ans=0.125 2023-06-21 08:35:32,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-21 08:35:50,712 INFO [train.py:996] (1/4) Epoch 6, batch 1400, loss[loss=0.2242, simple_loss=0.2762, pruned_loss=0.08608, over 21277.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3245, pruned_loss=0.09123, over 4283724.25 frames. 
], batch size: 177, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:35:51,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=923238.0, ans=0.0 2023-06-21 08:35:51,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=923238.0, ans=0.125 2023-06-21 08:36:09,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=923298.0, ans=0.125 2023-06-21 08:36:10,873 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:37:08,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-21 08:37:31,183 INFO [train.py:996] (1/4) Epoch 6, batch 1450, loss[loss=0.2171, simple_loss=0.2988, pruned_loss=0.06769, over 21361.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3232, pruned_loss=0.09095, over 4284340.80 frames. ], batch size: 211, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:37:37,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.855e+02 3.384e+02 3.937e+02 6.877e+02, threshold=6.768e+02, percent-clipped=1.0 2023-06-21 08:38:22,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=923658.0, ans=15.0 2023-06-21 08:38:57,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=923778.0, ans=0.0 2023-06-21 08:39:11,665 INFO [train.py:996] (1/4) Epoch 6, batch 1500, loss[loss=0.207, simple_loss=0.2844, pruned_loss=0.06479, over 20040.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3248, pruned_loss=0.09197, over 4292910.21 frames. ], batch size: 702, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:39:24,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=923838.0, ans=0.0 2023-06-21 08:39:43,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=923898.0, ans=0.0 2023-06-21 08:40:02,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=923958.0, ans=0.2 2023-06-21 08:40:53,814 INFO [train.py:996] (1/4) Epoch 6, batch 1550, loss[loss=0.2261, simple_loss=0.2893, pruned_loss=0.08146, over 21637.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3216, pruned_loss=0.09014, over 4285369.51 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:41:00,457 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.808e+02 3.171e+02 3.740e+02 6.860e+02, threshold=6.342e+02, percent-clipped=1.0 2023-06-21 08:41:02,647 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:42:20,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=924378.0, ans=0.05 2023-06-21 08:42:36,166 INFO [train.py:996] (1/4) Epoch 6, batch 1600, loss[loss=0.1548, simple_loss=0.1996, pruned_loss=0.05498, over 16689.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3201, pruned_loss=0.0895, over 4281906.12 frames. 
], batch size: 61, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:43:21,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=924498.0, ans=0.125 2023-06-21 08:43:48,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=924618.0, ans=0.015 2023-06-21 08:44:25,203 INFO [train.py:996] (1/4) Epoch 6, batch 1650, loss[loss=0.2107, simple_loss=0.2804, pruned_loss=0.07048, over 21834.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.318, pruned_loss=0.08895, over 4275428.06 frames. ], batch size: 107, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:44:29,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-21 08:44:31,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.183e+02 3.962e+02 4.475e+02 7.912e+02, threshold=7.925e+02, percent-clipped=6.0 2023-06-21 08:45:38,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-21 08:46:07,213 INFO [train.py:996] (1/4) Epoch 6, batch 1700, loss[loss=0.2172, simple_loss=0.307, pruned_loss=0.0637, over 21607.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3229, pruned_loss=0.09046, over 4272411.28 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:46:17,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=925038.0, ans=0.125 2023-06-21 08:47:25,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=925278.0, ans=0.125 2023-06-21 08:47:54,707 INFO [train.py:996] (1/4) Epoch 6, batch 1750, loss[loss=0.3573, simple_loss=0.4267, pruned_loss=0.144, over 21521.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3233, pruned_loss=0.08877, over 4277671.77 frames. ], batch size: 471, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:48:05,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.133e+02 3.705e+02 4.363e+02 7.096e+02, threshold=7.410e+02, percent-clipped=0.0 2023-06-21 08:48:43,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=925458.0, ans=0.2 2023-06-21 08:49:10,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=925518.0, ans=0.0 2023-06-21 08:49:10,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=925518.0, ans=0.1 2023-06-21 08:49:10,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=925518.0, ans=0.05 2023-06-21 08:49:10,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=925518.0, ans=0.125 2023-06-21 08:49:37,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-21 08:49:43,003 INFO [train.py:996] (1/4) Epoch 6, batch 1800, loss[loss=0.1819, simple_loss=0.2653, pruned_loss=0.04926, over 21616.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3216, pruned_loss=0.08692, over 4269144.94 frames. 
], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:50:24,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925758.0, ans=0.1 2023-06-21 08:50:26,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-21 08:51:23,699 INFO [train.py:996] (1/4) Epoch 6, batch 1850, loss[loss=0.246, simple_loss=0.3245, pruned_loss=0.08369, over 21930.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3219, pruned_loss=0.08496, over 4256682.27 frames. ], batch size: 316, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:51:30,091 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.862e+02 3.405e+02 4.274e+02 8.543e+02, threshold=6.809e+02, percent-clipped=2.0 2023-06-21 08:52:08,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=926058.0, ans=0.0 2023-06-21 08:53:02,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=926178.0, ans=0.125 2023-06-21 08:53:05,023 INFO [train.py:996] (1/4) Epoch 6, batch 1900, loss[loss=0.1986, simple_loss=0.2831, pruned_loss=0.057, over 21773.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.321, pruned_loss=0.08459, over 4263482.03 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:53:23,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=926238.0, ans=0.0 2023-06-21 08:53:57,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-21 08:54:05,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=926418.0, ans=0.035 2023-06-21 08:54:07,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=926418.0, ans=0.125 2023-06-21 08:54:12,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-21 08:54:38,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926478.0, ans=0.1 2023-06-21 08:54:48,415 INFO [train.py:996] (1/4) Epoch 6, batch 1950, loss[loss=0.2293, simple_loss=0.282, pruned_loss=0.08828, over 21633.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3163, pruned_loss=0.08358, over 4268347.65 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:54:55,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.902e+02 3.428e+02 4.161e+02 7.529e+02, threshold=6.855e+02, percent-clipped=4.0 2023-06-21 08:55:05,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926538.0, ans=0.1 2023-06-21 08:55:19,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=926598.0, ans=0.125 2023-06-21 08:55:58,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. 
limit=6.0 2023-06-21 08:56:19,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926778.0, ans=0.125 2023-06-21 08:56:27,449 INFO [train.py:996] (1/4) Epoch 6, batch 2000, loss[loss=0.2718, simple_loss=0.3688, pruned_loss=0.08745, over 21626.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3174, pruned_loss=0.08393, over 4266885.10 frames. ], batch size: 414, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:56:57,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=926898.0, ans=0.0 2023-06-21 08:57:33,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=927018.0, ans=0.125 2023-06-21 08:58:02,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=927078.0, ans=0.125 2023-06-21 08:58:08,458 INFO [train.py:996] (1/4) Epoch 6, batch 2050, loss[loss=0.217, simple_loss=0.293, pruned_loss=0.07048, over 21556.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3148, pruned_loss=0.08334, over 4267312.89 frames. ], batch size: 131, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:58:19,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.151e+02 3.657e+02 4.300e+02 8.922e+02, threshold=7.314e+02, percent-clipped=4.0 2023-06-21 08:59:14,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=927318.0, ans=0.015 2023-06-21 08:59:42,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-21 08:59:49,713 INFO [train.py:996] (1/4) Epoch 6, batch 2100, loss[loss=0.2521, simple_loss=0.3311, pruned_loss=0.08657, over 21802.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3178, pruned_loss=0.086, over 4273679.05 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:00:22,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-21 09:00:31,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=927558.0, ans=0.1 2023-06-21 09:00:54,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=927618.0, ans=0.125 2023-06-21 09:01:08,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. limit=6.0 2023-06-21 09:01:16,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-21 09:01:17,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=927678.0, ans=0.0 2023-06-21 09:01:23,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=927678.0, ans=0.125 2023-06-21 09:01:31,390 INFO [train.py:996] (1/4) Epoch 6, batch 2150, loss[loss=0.246, simple_loss=0.3293, pruned_loss=0.08137, over 21423.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.319, pruned_loss=0.08776, over 4273655.56 frames. 
], batch size: 211, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:01:43,410 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.848e+02 3.301e+02 4.038e+02 6.672e+02, threshold=6.603e+02, percent-clipped=0.0 2023-06-21 09:01:55,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-21 09:01:58,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=927798.0, ans=0.0 2023-06-21 09:02:48,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=927918.0, ans=0.125 2023-06-21 09:03:13,672 INFO [train.py:996] (1/4) Epoch 6, batch 2200, loss[loss=0.1952, simple_loss=0.271, pruned_loss=0.05971, over 21172.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3212, pruned_loss=0.08861, over 4275691.86 frames. ], batch size: 143, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:04:14,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=928158.0, ans=0.07 2023-06-21 09:04:24,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2023-06-21 09:04:36,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-21 09:04:52,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=928338.0, ans=0.2 2023-06-21 09:04:53,480 INFO [train.py:996] (1/4) Epoch 6, batch 2250, loss[loss=0.24, simple_loss=0.2888, pruned_loss=0.09562, over 21157.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3172, pruned_loss=0.08666, over 4266352.91 frames. ], batch size: 159, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:05:06,967 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.750e+02 3.169e+02 3.694e+02 5.600e+02, threshold=6.338e+02, percent-clipped=0.0 2023-06-21 09:05:18,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=928398.0, ans=0.125 2023-06-21 09:05:28,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=928398.0, ans=0.125 2023-06-21 09:05:38,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=928458.0, ans=0.0 2023-06-21 09:05:58,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=928518.0, ans=0.125 2023-06-21 09:06:06,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-21 09:06:26,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928578.0, ans=0.1 2023-06-21 09:06:35,963 INFO [train.py:996] (1/4) Epoch 6, batch 2300, loss[loss=0.2163, simple_loss=0.2776, pruned_loss=0.07754, over 21662.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3145, pruned_loss=0.08605, over 4266898.70 frames. 
], batch size: 333, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:06:54,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=928698.0, ans=0.0 2023-06-21 09:06:59,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=928698.0, ans=0.125 2023-06-21 09:07:43,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=928818.0, ans=0.2 2023-06-21 09:08:14,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=928878.0, ans=0.125 2023-06-21 09:08:17,124 INFO [train.py:996] (1/4) Epoch 6, batch 2350, loss[loss=0.2378, simple_loss=0.2937, pruned_loss=0.09096, over 21244.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3132, pruned_loss=0.08617, over 4265303.92 frames. ], batch size: 144, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:08:25,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.345e+02 4.237e+02 6.014e+02 1.096e+03, threshold=8.474e+02, percent-clipped=18.0 2023-06-21 09:09:14,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=929058.0, ans=0.125 2023-06-21 09:09:16,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-21 09:09:37,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=929178.0, ans=0.0 2023-06-21 09:09:42,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=929178.0, ans=0.125 2023-06-21 09:09:50,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929178.0, ans=0.1 2023-06-21 09:09:55,370 INFO [train.py:996] (1/4) Epoch 6, batch 2400, loss[loss=0.2771, simple_loss=0.3516, pruned_loss=0.1013, over 20714.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.317, pruned_loss=0.08929, over 4276300.31 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:09:58,971 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:10:07,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=929238.0, ans=0.1 2023-06-21 09:10:07,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=12.0 2023-06-21 09:10:12,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=929298.0, ans=0.0 2023-06-21 09:11:33,343 INFO [train.py:996] (1/4) Epoch 6, batch 2450, loss[loss=0.2462, simple_loss=0.305, pruned_loss=0.09368, over 22031.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3185, pruned_loss=0.09106, over 4279101.01 frames. 
], batch size: 103, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:11:41,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.127e+02 3.688e+02 4.498e+02 8.076e+02, threshold=7.375e+02, percent-clipped=0.0 2023-06-21 09:11:45,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-21 09:12:23,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=929658.0, ans=0.2 2023-06-21 09:12:29,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-21 09:12:30,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=929658.0, ans=0.07 2023-06-21 09:12:34,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=929718.0, ans=0.125 2023-06-21 09:12:50,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=929778.0, ans=0.125 2023-06-21 09:13:13,611 INFO [train.py:996] (1/4) Epoch 6, batch 2500, loss[loss=0.2191, simple_loss=0.2785, pruned_loss=0.07988, over 15225.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3184, pruned_loss=0.09162, over 4262692.00 frames. ], batch size: 61, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:49,951 INFO [train.py:996] (1/4) Epoch 6, batch 2550, loss[loss=0.2285, simple_loss=0.2817, pruned_loss=0.08763, over 21463.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3181, pruned_loss=0.09029, over 4262528.89 frames. ], batch size: 211, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:58,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.790e+02 3.218e+02 3.631e+02 5.360e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 09:15:37,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-21 09:15:57,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=930318.0, ans=0.125 2023-06-21 09:16:24,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-21 09:16:28,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=930378.0, ans=0.125 2023-06-21 09:16:31,227 INFO [train.py:996] (1/4) Epoch 6, batch 2600, loss[loss=0.2392, simple_loss=0.3007, pruned_loss=0.08885, over 21700.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3202, pruned_loss=0.09016, over 4265822.00 frames. ], batch size: 351, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:16:49,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=930498.0, ans=0.2 2023-06-21 09:16:54,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. 
limit=12.0 2023-06-21 09:17:12,247 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:17:22,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=930558.0, ans=0.125 2023-06-21 09:17:45,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=930678.0, ans=0.125 2023-06-21 09:18:09,157 INFO [train.py:996] (1/4) Epoch 6, batch 2650, loss[loss=0.3054, simple_loss=0.3655, pruned_loss=0.1227, over 21413.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3211, pruned_loss=0.09216, over 4270959.00 frames. ], batch size: 471, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:18:10,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-21 09:18:14,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=930738.0, ans=0.2 2023-06-21 09:18:16,901 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.039e+02 3.537e+02 4.396e+02 7.352e+02, threshold=7.074e+02, percent-clipped=6.0 2023-06-21 09:19:04,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=930858.0, ans=0.125 2023-06-21 09:19:12,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=930918.0, ans=0.04949747468305833 2023-06-21 09:19:26,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=930918.0, ans=0.125 2023-06-21 09:19:52,593 INFO [train.py:996] (1/4) Epoch 6, batch 2700, loss[loss=0.2361, simple_loss=0.2963, pruned_loss=0.08798, over 21610.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3195, pruned_loss=0.0914, over 4280792.06 frames. ], batch size: 230, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:34,651 INFO [train.py:996] (1/4) Epoch 6, batch 2750, loss[loss=0.2383, simple_loss=0.305, pruned_loss=0.08578, over 21813.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.319, pruned_loss=0.09041, over 4277317.86 frames. ], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:42,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.939e+02 3.495e+02 4.251e+02 6.748e+02, threshold=6.989e+02, percent-clipped=0.0 2023-06-21 09:21:50,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931398.0, ans=0.1 2023-06-21 09:22:05,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-21 09:23:20,456 INFO [train.py:996] (1/4) Epoch 6, batch 2800, loss[loss=0.2533, simple_loss=0.3047, pruned_loss=0.1009, over 21392.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3231, pruned_loss=0.09138, over 4276991.58 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:23:29,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=931638.0, ans=0.0 2023-06-21 09:24:08,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. 
limit=10.0 2023-06-21 09:24:27,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=931758.0, ans=0.125 2023-06-21 09:25:03,025 INFO [train.py:996] (1/4) Epoch 6, batch 2850, loss[loss=0.25, simple_loss=0.3177, pruned_loss=0.09115, over 21702.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3254, pruned_loss=0.09254, over 4278585.25 frames. ], batch size: 332, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:25:23,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.230e+02 3.892e+02 4.894e+02 8.283e+02, threshold=7.785e+02, percent-clipped=6.0 2023-06-21 09:25:46,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=931998.0, ans=0.0 2023-06-21 09:26:31,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932178.0, ans=0.1 2023-06-21 09:26:45,925 INFO [train.py:996] (1/4) Epoch 6, batch 2900, loss[loss=0.2698, simple_loss=0.3264, pruned_loss=0.1066, over 21377.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3197, pruned_loss=0.0911, over 4272702.69 frames. ], batch size: 159, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:27:48,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932418.0, ans=0.1 2023-06-21 09:27:49,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-21 09:27:58,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=932418.0, ans=0.125 2023-06-21 09:28:17,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932478.0, ans=0.1 2023-06-21 09:28:28,479 INFO [train.py:996] (1/4) Epoch 6, batch 2950, loss[loss=0.2983, simple_loss=0.3476, pruned_loss=0.1245, over 21773.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3213, pruned_loss=0.09144, over 4279110.56 frames. 
], batch size: 508, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:28:41,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=932538.0, ans=0.2 2023-06-21 09:28:41,900 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:28:42,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 2.961e+02 3.319e+02 4.000e+02 7.696e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-21 09:29:06,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932598.0, ans=0.1 2023-06-21 09:29:24,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=932658.0, ans=0.125 2023-06-21 09:29:29,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=932658.0, ans=0.0 2023-06-21 09:29:38,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=932718.0, ans=0.0 2023-06-21 09:30:13,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=932838.0, ans=0.05 2023-06-21 09:30:14,707 INFO [train.py:996] (1/4) Epoch 6, batch 3000, loss[loss=0.2872, simple_loss=0.3464, pruned_loss=0.1141, over 21248.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3258, pruned_loss=0.09239, over 4278630.92 frames. ], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:30:14,707 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 09:30:34,691 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.255, simple_loss=0.3481, pruned_loss=0.08099, over 1796401.00 frames. 2023-06-21 09:30:34,692 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 09:30:38,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932838.0, ans=0.1 2023-06-21 09:30:39,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-21 09:30:45,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=932838.0, ans=0.125 2023-06-21 09:30:50,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=932898.0, ans=0.125 2023-06-21 09:32:16,767 INFO [train.py:996] (1/4) Epoch 6, batch 3050, loss[loss=0.2378, simple_loss=0.3013, pruned_loss=0.08718, over 21671.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3274, pruned_loss=0.09148, over 4278256.46 frames. 
], batch size: 263, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:32:26,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.903e+02 3.413e+02 4.363e+02 7.333e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-21 09:32:34,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=933198.0, ans=0.125 2023-06-21 09:33:11,539 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:33:21,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=933318.0, ans=0.125 2023-06-21 09:33:58,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-21 09:33:59,309 INFO [train.py:996] (1/4) Epoch 6, batch 3100, loss[loss=0.2203, simple_loss=0.3021, pruned_loss=0.06924, over 21462.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3266, pruned_loss=0.09033, over 4276533.74 frames. ], batch size: 211, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:34:07,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=933438.0, ans=0.0 2023-06-21 09:34:32,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=933498.0, ans=0.125 2023-06-21 09:34:44,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-21 09:34:56,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=933618.0, ans=0.125 2023-06-21 09:35:00,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933618.0, ans=0.1 2023-06-21 09:35:03,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=933618.0, ans=0.125 2023-06-21 09:35:40,890 INFO [train.py:996] (1/4) Epoch 6, batch 3150, loss[loss=0.2581, simple_loss=0.3232, pruned_loss=0.09653, over 20695.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3286, pruned_loss=0.09154, over 4273804.71 frames. ], batch size: 607, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:35:42,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=933738.0, ans=0.0 2023-06-21 09:35:55,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.015e+02 3.533e+02 4.107e+02 6.510e+02, threshold=7.067e+02, percent-clipped=0.0 2023-06-21 09:36:32,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=933858.0, ans=0.1 2023-06-21 09:37:08,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=933978.0, ans=0.125 2023-06-21 09:37:22,843 INFO [train.py:996] (1/4) Epoch 6, batch 3200, loss[loss=0.2753, simple_loss=0.3229, pruned_loss=0.1139, over 21592.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3288, pruned_loss=0.09114, over 4273220.83 frames. 
], batch size: 548, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:38:15,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-21 09:39:08,400 INFO [train.py:996] (1/4) Epoch 6, batch 3250, loss[loss=0.2713, simple_loss=0.3215, pruned_loss=0.1106, over 21876.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3311, pruned_loss=0.09265, over 4276232.83 frames. ], batch size: 353, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:39:18,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.861e+02 3.274e+02 3.932e+02 7.956e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-21 09:39:20,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=934338.0, ans=0.0 2023-06-21 09:39:35,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=934398.0, ans=0.125 2023-06-21 09:39:59,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=934458.0, ans=0.0 2023-06-21 09:40:38,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=934578.0, ans=0.1 2023-06-21 09:40:38,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=934578.0, ans=0.0 2023-06-21 09:40:43,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=934578.0, ans=0.0 2023-06-21 09:40:49,787 INFO [train.py:996] (1/4) Epoch 6, batch 3300, loss[loss=0.2581, simple_loss=0.3335, pruned_loss=0.09141, over 21501.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3241, pruned_loss=0.09182, over 4265332.82 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:41:42,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=934758.0, ans=0.125 2023-06-21 09:41:47,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=934758.0, ans=0.125 2023-06-21 09:42:14,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=934878.0, ans=0.05 2023-06-21 09:42:26,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=934878.0, ans=0.0 2023-06-21 09:42:26,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-21 09:42:26,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.44 vs. limit=6.0 2023-06-21 09:42:30,706 INFO [train.py:996] (1/4) Epoch 6, batch 3350, loss[loss=0.2731, simple_loss=0.3421, pruned_loss=0.1021, over 21437.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3258, pruned_loss=0.09095, over 4269743.89 frames. ], batch size: 548, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:42:34,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. 
limit=15.0 2023-06-21 09:42:45,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.921e+02 3.413e+02 3.921e+02 6.338e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-21 09:44:17,665 INFO [train.py:996] (1/4) Epoch 6, batch 3400, loss[loss=0.2651, simple_loss=0.343, pruned_loss=0.09356, over 21067.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3269, pruned_loss=0.09211, over 4273117.05 frames. ], batch size: 607, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:44:54,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=935298.0, ans=0.0 2023-06-21 09:45:40,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=935478.0, ans=0.125 2023-06-21 09:46:04,884 INFO [train.py:996] (1/4) Epoch 6, batch 3450, loss[loss=0.2261, simple_loss=0.2922, pruned_loss=0.08005, over 21887.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3226, pruned_loss=0.09179, over 4279760.79 frames. ], batch size: 107, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:46:16,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.951e+02 3.358e+02 4.026e+02 6.824e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-21 09:46:28,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=935598.0, ans=0.1 2023-06-21 09:46:39,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=935598.0, ans=0.0 2023-06-21 09:47:01,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=935658.0, ans=0.125 2023-06-21 09:47:04,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-21 09:47:08,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=935718.0, ans=0.125 2023-06-21 09:47:38,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=935778.0, ans=0.0 2023-06-21 09:47:47,059 INFO [train.py:996] (1/4) Epoch 6, batch 3500, loss[loss=0.3457, simple_loss=0.3925, pruned_loss=0.1494, over 21276.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3333, pruned_loss=0.09613, over 4282890.14 frames. 
], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:48:10,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935898.0, ans=0.1 2023-06-21 09:48:14,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=935898.0, ans=0.125 2023-06-21 09:48:53,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:49:01,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:49:19,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=936078.0, ans=0.2 2023-06-21 09:49:20,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=936078.0, ans=0.07 2023-06-21 09:49:24,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=936078.0, ans=0.125 2023-06-21 09:49:28,507 INFO [train.py:996] (1/4) Epoch 6, batch 3550, loss[loss=0.313, simple_loss=0.3434, pruned_loss=0.1413, over 21373.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3358, pruned_loss=0.0979, over 4287865.76 frames. ], batch size: 508, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:49:44,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.119e+02 3.460e+02 4.086e+02 7.821e+02, threshold=6.921e+02, percent-clipped=5.0 2023-06-21 09:49:45,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-21 09:49:56,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936198.0, ans=0.1 2023-06-21 09:50:23,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=936258.0, ans=0.125 2023-06-21 09:51:13,817 INFO [train.py:996] (1/4) Epoch 6, batch 3600, loss[loss=0.2978, simple_loss=0.3516, pruned_loss=0.122, over 21560.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3299, pruned_loss=0.09712, over 4286584.06 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:52:21,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=936618.0, ans=0.04949747468305833 2023-06-21 09:52:30,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=936678.0, ans=0.125 2023-06-21 09:52:56,094 INFO [train.py:996] (1/4) Epoch 6, batch 3650, loss[loss=0.1711, simple_loss=0.2172, pruned_loss=0.06245, over 17260.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3299, pruned_loss=0.09652, over 4283632.64 frames. 
], batch size: 61, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:53:08,883 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 3.039e+02 3.609e+02 4.641e+02 6.973e+02, threshold=7.218e+02, percent-clipped=1.0 2023-06-21 09:53:22,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936798.0, ans=0.1 2023-06-21 09:53:55,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=936918.0, ans=0.2 2023-06-21 09:54:32,653 INFO [train.py:996] (1/4) Epoch 6, batch 3700, loss[loss=0.2592, simple_loss=0.3229, pruned_loss=0.09774, over 21543.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3281, pruned_loss=0.09547, over 4275706.69 frames. ], batch size: 131, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:54:50,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=937038.0, ans=0.0 2023-06-21 09:54:54,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.82 vs. limit=10.0 2023-06-21 09:55:01,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937098.0, ans=0.1 2023-06-21 09:55:19,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=937158.0, ans=0.125 2023-06-21 09:55:21,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937158.0, ans=0.1 2023-06-21 09:56:17,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=937338.0, ans=0.025 2023-06-21 09:56:18,862 INFO [train.py:996] (1/4) Epoch 6, batch 3750, loss[loss=0.1865, simple_loss=0.2626, pruned_loss=0.05521, over 21600.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3262, pruned_loss=0.09488, over 4282492.74 frames. ], batch size: 195, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:56:31,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.989e+02 3.545e+02 4.107e+02 7.890e+02, threshold=7.090e+02, percent-clipped=2.0 2023-06-21 09:57:27,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=937518.0, ans=0.1 2023-06-21 09:58:01,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-21 09:58:01,402 INFO [train.py:996] (1/4) Epoch 6, batch 3800, loss[loss=0.2494, simple_loss=0.3491, pruned_loss=0.07483, over 21197.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3234, pruned_loss=0.09319, over 4273286.46 frames. ], batch size: 548, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:58:13,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=937638.0, ans=0.125 2023-06-21 09:59:40,062 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-21 09:59:42,391 INFO [train.py:996] (1/4) Epoch 6, batch 3850, loss[loss=0.3257, simple_loss=0.4389, pruned_loss=0.1062, over 19826.00 frames. 
], tot_loss[loss=0.2568, simple_loss=0.3245, pruned_loss=0.09451, over 4261398.62 frames. ], batch size: 702, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:59:55,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 3.412e+02 4.254e+02 5.791e+02 1.316e+03, threshold=8.507e+02, percent-clipped=12.0 2023-06-21 10:00:27,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=938058.0, ans=0.0 2023-06-21 10:00:33,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=938058.0, ans=0.0 2023-06-21 10:01:04,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938118.0, ans=0.1 2023-06-21 10:01:09,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938178.0, ans=0.1 2023-06-21 10:01:23,209 INFO [train.py:996] (1/4) Epoch 6, batch 3900, loss[loss=0.2482, simple_loss=0.3136, pruned_loss=0.09143, over 21893.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3199, pruned_loss=0.09364, over 4262137.04 frames. ], batch size: 118, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:01:27,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=938238.0, ans=0.0 2023-06-21 10:01:40,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938298.0, ans=0.1 2023-06-21 10:02:07,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938358.0, ans=0.125 2023-06-21 10:02:39,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=938418.0, ans=0.0 2023-06-21 10:02:49,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-21 10:03:04,364 INFO [train.py:996] (1/4) Epoch 6, batch 3950, loss[loss=0.192, simple_loss=0.2659, pruned_loss=0.05902, over 21435.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3205, pruned_loss=0.09243, over 4265577.43 frames. ], batch size: 131, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:03:05,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=938538.0, ans=15.0 2023-06-21 10:03:17,120 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.886e+02 3.404e+02 4.103e+02 5.613e+02, threshold=6.809e+02, percent-clipped=0.0 2023-06-21 10:03:55,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=938658.0, ans=0.0 2023-06-21 10:04:25,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=938718.0, ans=0.2 2023-06-21 10:04:28,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=938778.0, ans=0.0 2023-06-21 10:04:45,825 INFO [train.py:996] (1/4) Epoch 6, batch 4000, loss[loss=0.1946, simple_loss=0.2624, pruned_loss=0.06336, over 21486.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3124, pruned_loss=0.08797, over 4268135.35 frames. 
], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:04:52,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=938838.0, ans=0.125 2023-06-21 10:05:04,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=938838.0, ans=0.125 2023-06-21 10:05:05,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938898.0, ans=0.125 2023-06-21 10:05:38,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=938958.0, ans=0.125 2023-06-21 10:06:02,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=939018.0, ans=0.2 2023-06-21 10:06:07,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-21 10:06:26,113 INFO [train.py:996] (1/4) Epoch 6, batch 4050, loss[loss=0.2235, simple_loss=0.2992, pruned_loss=0.07391, over 21470.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3122, pruned_loss=0.08646, over 4253158.76 frames. ], batch size: 211, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:06:43,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.907e+02 3.499e+02 4.123e+02 8.601e+02, threshold=6.998e+02, percent-clipped=5.0 2023-06-21 10:07:53,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=939378.0, ans=0.0 2023-06-21 10:08:00,845 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:08:13,467 INFO [train.py:996] (1/4) Epoch 6, batch 4100, loss[loss=0.2245, simple_loss=0.3015, pruned_loss=0.07369, over 21602.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.315, pruned_loss=0.08634, over 4261642.40 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:08:50,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=939498.0, ans=0.125 2023-06-21 10:09:09,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=939558.0, ans=0.025 2023-06-21 10:09:25,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-06-21 10:09:26,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=939618.0, ans=0.0 2023-06-21 10:09:33,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-21 10:09:54,960 INFO [train.py:996] (1/4) Epoch 6, batch 4150, loss[loss=0.2452, simple_loss=0.3162, pruned_loss=0.08712, over 21577.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.315, pruned_loss=0.08373, over 4270653.15 frames. 
], batch size: 548, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:09:55,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=939738.0, ans=0.0 2023-06-21 10:10:06,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=939738.0, ans=0.125 2023-06-21 10:10:11,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-21 10:10:17,695 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 3.037e+02 3.666e+02 4.331e+02 9.059e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-21 10:10:24,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=939798.0, ans=0.1 2023-06-21 10:10:44,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=939858.0, ans=0.0 2023-06-21 10:10:52,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=939858.0, ans=0.125 2023-06-21 10:11:44,085 INFO [train.py:996] (1/4) Epoch 6, batch 4200, loss[loss=0.2987, simple_loss=0.3624, pruned_loss=0.1175, over 21469.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3142, pruned_loss=0.08266, over 4270132.36 frames. ], batch size: 473, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:11:49,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=940038.0, ans=0.2 2023-06-21 10:12:06,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=940098.0, ans=0.1 2023-06-21 10:12:41,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=940158.0, ans=0.125 2023-06-21 10:13:33,267 INFO [train.py:996] (1/4) Epoch 6, batch 4250, loss[loss=0.3114, simple_loss=0.3879, pruned_loss=0.1174, over 21576.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3243, pruned_loss=0.08552, over 4266502.70 frames. ], batch size: 441, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:13:46,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=940338.0, ans=0.0 2023-06-21 10:13:52,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.226e+02 3.853e+02 4.783e+02 9.792e+02, threshold=7.707e+02, percent-clipped=2.0 2023-06-21 10:14:05,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=940398.0, ans=0.2 2023-06-21 10:14:47,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=940578.0, ans=0.125 2023-06-21 10:14:48,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-21 10:14:58,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=22.5 2023-06-21 10:15:15,852 INFO [train.py:996] (1/4) Epoch 6, batch 4300, loss[loss=0.2313, simple_loss=0.3013, pruned_loss=0.08063, over 21305.00 frames. 
], tot_loss[loss=0.2529, simple_loss=0.3294, pruned_loss=0.0882, over 4266997.85 frames. ], batch size: 159, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:15:43,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=940698.0, ans=0.0 2023-06-21 10:15:43,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940698.0, ans=0.1 2023-06-21 10:16:52,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-21 10:17:01,944 INFO [train.py:996] (1/4) Epoch 6, batch 4350, loss[loss=0.2155, simple_loss=0.2845, pruned_loss=0.0733, over 21827.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3281, pruned_loss=0.08771, over 4262222.87 frames. ], batch size: 107, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:17:12,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=940938.0, ans=15.0 2023-06-21 10:17:16,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 2.990e+02 3.501e+02 4.556e+02 7.699e+02, threshold=7.002e+02, percent-clipped=0.0 2023-06-21 10:17:23,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=940998.0, ans=0.125 2023-06-21 10:18:43,521 INFO [train.py:996] (1/4) Epoch 6, batch 4400, loss[loss=0.2449, simple_loss=0.328, pruned_loss=0.08092, over 21902.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3228, pruned_loss=0.08768, over 4271225.94 frames. ], batch size: 373, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:18:43,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=941238.0, ans=0.125 2023-06-21 10:18:51,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=941238.0, ans=0.125 2023-06-21 10:19:33,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=941358.0, ans=0.2 2023-06-21 10:19:59,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=941418.0, ans=0.0 2023-06-21 10:20:18,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=941478.0, ans=0.125 2023-06-21 10:20:24,828 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:20:26,294 INFO [train.py:996] (1/4) Epoch 6, batch 4450, loss[loss=0.3002, simple_loss=0.3943, pruned_loss=0.103, over 21769.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3325, pruned_loss=0.08966, over 4270311.56 frames. ], batch size: 332, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:20:31,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=941538.0, ans=0.125 2023-06-21 10:20:41,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. 
limit=22.5 2023-06-21 10:20:47,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 2.781e+02 3.285e+02 3.917e+02 7.316e+02, threshold=6.570e+02, percent-clipped=2.0 2023-06-21 10:21:26,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=15.0 2023-06-21 10:22:06,936 INFO [train.py:996] (1/4) Epoch 6, batch 4500, loss[loss=0.2612, simple_loss=0.3304, pruned_loss=0.09597, over 20214.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3327, pruned_loss=0.09173, over 4273915.97 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:22:42,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=941898.0, ans=0.1 2023-06-21 10:23:54,920 INFO [train.py:996] (1/4) Epoch 6, batch 4550, loss[loss=0.3339, simple_loss=0.3985, pruned_loss=0.1347, over 21414.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3349, pruned_loss=0.0915, over 4268399.40 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:24:16,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.720e+02 3.042e+02 3.501e+02 7.303e+02, threshold=6.084e+02, percent-clipped=1.0 2023-06-21 10:25:37,654 INFO [train.py:996] (1/4) Epoch 6, batch 4600, loss[loss=0.2325, simple_loss=0.3053, pruned_loss=0.07979, over 21821.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3376, pruned_loss=0.09436, over 4274267.58 frames. ], batch size: 282, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:25:59,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-21 10:26:17,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=942498.0, ans=0.125 2023-06-21 10:26:49,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=942618.0, ans=0.125 2023-06-21 10:26:49,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=942618.0, ans=0.125 2023-06-21 10:27:03,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=942678.0, ans=10.0 2023-06-21 10:27:18,154 INFO [train.py:996] (1/4) Epoch 6, batch 4650, loss[loss=0.2269, simple_loss=0.3032, pruned_loss=0.07528, over 21485.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3295, pruned_loss=0.09189, over 4283021.90 frames. ], batch size: 131, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:27:22,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=942738.0, ans=0.5 2023-06-21 10:27:44,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.693e+02 3.104e+02 3.574e+02 6.080e+02, threshold=6.208e+02, percent-clipped=0.0 2023-06-21 10:28:20,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=942918.0, ans=0.1 2023-06-21 10:28:21,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. 
limit=15.0 2023-06-21 10:28:29,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-21 10:28:35,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=942918.0, ans=0.0 2023-06-21 10:28:59,766 INFO [train.py:996] (1/4) Epoch 6, batch 4700, loss[loss=0.2156, simple_loss=0.2803, pruned_loss=0.0754, over 21518.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3203, pruned_loss=0.08946, over 4286914.73 frames. ], batch size: 391, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:30:35,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=943278.0, ans=0.125 2023-06-21 10:30:39,992 INFO [train.py:996] (1/4) Epoch 6, batch 4750, loss[loss=0.2584, simple_loss=0.3214, pruned_loss=0.09771, over 21714.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.314, pruned_loss=0.08835, over 4280626.82 frames. ], batch size: 391, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:30:58,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=943338.0, ans=0.1 2023-06-21 10:31:00,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.759e+02 3.391e+02 4.138e+02 8.179e+02, threshold=6.782e+02, percent-clipped=2.0 2023-06-21 10:31:16,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=943398.0, ans=0.07 2023-06-21 10:31:20,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=943458.0, ans=0.0 2023-06-21 10:32:20,454 INFO [train.py:996] (1/4) Epoch 6, batch 4800, loss[loss=0.2759, simple_loss=0.3833, pruned_loss=0.08425, over 19810.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3147, pruned_loss=0.08851, over 4279268.47 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:32:50,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943698.0, ans=0.1 2023-06-21 10:32:59,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=943698.0, ans=0.07 2023-06-21 10:33:30,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=943818.0, ans=0.0 2023-06-21 10:33:59,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-21 10:34:00,638 INFO [train.py:996] (1/4) Epoch 6, batch 4850, loss[loss=0.2579, simple_loss=0.3208, pruned_loss=0.09748, over 21640.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3134, pruned_loss=0.08737, over 4272645.39 frames. 
], batch size: 230, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:34:26,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=943998.0, ans=0.1 2023-06-21 10:34:28,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.010e+02 3.635e+02 4.678e+02 6.819e+02, threshold=7.270e+02, percent-clipped=1.0 2023-06-21 10:34:33,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=943998.0, ans=0.0 2023-06-21 10:34:51,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2023-06-21 10:35:18,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944118.0, ans=0.125 2023-06-21 10:35:22,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=944178.0, ans=0.0 2023-06-21 10:35:36,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=944178.0, ans=0.0 2023-06-21 10:35:42,673 INFO [train.py:996] (1/4) Epoch 6, batch 4900, loss[loss=0.2676, simple_loss=0.338, pruned_loss=0.09858, over 21675.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.315, pruned_loss=0.08879, over 4281959.87 frames. ], batch size: 389, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:36:04,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=944238.0, ans=0.1 2023-06-21 10:36:24,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=944298.0, ans=0.1 2023-06-21 10:36:37,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=944358.0, ans=0.125 2023-06-21 10:36:46,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-06-21 10:37:08,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-21 10:37:34,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=15.0 2023-06-21 10:37:36,982 INFO [train.py:996] (1/4) Epoch 6, batch 4950, loss[loss=0.2254, simple_loss=0.3316, pruned_loss=0.05958, over 21137.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3192, pruned_loss=0.08686, over 4272082.61 frames. ], batch size: 548, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:37:43,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=944538.0, ans=0.04949747468305833 2023-06-21 10:37:47,402 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.24 vs. 
limit=15.0 2023-06-21 10:37:53,940 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.748e+02 3.392e+02 4.071e+02 6.752e+02, threshold=6.784e+02, percent-clipped=0.0 2023-06-21 10:38:28,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944718.0, ans=0.125 2023-06-21 10:38:41,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-21 10:38:47,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=944778.0, ans=0.125 2023-06-21 10:38:57,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=944778.0, ans=0.125 2023-06-21 10:39:10,129 INFO [train.py:996] (1/4) Epoch 6, batch 5000, loss[loss=0.3189, simple_loss=0.3908, pruned_loss=0.1235, over 21447.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3186, pruned_loss=0.08333, over 4278220.40 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:39:16,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944838.0, ans=0.125 2023-06-21 10:39:24,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=944898.0, ans=0.125 2023-06-21 10:39:45,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-21 10:40:10,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=945018.0, ans=0.125 2023-06-21 10:40:43,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-21 10:40:43,881 INFO [train.py:996] (1/4) Epoch 6, batch 5050, loss[loss=0.311, simple_loss=0.3589, pruned_loss=0.1315, over 21684.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3206, pruned_loss=0.08593, over 4287188.95 frames. ], batch size: 473, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:41:00,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.826e+02 3.138e+02 3.685e+02 6.329e+02, threshold=6.276e+02, percent-clipped=0.0 2023-06-21 10:41:10,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=945198.0, ans=0.125 2023-06-21 10:41:10,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=945198.0, ans=0.125 2023-06-21 10:41:45,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=945318.0, ans=0.0 2023-06-21 10:42:16,824 INFO [train.py:996] (1/4) Epoch 6, batch 5100, loss[loss=0.2312, simple_loss=0.3052, pruned_loss=0.07859, over 21845.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3193, pruned_loss=0.08622, over 4285048.79 frames. 
], batch size: 332, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:42:40,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=945498.0, ans=0.07 2023-06-21 10:43:04,096 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:43:25,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=945618.0, ans=0.2 2023-06-21 10:43:56,807 INFO [train.py:996] (1/4) Epoch 6, batch 5150, loss[loss=0.2317, simple_loss=0.2931, pruned_loss=0.0852, over 21396.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3183, pruned_loss=0.08746, over 4290784.24 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:44:19,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 2.990e+02 3.465e+02 4.317e+02 6.616e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-21 10:44:24,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=945798.0, ans=0.125 2023-06-21 10:44:26,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945798.0, ans=0.125 2023-06-21 10:44:48,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=945858.0, ans=0.125 2023-06-21 10:45:13,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=945918.0, ans=0.125 2023-06-21 10:45:38,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-21 10:45:42,239 INFO [train.py:996] (1/4) Epoch 6, batch 5200, loss[loss=0.3461, simple_loss=0.4109, pruned_loss=0.1407, over 21548.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.32, pruned_loss=0.08822, over 4291003.93 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:45:47,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=946038.0, ans=10.0 2023-06-21 10:46:16,395 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:46:20,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-21 10:46:34,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-21 10:46:41,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-21 10:47:17,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=946278.0, ans=0.0 2023-06-21 10:47:21,848 INFO [train.py:996] (1/4) Epoch 6, batch 5250, loss[loss=0.236, simple_loss=0.3274, pruned_loss=0.07227, over 21786.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3223, pruned_loss=0.08601, over 4283143.01 frames. 
], batch size: 282, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:47:36,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=946398.0, ans=0.125 2023-06-21 10:47:39,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.923e+02 3.658e+02 4.333e+02 7.638e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-21 10:48:11,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=946458.0, ans=0.125 2023-06-21 10:49:00,831 INFO [train.py:996] (1/4) Epoch 6, batch 5300, loss[loss=0.2585, simple_loss=0.3176, pruned_loss=0.09967, over 21904.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3216, pruned_loss=0.08662, over 4283351.08 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:49:05,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=946638.0, ans=0.025 2023-06-21 10:49:10,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=946638.0, ans=0.95 2023-06-21 10:49:37,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=946758.0, ans=0.125 2023-06-21 10:49:40,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=946758.0, ans=0.125 2023-06-21 10:49:46,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=946758.0, ans=0.2 2023-06-21 10:49:48,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=946758.0, ans=0.1 2023-06-21 10:50:39,174 INFO [train.py:996] (1/4) Epoch 6, batch 5350, loss[loss=0.298, simple_loss=0.3395, pruned_loss=0.1283, over 21816.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3207, pruned_loss=0.08816, over 4286063.71 frames. ], batch size: 508, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:50:58,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.908e+02 3.182e+02 3.564e+02 5.714e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 10:51:05,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=22.5 2023-06-21 10:51:16,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-21 10:51:57,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=947118.0, ans=0.125 2023-06-21 10:52:13,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947178.0, ans=0.1 2023-06-21 10:52:18,091 INFO [train.py:996] (1/4) Epoch 6, batch 5400, loss[loss=0.235, simple_loss=0.3074, pruned_loss=0.08131, over 21920.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3195, pruned_loss=0.08978, over 4296891.28 frames. 
], batch size: 118, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:52:18,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=947238.0, ans=0.04949747468305833 2023-06-21 10:52:29,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=12.0 2023-06-21 10:52:39,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=947298.0, ans=0.125 2023-06-21 10:52:56,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=947358.0, ans=0.0 2023-06-21 10:53:35,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-21 10:53:37,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=947478.0, ans=0.2 2023-06-21 10:53:53,494 INFO [train.py:996] (1/4) Epoch 6, batch 5450, loss[loss=0.2063, simple_loss=0.2948, pruned_loss=0.05891, over 21617.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3208, pruned_loss=0.08837, over 4295442.70 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:54:12,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=947598.0, ans=0.125 2023-06-21 10:54:17,104 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 2.977e+02 3.575e+02 4.401e+02 6.671e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-21 10:54:19,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-21 10:55:07,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-21 10:55:19,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.77 vs. limit=10.0 2023-06-21 10:55:34,820 INFO [train.py:996] (1/4) Epoch 6, batch 5500, loss[loss=0.2749, simple_loss=0.363, pruned_loss=0.09341, over 21734.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3255, pruned_loss=0.08514, over 4283722.02 frames. ], batch size: 351, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:57:06,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948078.0, ans=0.125 2023-06-21 10:57:18,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.11 vs. limit=12.0 2023-06-21 10:57:18,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-21 10:57:20,571 INFO [train.py:996] (1/4) Epoch 6, batch 5550, loss[loss=0.1391, simple_loss=0.1956, pruned_loss=0.04129, over 16216.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3238, pruned_loss=0.08173, over 4281987.02 frames. 
], batch size: 61, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:57:45,151 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.700e+02 3.215e+02 3.984e+02 5.956e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-21 10:58:14,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=948258.0, ans=0.125 2023-06-21 10:58:31,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=948318.0, ans=0.0 2023-06-21 10:59:07,564 INFO [train.py:996] (1/4) Epoch 6, batch 5600, loss[loss=0.2599, simple_loss=0.3469, pruned_loss=0.08646, over 21661.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3214, pruned_loss=0.07897, over 4276840.37 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 10:59:09,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=948438.0, ans=0.125 2023-06-21 10:59:16,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-21 10:59:57,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=948558.0, ans=0.0 2023-06-21 11:00:19,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=948618.0, ans=0.125 2023-06-21 11:00:21,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=948618.0, ans=0.0 2023-06-21 11:00:30,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948678.0, ans=0.1 2023-06-21 11:00:46,526 INFO [train.py:996] (1/4) Epoch 6, batch 5650, loss[loss=0.2474, simple_loss=0.3249, pruned_loss=0.08497, over 21724.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3254, pruned_loss=0.08184, over 4275058.56 frames. ], batch size: 389, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:01:10,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-21 11:01:10,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.904e+02 3.593e+02 4.833e+02 7.419e+02, threshold=7.185e+02, percent-clipped=8.0 2023-06-21 11:01:14,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=948798.0, ans=0.07 2023-06-21 11:02:20,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=948978.0, ans=0.125 2023-06-21 11:02:28,412 INFO [train.py:996] (1/4) Epoch 6, batch 5700, loss[loss=0.2299, simple_loss=0.3096, pruned_loss=0.0751, over 21616.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3258, pruned_loss=0.08412, over 4277208.74 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:02:31,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-21 11:02:37,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949038.0, ans=0.1 2023-06-21 11:02:46,921 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:03:05,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=949158.0, ans=0.5 2023-06-21 11:03:16,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949158.0, ans=0.1 2023-06-21 11:03:24,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=949158.0, ans=0.125 2023-06-21 11:04:07,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=949278.0, ans=0.125 2023-06-21 11:04:14,742 INFO [train.py:996] (1/4) Epoch 6, batch 5750, loss[loss=0.2048, simple_loss=0.2933, pruned_loss=0.05813, over 21698.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3183, pruned_loss=0.08024, over 4267237.85 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:04:18,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=949338.0, ans=0.035 2023-06-21 11:04:21,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=949338.0, ans=0.1 2023-06-21 11:04:34,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.640e+02 3.214e+02 3.808e+02 7.764e+02, threshold=6.428e+02, percent-clipped=2.0 2023-06-21 11:05:56,169 INFO [train.py:996] (1/4) Epoch 6, batch 5800, loss[loss=0.2187, simple_loss=0.3014, pruned_loss=0.06804, over 21263.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.318, pruned_loss=0.07938, over 4268142.55 frames. ], batch size: 144, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:06:20,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=949698.0, ans=0.125 2023-06-21 11:06:30,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=949698.0, ans=0.125 2023-06-21 11:06:43,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-21 11:07:14,771 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:07:39,076 INFO [train.py:996] (1/4) Epoch 6, batch 5850, loss[loss=0.1815, simple_loss=0.283, pruned_loss=0.03996, over 21783.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3159, pruned_loss=0.07528, over 4265926.20 frames. 
], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:08:03,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.386e+02 2.763e+02 3.432e+02 5.220e+02, threshold=5.525e+02, percent-clipped=0.0 2023-06-21 11:08:29,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950058.0, ans=0.1 2023-06-21 11:08:43,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=950118.0, ans=0.05 2023-06-21 11:09:01,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=950178.0, ans=0.04949747468305833 2023-06-21 11:09:18,229 INFO [train.py:996] (1/4) Epoch 6, batch 5900, loss[loss=0.3149, simple_loss=0.3647, pruned_loss=0.1325, over 21611.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3078, pruned_loss=0.06927, over 4270916.43 frames. ], batch size: 507, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:09:41,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=950298.0, ans=0.125 2023-06-21 11:10:15,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=950358.0, ans=0.125 2023-06-21 11:10:17,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=950358.0, ans=0.125 2023-06-21 11:10:33,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=950418.0, ans=0.09899494936611666 2023-06-21 11:10:33,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-21 11:10:39,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-21 11:10:43,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-21 11:10:57,226 INFO [train.py:996] (1/4) Epoch 6, batch 5950, loss[loss=0.2357, simple_loss=0.2947, pruned_loss=0.08829, over 21216.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3075, pruned_loss=0.07345, over 4280142.19 frames. ], batch size: 176, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 11:10:59,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=950538.0, ans=0.025 2023-06-21 11:11:05,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. 
limit=22.5 2023-06-21 11:11:22,245 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 2.586e+02 3.198e+02 3.905e+02 6.345e+02, threshold=6.395e+02, percent-clipped=3.0 2023-06-21 11:11:40,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=950658.0, ans=0.125 2023-06-21 11:11:50,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=950658.0, ans=0.125 2023-06-21 11:11:59,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=950658.0, ans=0.0 2023-06-21 11:12:10,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=950718.0, ans=0.125 2023-06-21 11:12:11,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-21 11:12:18,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=950778.0, ans=0.125 2023-06-21 11:12:37,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=950778.0, ans=0.125 2023-06-21 11:12:40,605 INFO [train.py:996] (1/4) Epoch 6, batch 6000, loss[loss=0.2206, simple_loss=0.28, pruned_loss=0.08065, over 21806.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3039, pruned_loss=0.07735, over 4285660.58 frames. ], batch size: 118, lr: 5.24e-03, grad_scale: 32.0 2023-06-21 11:12:40,606 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 11:12:57,290 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2656, simple_loss=0.3626, pruned_loss=0.08426, over 1796401.00 frames. 2023-06-21 11:12:57,291 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 11:13:24,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=950898.0, ans=0.125 2023-06-21 11:13:42,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-21 11:13:51,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950958.0, ans=0.1 2023-06-21 11:13:57,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=950958.0, ans=0.0 2023-06-21 11:14:25,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=951078.0, ans=0.125 2023-06-21 11:14:31,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=951078.0, ans=0.0 2023-06-21 11:14:43,499 INFO [train.py:996] (1/4) Epoch 6, batch 6050, loss[loss=0.2312, simple_loss=0.2846, pruned_loss=0.08888, over 21922.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3, pruned_loss=0.07844, over 4284484.76 frames. 
], batch size: 113, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:14:48,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=951138.0, ans=0.2 2023-06-21 11:15:16,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.804e+02 3.230e+02 3.761e+02 6.873e+02, threshold=6.459e+02, percent-clipped=1.0 2023-06-21 11:15:34,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-21 11:16:12,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951378.0, ans=0.125 2023-06-21 11:16:16,196 INFO [train.py:996] (1/4) Epoch 6, batch 6100, loss[loss=0.2336, simple_loss=0.3047, pruned_loss=0.08125, over 21656.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2995, pruned_loss=0.07734, over 4283580.25 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:16:32,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=15.0 2023-06-21 11:16:44,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951498.0, ans=0.1 2023-06-21 11:16:57,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=951498.0, ans=0.125 2023-06-21 11:17:05,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=951558.0, ans=0.2 2023-06-21 11:17:18,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-21 11:17:32,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951618.0, ans=0.1 2023-06-21 11:17:41,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=951678.0, ans=0.0 2023-06-21 11:17:55,285 INFO [train.py:996] (1/4) Epoch 6, batch 6150, loss[loss=0.1968, simple_loss=0.2648, pruned_loss=0.0644, over 21078.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.303, pruned_loss=0.08025, over 4280025.53 frames. ], batch size: 143, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:18:33,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.602e+02 3.011e+02 3.655e+02 5.167e+02, threshold=6.022e+02, percent-clipped=0.0 2023-06-21 11:18:53,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=951858.0, ans=0.2 2023-06-21 11:19:06,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951918.0, ans=0.1 2023-06-21 11:19:07,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=951918.0, ans=0.1 2023-06-21 11:19:39,846 INFO [train.py:996] (1/4) Epoch 6, batch 6200, loss[loss=0.2459, simple_loss=0.3126, pruned_loss=0.0896, over 21534.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3085, pruned_loss=0.08093, over 4271173.22 frames. 
], batch size: 131, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:19:50,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-21 11:20:21,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-21 11:21:04,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=952278.0, ans=0.0 2023-06-21 11:21:20,427 INFO [train.py:996] (1/4) Epoch 6, batch 6250, loss[loss=0.2228, simple_loss=0.3193, pruned_loss=0.06312, over 21636.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3134, pruned_loss=0.08053, over 4270387.05 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:21:39,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=952338.0, ans=0.0 2023-06-21 11:21:53,722 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.114e+02 4.042e+02 5.400e+02 9.374e+02, threshold=8.084e+02, percent-clipped=17.0 2023-06-21 11:22:11,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 11:22:23,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=952518.0, ans=0.2 2023-06-21 11:22:39,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952578.0, ans=0.125 2023-06-21 11:22:58,316 INFO [train.py:996] (1/4) Epoch 6, batch 6300, loss[loss=0.2268, simple_loss=0.3107, pruned_loss=0.07145, over 21497.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3166, pruned_loss=0.07949, over 4270532.02 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:23:01,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=952638.0, ans=0.2 2023-06-21 11:23:13,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952638.0, ans=0.1 2023-06-21 11:23:25,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-21 11:23:55,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:24:39,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=952878.0, ans=0.0 2023-06-21 11:24:48,243 INFO [train.py:996] (1/4) Epoch 6, batch 6350, loss[loss=0.2726, simple_loss=0.3342, pruned_loss=0.1055, over 21774.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3195, pruned_loss=0.08383, over 4272990.26 frames. ], batch size: 298, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:25:07,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=22.5 2023-06-21 11:25:17,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.069e+02 3.635e+02 4.276e+02 7.885e+02, threshold=7.269e+02, percent-clipped=0.0 2023-06-21 11:25:20,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=952998.0, ans=0.125 2023-06-21 11:26:02,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=953178.0, ans=0.0 2023-06-21 11:26:28,455 INFO [train.py:996] (1/4) Epoch 6, batch 6400, loss[loss=0.2565, simple_loss=0.3302, pruned_loss=0.09141, over 21452.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3257, pruned_loss=0.08872, over 4278119.46 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:26:30,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=953238.0, ans=0.04949747468305833 2023-06-21 11:26:30,909 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-21 11:26:41,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=953238.0, ans=0.0 2023-06-21 11:26:50,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=953298.0, ans=0.0 2023-06-21 11:26:59,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=953298.0, ans=0.0 2023-06-21 11:27:10,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=953358.0, ans=0.125 2023-06-21 11:27:57,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=953478.0, ans=0.0 2023-06-21 11:28:12,563 INFO [train.py:996] (1/4) Epoch 6, batch 6450, loss[loss=0.2295, simple_loss=0.2979, pruned_loss=0.0806, over 21183.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3309, pruned_loss=0.08909, over 4271165.14 frames. ], batch size: 143, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:28:15,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=953538.0, ans=0.125 2023-06-21 11:28:28,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=953598.0, ans=0.125 2023-06-21 11:28:36,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.859e+02 3.374e+02 4.192e+02 6.332e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-21 11:28:39,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-21 11:28:40,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=953598.0, ans=0.0 2023-06-21 11:29:15,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953718.0, ans=0.1 2023-06-21 11:29:54,932 INFO [train.py:996] (1/4) Epoch 6, batch 6500, loss[loss=0.2348, simple_loss=0.2928, pruned_loss=0.08839, over 21208.00 frames. 
], tot_loss[loss=0.2502, simple_loss=0.3243, pruned_loss=0.08801, over 4274939.48 frames. ], batch size: 176, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:31:35,555 INFO [train.py:996] (1/4) Epoch 6, batch 6550, loss[loss=0.2317, simple_loss=0.3379, pruned_loss=0.0627, over 21305.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3222, pruned_loss=0.08637, over 4281025.12 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:31:48,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=954138.0, ans=0.125 2023-06-21 11:31:58,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=954198.0, ans=0.125 2023-06-21 11:31:59,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.779e+02 3.082e+02 3.818e+02 7.032e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-21 11:32:24,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-21 11:32:35,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=954318.0, ans=15.0 2023-06-21 11:32:44,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.19 vs. limit=10.0 2023-06-21 11:33:14,272 INFO [train.py:996] (1/4) Epoch 6, batch 6600, loss[loss=0.1973, simple_loss=0.2899, pruned_loss=0.05229, over 20972.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3159, pruned_loss=0.08626, over 4285215.27 frames. ], batch size: 608, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:33:27,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=954438.0, ans=0.0 2023-06-21 11:33:38,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=954498.0, ans=0.1 2023-06-21 11:34:12,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=954618.0, ans=0.125 2023-06-21 11:34:52,641 INFO [train.py:996] (1/4) Epoch 6, batch 6650, loss[loss=0.2273, simple_loss=0.2834, pruned_loss=0.08562, over 21522.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.308, pruned_loss=0.08283, over 4276071.46 frames. ], batch size: 195, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:35:08,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=954798.0, ans=0.125 2023-06-21 11:35:21,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.585e+02 3.021e+02 3.677e+02 6.066e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-21 11:35:32,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-21 11:35:45,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=954858.0, ans=0.0 2023-06-21 11:36:30,472 INFO [train.py:996] (1/4) Epoch 6, batch 6700, loss[loss=0.2842, simple_loss=0.3396, pruned_loss=0.1144, over 21536.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3034, pruned_loss=0.08326, over 4276415.79 frames. 
], batch size: 442, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:36:39,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=955038.0, ans=0.0 2023-06-21 11:37:09,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=955158.0, ans=0.0 2023-06-21 11:37:46,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=955218.0, ans=0.1 2023-06-21 11:38:09,242 INFO [train.py:996] (1/4) Epoch 6, batch 6750, loss[loss=0.2597, simple_loss=0.3198, pruned_loss=0.09978, over 21677.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2999, pruned_loss=0.08317, over 4269753.97 frames. ], batch size: 441, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:38:30,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=955398.0, ans=0.125 2023-06-21 11:38:32,818 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.820e+02 3.249e+02 3.969e+02 6.943e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-21 11:39:44,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-21 11:39:47,075 INFO [train.py:996] (1/4) Epoch 6, batch 6800, loss[loss=0.2488, simple_loss=0.3037, pruned_loss=0.09696, over 21538.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3026, pruned_loss=0.08536, over 4274184.30 frames. ], batch size: 414, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:39:47,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-21 11:40:00,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-21 11:40:25,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=955758.0, ans=0.07 2023-06-21 11:40:26,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=955758.0, ans=0.125 2023-06-21 11:40:43,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=955818.0, ans=0.5 2023-06-21 11:40:56,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=955818.0, ans=0.125 2023-06-21 11:40:56,990 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-21 11:40:59,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=955818.0, ans=0.04949747468305833 2023-06-21 11:41:24,266 INFO [train.py:996] (1/4) Epoch 6, batch 6850, loss[loss=0.2463, simple_loss=0.3089, pruned_loss=0.09182, over 21686.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3025, pruned_loss=0.08699, over 4278943.76 frames. 
], batch size: 389, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:41:48,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.884e+02 3.443e+02 4.128e+02 6.086e+02, threshold=6.887e+02, percent-clipped=0.0 2023-06-21 11:42:05,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=956058.0, ans=0.125 2023-06-21 11:42:22,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-21 11:43:00,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=956178.0, ans=0.125 2023-06-21 11:43:04,857 INFO [train.py:996] (1/4) Epoch 6, batch 6900, loss[loss=0.341, simple_loss=0.4484, pruned_loss=0.1168, over 19813.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3055, pruned_loss=0.08788, over 4279218.51 frames. ], batch size: 702, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:44:45,068 INFO [train.py:996] (1/4) Epoch 6, batch 6950, loss[loss=0.1772, simple_loss=0.2684, pruned_loss=0.04303, over 21385.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3083, pruned_loss=0.08345, over 4278535.67 frames. ], batch size: 211, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:45:11,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=956598.0, ans=0.1 2023-06-21 11:45:11,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=956598.0, ans=0.125 2023-06-21 11:45:13,812 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.660e+02 3.171e+02 3.598e+02 5.873e+02, threshold=6.343e+02, percent-clipped=0.0 2023-06-21 11:45:14,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=956598.0, ans=0.125 2023-06-21 11:45:57,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0 2023-06-21 11:46:02,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=956718.0, ans=0.0 2023-06-21 11:46:24,056 INFO [train.py:996] (1/4) Epoch 6, batch 7000, loss[loss=0.2155, simple_loss=0.2798, pruned_loss=0.0756, over 21770.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3122, pruned_loss=0.08671, over 4282601.53 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:46:27,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=956838.0, ans=0.125 2023-06-21 11:47:23,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=956958.0, ans=0.1 2023-06-21 11:48:04,821 INFO [train.py:996] (1/4) Epoch 6, batch 7050, loss[loss=0.2669, simple_loss=0.3285, pruned_loss=0.1026, over 20819.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.311, pruned_loss=0.08578, over 4276052.20 frames. ], batch size: 608, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:48:36,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=8.0 2023-06-21 11:48:38,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.010e+02 3.429e+02 4.410e+02 6.547e+02, threshold=6.858e+02, percent-clipped=1.0 2023-06-21 11:48:39,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957198.0, ans=0.0 2023-06-21 11:49:02,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-21 11:49:03,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957258.0, ans=0.0 2023-06-21 11:49:08,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957258.0, ans=0.125 2023-06-21 11:49:29,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=957378.0, ans=0.125 2023-06-21 11:49:45,405 INFO [train.py:996] (1/4) Epoch 6, batch 7100, loss[loss=0.2303, simple_loss=0.3329, pruned_loss=0.06389, over 19882.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3151, pruned_loss=0.08745, over 4268832.40 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:50:52,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-21 11:51:25,094 INFO [train.py:996] (1/4) Epoch 6, batch 7150, loss[loss=0.3058, simple_loss=0.3679, pruned_loss=0.1219, over 21801.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.312, pruned_loss=0.08464, over 4274462.31 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:52:08,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.713e+02 3.100e+02 3.583e+02 6.411e+02, threshold=6.200e+02, percent-clipped=0.0 2023-06-21 11:52:23,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=957858.0, ans=0.0 2023-06-21 11:53:12,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=957978.0, ans=0.5 2023-06-21 11:53:14,726 INFO [train.py:996] (1/4) Epoch 6, batch 7200, loss[loss=0.2366, simple_loss=0.2936, pruned_loss=0.08981, over 21658.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3146, pruned_loss=0.08689, over 4278062.53 frames. ], batch size: 298, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:53:15,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=958038.0, ans=0.0 2023-06-21 11:54:04,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958158.0, ans=0.125 2023-06-21 11:54:11,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0 2023-06-21 11:54:12,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=958218.0, ans=0.125 2023-06-21 11:54:54,070 INFO [train.py:996] (1/4) Epoch 6, batch 7250, loss[loss=0.1961, simple_loss=0.2532, pruned_loss=0.06944, over 20783.00 frames. ], tot_loss[loss=0.242, simple_loss=0.31, pruned_loss=0.08701, over 4274198.01 frames. 
], batch size: 608, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:55:25,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=958398.0, ans=0.125 2023-06-21 11:55:28,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.745e+02 3.061e+02 4.034e+02 7.842e+02, threshold=6.122e+02, percent-clipped=5.0 2023-06-21 11:55:41,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=958458.0, ans=0.09899494936611666 2023-06-21 11:55:41,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=958458.0, ans=0.125 2023-06-21 11:56:01,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=15.0 2023-06-21 11:56:07,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 11:56:27,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=958578.0, ans=0.2 2023-06-21 11:56:33,696 INFO [train.py:996] (1/4) Epoch 6, batch 7300, loss[loss=0.2135, simple_loss=0.2689, pruned_loss=0.07905, over 21781.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3048, pruned_loss=0.08595, over 4267917.85 frames. ], batch size: 118, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:57:13,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=958698.0, ans=0.125 2023-06-21 11:57:19,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=958758.0, ans=0.1 2023-06-21 11:57:21,166 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:57:26,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=958758.0, ans=0.0 2023-06-21 11:57:52,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958878.0, ans=0.0 2023-06-21 11:57:55,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-21 11:58:21,345 INFO [train.py:996] (1/4) Epoch 6, batch 7350, loss[loss=0.2327, simple_loss=0.2868, pruned_loss=0.08933, over 20736.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3026, pruned_loss=0.08567, over 4262804.93 frames. ], batch size: 609, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 11:58:52,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.849e+02 3.277e+02 4.064e+02 7.126e+02, threshold=6.555e+02, percent-clipped=2.0 2023-06-21 11:59:24,926 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:59:35,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.65 vs. 
limit=22.5 2023-06-21 11:59:53,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=959178.0, ans=0.125 2023-06-21 11:59:54,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 12:00:07,530 INFO [train.py:996] (1/4) Epoch 6, batch 7400, loss[loss=0.2196, simple_loss=0.289, pruned_loss=0.07512, over 21354.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3102, pruned_loss=0.08749, over 4256026.39 frames. ], batch size: 159, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:00:09,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=959238.0, ans=0.0 2023-06-21 12:00:34,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-21 12:00:53,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=959418.0, ans=0.0 2023-06-21 12:01:05,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-21 12:01:48,243 INFO [train.py:996] (1/4) Epoch 6, batch 7450, loss[loss=0.2549, simple_loss=0.3042, pruned_loss=0.1028, over 21303.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3071, pruned_loss=0.0868, over 4260611.94 frames. ], batch size: 160, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:01:48,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=959538.0, ans=0.0 2023-06-21 12:02:13,854 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.784e+02 3.192e+02 3.792e+02 7.564e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-21 12:03:16,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=959778.0, ans=0.125 2023-06-21 12:03:31,308 INFO [train.py:996] (1/4) Epoch 6, batch 7500, loss[loss=0.3005, simple_loss=0.3865, pruned_loss=0.1073, over 21625.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3126, pruned_loss=0.0886, over 4271342.83 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:03:40,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=959838.0, ans=0.125 2023-06-21 12:03:43,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=959838.0, ans=0.0 2023-06-21 12:03:58,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959898.0, ans=0.1 2023-06-21 12:04:00,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.43 vs. limit=6.0 2023-06-21 12:05:10,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.57 vs. 
limit=15.0 2023-06-21 12:05:15,600 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:05:16,807 INFO [train.py:996] (1/4) Epoch 6, batch 7550, loss[loss=0.2192, simple_loss=0.3093, pruned_loss=0.06451, over 21654.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3192, pruned_loss=0.08723, over 4268883.12 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:05:41,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.64 vs. limit=15.0 2023-06-21 12:05:41,922 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.161e+02 3.642e+02 4.665e+02 7.611e+02, threshold=7.284e+02, percent-clipped=6.0 2023-06-21 12:06:24,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=960318.0, ans=0.1 2023-06-21 12:06:26,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=960318.0, ans=0.95 2023-06-21 12:06:48,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=960378.0, ans=0.125 2023-06-21 12:06:56,334 INFO [train.py:996] (1/4) Epoch 6, batch 7600, loss[loss=0.2368, simple_loss=0.291, pruned_loss=0.09125, over 20953.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3176, pruned_loss=0.08538, over 4263251.87 frames. ], batch size: 613, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:06:57,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=960438.0, ans=15.0 2023-06-21 12:07:04,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=960438.0, ans=0.0 2023-06-21 12:07:25,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=960498.0, ans=0.2 2023-06-21 12:07:35,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-21 12:08:02,239 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:08:34,763 INFO [train.py:996] (1/4) Epoch 6, batch 7650, loss[loss=0.29, simple_loss=0.3325, pruned_loss=0.1237, over 21814.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3161, pruned_loss=0.08674, over 4275879.37 frames. 
], batch size: 508, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:08:36,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=960738.0, ans=0.125 2023-06-21 12:08:49,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960738.0, ans=0.1 2023-06-21 12:08:52,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=960798.0, ans=0.125 2023-06-21 12:09:01,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.034e+02 3.412e+02 4.046e+02 6.566e+02, threshold=6.823e+02, percent-clipped=0.0 2023-06-21 12:09:47,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=960918.0, ans=0.2 2023-06-21 12:10:00,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=960978.0, ans=0.125 2023-06-21 12:10:04,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=960978.0, ans=0.125 2023-06-21 12:10:17,237 INFO [train.py:996] (1/4) Epoch 6, batch 7700, loss[loss=0.2598, simple_loss=0.323, pruned_loss=0.09831, over 21840.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3205, pruned_loss=0.09096, over 4285447.59 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:10:44,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=961098.0, ans=0.125 2023-06-21 12:11:00,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=961158.0, ans=0.125 2023-06-21 12:11:37,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-21 12:11:48,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=961278.0, ans=0.0 2023-06-21 12:11:50,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=961278.0, ans=0.2 2023-06-21 12:11:57,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=961338.0, ans=0.05 2023-06-21 12:11:59,166 INFO [train.py:996] (1/4) Epoch 6, batch 7750, loss[loss=0.2925, simple_loss=0.3933, pruned_loss=0.09582, over 21769.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.325, pruned_loss=0.09059, over 4278487.47 frames. ], batch size: 332, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:12:06,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-21 12:12:11,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.14 vs. 
limit=15.0 2023-06-21 12:12:18,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=961398.0, ans=0.1 2023-06-21 12:12:30,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961398.0, ans=0.1 2023-06-21 12:12:34,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 3.114e+02 3.576e+02 4.204e+02 7.368e+02, threshold=7.152e+02, percent-clipped=1.0 2023-06-21 12:12:46,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-21 12:13:10,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=961518.0, ans=0.125 2023-06-21 12:13:11,953 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:13:22,551 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:13:24,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=961578.0, ans=0.125 2023-06-21 12:13:39,891 INFO [train.py:996] (1/4) Epoch 6, batch 7800, loss[loss=0.2256, simple_loss=0.295, pruned_loss=0.0781, over 21668.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3259, pruned_loss=0.09017, over 4265576.91 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:14:30,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961758.0, ans=0.125 2023-06-21 12:15:15,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=961878.0, ans=0.125 2023-06-21 12:15:15,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=961878.0, ans=0.125 2023-06-21 12:15:18,177 INFO [train.py:996] (1/4) Epoch 6, batch 7850, loss[loss=0.2238, simple_loss=0.2821, pruned_loss=0.08274, over 21472.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.318, pruned_loss=0.08916, over 4273376.05 frames. ], batch size: 195, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:16:02,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 2.926e+02 3.453e+02 4.214e+02 9.317e+02, threshold=6.905e+02, percent-clipped=1.0 2023-06-21 12:16:51,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=962178.0, ans=0.0 2023-06-21 12:17:00,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=962238.0, ans=0.0 2023-06-21 12:17:01,337 INFO [train.py:996] (1/4) Epoch 6, batch 7900, loss[loss=0.2742, simple_loss=0.3713, pruned_loss=0.08858, over 21634.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3143, pruned_loss=0.08898, over 4278652.95 frames. 
], batch size: 414, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:17:36,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=962298.0, ans=0.125 2023-06-21 12:17:59,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=962358.0, ans=0.07 2023-06-21 12:18:20,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=962418.0, ans=0.125 2023-06-21 12:18:26,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=962478.0, ans=0.09899494936611666 2023-06-21 12:18:42,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=962478.0, ans=0.2 2023-06-21 12:18:51,637 INFO [train.py:996] (1/4) Epoch 6, batch 7950, loss[loss=0.2754, simple_loss=0.3409, pruned_loss=0.105, over 21469.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.317, pruned_loss=0.08738, over 4275761.98 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:18:55,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=962538.0, ans=0.125 2023-06-21 12:19:29,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.498e+02 4.372e+02 5.089e+02 1.068e+03, threshold=8.743e+02, percent-clipped=8.0 2023-06-21 12:20:19,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=962778.0, ans=0.125 2023-06-21 12:20:39,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=962778.0, ans=0.125 2023-06-21 12:20:44,509 INFO [train.py:996] (1/4) Epoch 6, batch 8000, loss[loss=0.2663, simple_loss=0.3475, pruned_loss=0.09257, over 21814.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3238, pruned_loss=0.09012, over 4271760.04 frames. ], batch size: 282, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:20:48,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=962838.0, ans=0.0 2023-06-21 12:21:12,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=962898.0, ans=0.1 2023-06-21 12:22:28,306 INFO [train.py:996] (1/4) Epoch 6, batch 8050, loss[loss=0.2798, simple_loss=0.3599, pruned_loss=0.09984, over 21719.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3274, pruned_loss=0.09079, over 4264093.32 frames. ], batch size: 351, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:22:38,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=963138.0, ans=0.0 2023-06-21 12:22:55,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.992e+02 3.416e+02 4.104e+02 8.130e+02, threshold=6.832e+02, percent-clipped=0.0 2023-06-21 12:23:02,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=963258.0, ans=0.2 2023-06-21 12:23:05,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.37 vs. 
limit=15.0 2023-06-21 12:23:18,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=22.5 2023-06-21 12:23:26,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-21 12:24:07,610 INFO [train.py:996] (1/4) Epoch 6, batch 8100, loss[loss=0.2783, simple_loss=0.3421, pruned_loss=0.1073, over 21518.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3283, pruned_loss=0.09294, over 4275633.96 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:24:43,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=22.5 2023-06-21 12:25:35,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-21 12:25:39,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-21 12:25:41,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=963678.0, ans=0.125 2023-06-21 12:25:44,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=963678.0, ans=0.125 2023-06-21 12:25:44,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=963678.0, ans=0.0 2023-06-21 12:25:50,203 INFO [train.py:996] (1/4) Epoch 6, batch 8150, loss[loss=0.2658, simple_loss=0.3857, pruned_loss=0.07291, over 21159.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3353, pruned_loss=0.09424, over 4276104.55 frames. ], batch size: 548, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:26:15,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=963798.0, ans=0.04949747468305833 2023-06-21 12:26:36,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.091e+02 3.490e+02 4.370e+02 7.436e+02, threshold=6.980e+02, percent-clipped=1.0 2023-06-21 12:26:39,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=963858.0, ans=0.125 2023-06-21 12:27:03,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-21 12:27:04,101 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:27:27,543 INFO [train.py:996] (1/4) Epoch 6, batch 8200, loss[loss=0.2668, simple_loss=0.3167, pruned_loss=0.1084, over 21848.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3285, pruned_loss=0.09148, over 4278027.47 frames. ], batch size: 373, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:28:39,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-21 12:29:07,633 INFO [train.py:996] (1/4) Epoch 6, batch 8250, loss[loss=0.2993, simple_loss=0.3845, pruned_loss=0.107, over 21606.00 frames. 
], tot_loss[loss=0.255, simple_loss=0.3268, pruned_loss=0.09153, over 4274831.84 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:29:56,750 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.891e+02 3.432e+02 4.145e+02 7.025e+02, threshold=6.865e+02, percent-clipped=1.0 2023-06-21 12:30:46,907 INFO [train.py:996] (1/4) Epoch 6, batch 8300, loss[loss=0.2135, simple_loss=0.2904, pruned_loss=0.06826, over 21383.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3252, pruned_loss=0.08789, over 4277868.87 frames. ], batch size: 211, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:31:06,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-21 12:31:13,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-21 12:31:29,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=964698.0, ans=0.125 2023-06-21 12:31:37,506 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:31:42,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-21 12:31:50,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=964758.0, ans=0.5 2023-06-21 12:31:56,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=964818.0, ans=0.2 2023-06-21 12:32:33,461 INFO [train.py:996] (1/4) Epoch 6, batch 8350, loss[loss=0.241, simple_loss=0.3186, pruned_loss=0.08169, over 21679.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3238, pruned_loss=0.08535, over 4277987.46 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:32:49,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-21 12:33:06,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=964998.0, ans=0.125 2023-06-21 12:33:17,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.748e+02 3.092e+02 3.699e+02 5.409e+02, threshold=6.184e+02, percent-clipped=0.0 2023-06-21 12:33:33,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-21 12:33:40,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=965118.0, ans=0.125 2023-06-21 12:34:01,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=965178.0, ans=0.125 2023-06-21 12:34:14,493 INFO [train.py:996] (1/4) Epoch 6, batch 8400, loss[loss=0.1785, simple_loss=0.2397, pruned_loss=0.05872, over 21865.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3187, pruned_loss=0.08155, over 4271398.57 frames. 
], batch size: 107, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:35:03,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=965358.0, ans=0.125 2023-06-21 12:35:13,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-21 12:35:16,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-21 12:35:21,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=965418.0, ans=0.0 2023-06-21 12:35:42,883 INFO [train.py:996] (1/4) Epoch 6, batch 8450, loss[loss=0.2355, simple_loss=0.2926, pruned_loss=0.08922, over 21324.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3165, pruned_loss=0.08099, over 4276088.68 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:35:55,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=965538.0, ans=0.125 2023-06-21 12:36:03,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=965538.0, ans=0.0 2023-06-21 12:36:28,774 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.529e+02 3.064e+02 3.775e+02 6.261e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-21 12:36:40,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-21 12:36:48,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=965718.0, ans=0.0 2023-06-21 12:36:48,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=965718.0, ans=0.04949747468305833 2023-06-21 12:36:59,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965718.0, ans=0.1 2023-06-21 12:37:09,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=965778.0, ans=0.0 2023-06-21 12:37:17,958 INFO [train.py:996] (1/4) Epoch 6, batch 8500, loss[loss=0.2248, simple_loss=0.2847, pruned_loss=0.08243, over 21705.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3147, pruned_loss=0.08326, over 4269151.58 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:37:43,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=965838.0, ans=0.05 2023-06-21 12:38:39,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 12:38:58,627 INFO [train.py:996] (1/4) Epoch 6, batch 8550, loss[loss=0.2455, simple_loss=0.3314, pruned_loss=0.0798, over 21789.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3189, pruned_loss=0.08615, over 4259564.46 frames. 
], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:39:40,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966198.0, ans=0.125 2023-06-21 12:39:47,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.008e+02 3.313e+02 4.045e+02 7.159e+02, threshold=6.625e+02, percent-clipped=3.0 2023-06-21 12:40:05,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=966318.0, ans=0.125 2023-06-21 12:41:10,951 INFO [train.py:996] (1/4) Epoch 6, batch 8600, loss[loss=0.2823, simple_loss=0.3539, pruned_loss=0.1054, over 21585.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3237, pruned_loss=0.08739, over 4263346.60 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:41:41,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=966498.0, ans=0.125 2023-06-21 12:41:59,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=966558.0, ans=0.2 2023-06-21 12:42:52,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=966678.0, ans=0.0 2023-06-21 12:42:55,494 INFO [train.py:996] (1/4) Epoch 6, batch 8650, loss[loss=0.1977, simple_loss=0.2971, pruned_loss=0.04909, over 21846.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3293, pruned_loss=0.08991, over 4265174.59 frames. ], batch size: 316, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:42:56,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-21 12:43:01,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966738.0, ans=0.1 2023-06-21 12:43:22,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=966798.0, ans=0.125 2023-06-21 12:43:23,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.940e+02 3.541e+02 4.015e+02 7.663e+02, threshold=7.081e+02, percent-clipped=3.0 2023-06-21 12:43:25,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=966858.0, ans=0.0 2023-06-21 12:43:50,653 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:44:28,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=967038.0, ans=0.125 2023-06-21 12:44:29,613 INFO [train.py:996] (1/4) Epoch 6, batch 8700, loss[loss=0.2298, simple_loss=0.2918, pruned_loss=0.08393, over 21619.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3209, pruned_loss=0.08623, over 4261135.57 frames. 
], batch size: 332, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:45:08,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=967158.0, ans=0.0 2023-06-21 12:45:10,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=967158.0, ans=0.0 2023-06-21 12:45:19,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=967218.0, ans=0.95 2023-06-21 12:45:22,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-21 12:45:32,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967218.0, ans=0.1 2023-06-21 12:45:49,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=967278.0, ans=0.2 2023-06-21 12:46:04,004 INFO [train.py:996] (1/4) Epoch 6, batch 8750, loss[loss=0.246, simple_loss=0.3148, pruned_loss=0.08864, over 21889.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.318, pruned_loss=0.08724, over 4270653.93 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:46:17,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=967338.0, ans=0.125 2023-06-21 12:46:34,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.061e+02 3.811e+02 4.792e+02 9.884e+02, threshold=7.621e+02, percent-clipped=4.0 2023-06-21 12:47:42,809 INFO [train.py:996] (1/4) Epoch 6, batch 8800, loss[loss=0.2873, simple_loss=0.3562, pruned_loss=0.1092, over 21512.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3276, pruned_loss=0.09087, over 4267320.79 frames. ], batch size: 211, lr: 5.20e-03, grad_scale: 32.0 2023-06-21 12:48:53,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=967818.0, ans=0.125 2023-06-21 12:49:18,417 INFO [train.py:996] (1/4) Epoch 6, batch 8850, loss[loss=0.2158, simple_loss=0.3025, pruned_loss=0.06455, over 21700.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3345, pruned_loss=0.09259, over 4274523.68 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:49:48,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.854e+02 3.378e+02 4.143e+02 7.151e+02, threshold=6.757e+02, percent-clipped=0.0 2023-06-21 12:50:54,516 INFO [train.py:996] (1/4) Epoch 6, batch 8900, loss[loss=0.2293, simple_loss=0.2871, pruned_loss=0.08577, over 21260.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3288, pruned_loss=0.09108, over 4271656.75 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:51:15,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=968298.0, ans=0.125 2023-06-21 12:51:57,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968418.0, ans=0.1 2023-06-21 12:52:27,951 INFO [train.py:996] (1/4) Epoch 6, batch 8950, loss[loss=0.2569, simple_loss=0.3269, pruned_loss=0.09346, over 21659.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3273, pruned_loss=0.09019, over 4273493.68 frames. 
], batch size: 298, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:53:10,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=12.0 2023-06-21 12:53:12,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.104e+02 3.637e+02 4.159e+02 7.258e+02, threshold=7.275e+02, percent-clipped=2.0 2023-06-21 12:53:20,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=968658.0, ans=0.2 2023-06-21 12:54:02,941 INFO [train.py:996] (1/4) Epoch 6, batch 9000, loss[loss=0.331, simple_loss=0.4256, pruned_loss=0.1183, over 20743.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3234, pruned_loss=0.09051, over 4268906.34 frames. ], batch size: 607, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:54:02,941 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 12:54:25,119 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3599, pruned_loss=0.08239, over 1796401.00 frames. 2023-06-21 12:54:25,120 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24415MB 2023-06-21 12:54:45,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=968898.0, ans=0.04949747468305833 2023-06-21 12:55:04,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=968958.0, ans=0.125 2023-06-21 12:55:17,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-21 12:55:18,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=969018.0, ans=0.0 2023-06-21 12:56:01,356 INFO [train.py:996] (1/4) Epoch 6, batch 9050, loss[loss=0.1447, simple_loss=0.2043, pruned_loss=0.04255, over 16345.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3164, pruned_loss=0.08542, over 4266186.18 frames. ], batch size: 63, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:56:01,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=969138.0, ans=0.125 2023-06-21 12:56:22,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=969138.0, ans=0.1 2023-06-21 12:56:37,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=969198.0, ans=0.2 2023-06-21 12:56:41,593 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.936e+02 3.440e+02 3.853e+02 8.730e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-21 12:56:52,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=969258.0, ans=0.125 2023-06-21 12:57:43,020 INFO [train.py:996] (1/4) Epoch 6, batch 9100, loss[loss=0.2783, simple_loss=0.362, pruned_loss=0.09735, over 21581.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3241, pruned_loss=0.08835, over 4267396.57 frames. 
], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:57:54,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=969438.0, ans=0.125 2023-06-21 12:58:10,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-21 12:58:26,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=969558.0, ans=0.125 2023-06-21 12:58:26,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=969558.0, ans=0.2 2023-06-21 12:59:18,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=969678.0, ans=0.1 2023-06-21 12:59:27,362 INFO [train.py:996] (1/4) Epoch 6, batch 9150, loss[loss=0.2235, simple_loss=0.3165, pruned_loss=0.06529, over 21796.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3257, pruned_loss=0.0847, over 4269731.60 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:59:40,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-21 12:59:57,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.788e+02 3.229e+02 4.446e+02 7.555e+02, threshold=6.457e+02, percent-clipped=3.0 2023-06-21 13:00:40,579 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:01:00,364 INFO [train.py:996] (1/4) Epoch 6, batch 9200, loss[loss=0.2825, simple_loss=0.3571, pruned_loss=0.1039, over 21793.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3259, pruned_loss=0.08305, over 4266713.08 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:01:36,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=970158.0, ans=0.035 2023-06-21 13:01:45,130 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:02:36,685 INFO [train.py:996] (1/4) Epoch 6, batch 9250, loss[loss=0.2785, simple_loss=0.349, pruned_loss=0.104, over 21796.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3299, pruned_loss=0.08642, over 4260065.02 frames. ], batch size: 124, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:02:55,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=970398.0, ans=0.0 2023-06-21 13:03:06,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=970458.0, ans=0.125 2023-06-21 13:03:07,705 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.070e+02 3.502e+02 4.094e+02 6.605e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-21 13:03:09,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=970458.0, ans=22.5 2023-06-21 13:03:43,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. 
limit=15.0 2023-06-21 13:04:13,724 INFO [train.py:996] (1/4) Epoch 6, batch 9300, loss[loss=0.3214, simple_loss=0.397, pruned_loss=0.1229, over 21388.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3241, pruned_loss=0.08645, over 4247804.21 frames. ], batch size: 507, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:04:14,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=970638.0, ans=0.125 2023-06-21 13:04:14,166 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:04:15,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=970638.0, ans=0.125 2023-06-21 13:05:06,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=970758.0, ans=0.125 2023-06-21 13:05:24,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=22.5 2023-06-21 13:05:50,435 INFO [train.py:996] (1/4) Epoch 6, batch 9350, loss[loss=0.2796, simple_loss=0.3582, pruned_loss=0.1005, over 21611.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3296, pruned_loss=0.0882, over 4245774.55 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:05:57,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=970938.0, ans=0.0 2023-06-21 13:06:07,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=970998.0, ans=0.0 2023-06-21 13:06:31,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.002e+02 3.519e+02 4.065e+02 7.578e+02, threshold=7.038e+02, percent-clipped=1.0 2023-06-21 13:07:26,211 INFO [train.py:996] (1/4) Epoch 6, batch 9400, loss[loss=0.2526, simple_loss=0.3174, pruned_loss=0.09389, over 21739.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3328, pruned_loss=0.08861, over 4252276.92 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:07:37,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-21 13:08:40,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=971478.0, ans=0.0 2023-06-21 13:08:51,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=971478.0, ans=0.125 2023-06-21 13:08:56,681 INFO [train.py:996] (1/4) Epoch 6, batch 9450, loss[loss=0.2137, simple_loss=0.2811, pruned_loss=0.07313, over 21800.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3235, pruned_loss=0.08693, over 4248522.44 frames. 
], batch size: 118, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:08:58,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=971538.0, ans=0.02 2023-06-21 13:09:41,566 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.152e+02 3.709e+02 4.839e+02 7.749e+02, threshold=7.417e+02, percent-clipped=1.0 2023-06-21 13:10:05,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=971718.0, ans=0.0 2023-06-21 13:10:17,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=971778.0, ans=0.125 2023-06-21 13:10:32,102 INFO [train.py:996] (1/4) Epoch 6, batch 9500, loss[loss=0.1858, simple_loss=0.2708, pruned_loss=0.05044, over 21581.00 frames. ], tot_loss[loss=0.244, simple_loss=0.317, pruned_loss=0.08545, over 4244454.46 frames. ], batch size: 263, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:11:42,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972018.0, ans=0.125 2023-06-21 13:12:01,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.72 vs. limit=22.5 2023-06-21 13:12:04,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-21 13:12:07,946 INFO [train.py:996] (1/4) Epoch 6, batch 9550, loss[loss=0.2453, simple_loss=0.3397, pruned_loss=0.0754, over 21765.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3218, pruned_loss=0.08835, over 4254548.74 frames. ], batch size: 247, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:12:31,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972198.0, ans=0.1 2023-06-21 13:12:46,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=972198.0, ans=0.0 2023-06-21 13:13:00,558 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.847e+02 3.325e+02 4.189e+02 8.114e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-21 13:13:28,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=972378.0, ans=0.5 2023-06-21 13:13:33,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972378.0, ans=0.1 2023-06-21 13:13:43,374 INFO [train.py:996] (1/4) Epoch 6, batch 9600, loss[loss=0.2449, simple_loss=0.3063, pruned_loss=0.09173, over 21287.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3239, pruned_loss=0.09047, over 4264871.82 frames. 
], batch size: 143, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:13:47,072 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:14:12,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=972498.0, ans=0.125 2023-06-21 13:14:37,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=972558.0, ans=0.125 2023-06-21 13:15:25,385 INFO [train.py:996] (1/4) Epoch 6, batch 9650, loss[loss=0.3465, simple_loss=0.3942, pruned_loss=0.1494, over 21453.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3245, pruned_loss=0.0903, over 4272662.46 frames. ], batch size: 471, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:16:12,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.974e+02 3.476e+02 4.202e+02 8.291e+02, threshold=6.952e+02, percent-clipped=2.0 2023-06-21 13:16:45,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=972978.0, ans=0.2 2023-06-21 13:16:57,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=972978.0, ans=0.125 2023-06-21 13:17:06,066 INFO [train.py:996] (1/4) Epoch 6, batch 9700, loss[loss=0.268, simple_loss=0.3411, pruned_loss=0.09741, over 21563.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3297, pruned_loss=0.09165, over 4274574.63 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:17:33,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973098.0, ans=0.125 2023-06-21 13:17:37,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.16 vs. limit=15.0 2023-06-21 13:17:38,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=973098.0, ans=0.0 2023-06-21 13:18:02,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973218.0, ans=0.1 2023-06-21 13:18:16,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=973278.0, ans=0.125 2023-06-21 13:18:35,241 INFO [train.py:996] (1/4) Epoch 6, batch 9750, loss[loss=0.2457, simple_loss=0.3493, pruned_loss=0.07103, over 20817.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3227, pruned_loss=0.08982, over 4274706.19 frames. 
], batch size: 607, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:19:13,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=973398.0, ans=0.125 2023-06-21 13:19:16,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=973398.0, ans=0.0 2023-06-21 13:19:23,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.880e+02 3.330e+02 4.100e+02 8.108e+02, threshold=6.660e+02, percent-clipped=1.0 2023-06-21 13:19:23,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=973458.0, ans=0.125 2023-06-21 13:19:31,650 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-21 13:19:53,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=973578.0, ans=0.125 2023-06-21 13:20:02,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=973578.0, ans=0.125 2023-06-21 13:20:09,969 INFO [train.py:996] (1/4) Epoch 6, batch 9800, loss[loss=0.2195, simple_loss=0.2989, pruned_loss=0.07006, over 16704.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3227, pruned_loss=0.08977, over 4253345.54 frames. ], batch size: 65, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:20:59,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-21 13:21:04,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=973758.0, ans=0.2 2023-06-21 13:21:22,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=973818.0, ans=0.0 2023-06-21 13:21:25,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=973878.0, ans=0.125 2023-06-21 13:21:37,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=973878.0, ans=0.0 2023-06-21 13:21:40,039 INFO [train.py:996] (1/4) Epoch 6, batch 9850, loss[loss=0.2373, simple_loss=0.2872, pruned_loss=0.09364, over 21394.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.318, pruned_loss=0.08938, over 4259984.88 frames. ], batch size: 473, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:22:15,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=973998.0, ans=0.0 2023-06-21 13:22:32,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.858e+02 3.118e+02 3.826e+02 5.863e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-21 13:22:32,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=974058.0, ans=0.0 2023-06-21 13:22:37,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-21 13:22:48,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-21 13:23:15,701 INFO [train.py:996] (1/4) Epoch 6, batch 9900, loss[loss=0.2689, simple_loss=0.328, pruned_loss=0.1049, over 21747.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3147, pruned_loss=0.08903, over 4235356.57 frames. ], batch size: 247, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:23:51,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-21 13:24:13,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=974358.0, ans=0.0 2023-06-21 13:24:21,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-21 13:24:22,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=974418.0, ans=0.125 2023-06-21 13:24:25,837 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:24:32,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-21 13:24:56,239 INFO [train.py:996] (1/4) Epoch 6, batch 9950, loss[loss=0.238, simple_loss=0.3032, pruned_loss=0.08636, over 21835.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3153, pruned_loss=0.09123, over 4243259.05 frames. ], batch size: 98, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:25:07,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=974538.0, ans=0.125 2023-06-21 13:25:39,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 2.967e+02 3.408e+02 4.209e+02 6.972e+02, threshold=6.817e+02, percent-clipped=1.0 2023-06-21 13:25:43,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-21 13:25:47,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974658.0, ans=0.1 2023-06-21 13:25:53,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=974718.0, ans=0.0 2023-06-21 13:26:03,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=974718.0, ans=0.125 2023-06-21 13:26:06,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2023-06-21 13:26:32,388 INFO [train.py:996] (1/4) Epoch 6, batch 10000, loss[loss=0.2559, simple_loss=0.3283, pruned_loss=0.09176, over 21965.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3096, pruned_loss=0.08941, over 4255142.18 frames. 
], batch size: 373, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:27:05,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=974898.0, ans=0.125 2023-06-21 13:27:57,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=975078.0, ans=0.125 2023-06-21 13:28:00,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 13:28:04,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=975078.0, ans=0.125 2023-06-21 13:28:06,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=975138.0, ans=0.05 2023-06-21 13:28:07,388 INFO [train.py:996] (1/4) Epoch 6, batch 10050, loss[loss=0.2001, simple_loss=0.2876, pruned_loss=0.05635, over 21870.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3117, pruned_loss=0.08888, over 4264578.49 frames. ], batch size: 372, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:28:25,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=975138.0, ans=0.2 2023-06-21 13:28:42,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=975198.0, ans=0.125 2023-06-21 13:28:51,172 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.710e+02 3.231e+02 4.212e+02 7.416e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-21 13:28:53,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=975258.0, ans=0.125 2023-06-21 13:28:57,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=975258.0, ans=0.125 2023-06-21 13:29:00,872 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:29:27,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=975378.0, ans=0.125 2023-06-21 13:29:29,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-21 13:29:30,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=975378.0, ans=0.0 2023-06-21 13:29:53,556 INFO [train.py:996] (1/4) Epoch 6, batch 10100, loss[loss=0.2042, simple_loss=0.2613, pruned_loss=0.07355, over 21353.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3091, pruned_loss=0.08654, over 4266639.10 frames. ], batch size: 159, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:30:17,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=975498.0, ans=0.125 2023-06-21 13:30:32,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=975558.0, ans=0.0 2023-06-21 13:30:34,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. 
limit=6.0 2023-06-21 13:31:29,493 INFO [train.py:996] (1/4) Epoch 6, batch 10150, loss[loss=0.2267, simple_loss=0.2987, pruned_loss=0.07729, over 21658.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3147, pruned_loss=0.0895, over 4267463.26 frames. ], batch size: 247, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:31:51,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 13:32:02,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=975858.0, ans=6.0 2023-06-21 13:32:04,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.136e+02 3.616e+02 4.302e+02 7.230e+02, threshold=7.231e+02, percent-clipped=1.0 2023-06-21 13:33:04,719 INFO [train.py:996] (1/4) Epoch 6, batch 10200, loss[loss=0.2697, simple_loss=0.3236, pruned_loss=0.1079, over 21880.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3162, pruned_loss=0.08893, over 4267486.43 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:33:05,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-21 13:33:19,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=976098.0, ans=0.035 2023-06-21 13:33:36,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-21 13:34:01,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-21 13:34:40,824 INFO [train.py:996] (1/4) Epoch 6, batch 10250, loss[loss=0.1827, simple_loss=0.2658, pruned_loss=0.04983, over 21542.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3122, pruned_loss=0.0827, over 4272267.95 frames. ], batch size: 195, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:34:55,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=15.0 2023-06-21 13:35:21,122 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.417e+02 2.778e+02 3.535e+02 6.658e+02, threshold=5.557e+02, percent-clipped=0.0 2023-06-21 13:35:21,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=976458.0, ans=0.2 2023-06-21 13:35:28,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 13:35:29,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=976458.0, ans=0.125 2023-06-21 13:35:31,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976458.0, ans=0.1 2023-06-21 13:36:18,477 INFO [train.py:996] (1/4) Epoch 6, batch 10300, loss[loss=0.2337, simple_loss=0.3365, pruned_loss=0.06542, over 21831.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3143, pruned_loss=0.08353, over 4265659.11 frames. 
], batch size: 282, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:37:25,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=976818.0, ans=0.0 2023-06-21 13:37:30,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=15.0 2023-06-21 13:37:51,333 INFO [train.py:996] (1/4) Epoch 6, batch 10350, loss[loss=0.1527, simple_loss=0.1987, pruned_loss=0.05338, over 21679.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3158, pruned_loss=0.0832, over 4259268.34 frames. ], batch size: 112, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:38:05,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=976998.0, ans=0.0 2023-06-21 13:38:41,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.013e+02 3.424e+02 4.062e+02 6.181e+02, threshold=6.848e+02, percent-clipped=5.0 2023-06-21 13:39:20,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-21 13:39:27,359 INFO [train.py:996] (1/4) Epoch 6, batch 10400, loss[loss=0.1651, simple_loss=0.2129, pruned_loss=0.05862, over 21727.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3117, pruned_loss=0.0829, over 4262588.50 frames. ], batch size: 124, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:39:41,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=977238.0, ans=0.05 2023-06-21 13:40:19,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=977358.0, ans=0.1 2023-06-21 13:40:35,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977418.0, ans=0.1 2023-06-21 13:41:09,998 INFO [train.py:996] (1/4) Epoch 6, batch 10450, loss[loss=0.2366, simple_loss=0.3147, pruned_loss=0.0793, over 21429.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3173, pruned_loss=0.08633, over 4263848.06 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:41:14,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=977538.0, ans=0.0 2023-06-21 13:41:26,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-21 13:42:01,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.285e+02 3.743e+02 4.607e+02 9.328e+02, threshold=7.486e+02, percent-clipped=7.0 2023-06-21 13:42:09,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=977658.0, ans=0.0 2023-06-21 13:42:52,036 INFO [train.py:996] (1/4) Epoch 6, batch 10500, loss[loss=0.2315, simple_loss=0.2829, pruned_loss=0.09005, over 21242.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3158, pruned_loss=0.08484, over 4272216.42 frames. ], batch size: 159, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:43:10,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.47 vs. 
limit=15.0 2023-06-21 13:43:28,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-21 13:43:51,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=978018.0, ans=0.0 2023-06-21 13:44:27,335 INFO [train.py:996] (1/4) Epoch 6, batch 10550, loss[loss=0.2145, simple_loss=0.2822, pruned_loss=0.07338, over 21795.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.31, pruned_loss=0.0841, over 4259911.59 frames. ], batch size: 317, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:45:12,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.729e+02 3.054e+02 3.524e+02 6.998e+02, threshold=6.108e+02, percent-clipped=0.0 2023-06-21 13:45:56,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-21 13:46:03,786 INFO [train.py:996] (1/4) Epoch 6, batch 10600, loss[loss=0.2507, simple_loss=0.341, pruned_loss=0.08021, over 19707.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3063, pruned_loss=0.08315, over 4256224.84 frames. ], batch size: 702, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:46:07,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-21 13:46:19,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-21 13:46:52,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=978558.0, ans=0.0 2023-06-21 13:47:00,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=978618.0, ans=0.1 2023-06-21 13:47:44,897 INFO [train.py:996] (1/4) Epoch 6, batch 10650, loss[loss=0.1888, simple_loss=0.2733, pruned_loss=0.05211, over 21752.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3095, pruned_loss=0.08166, over 4266164.32 frames. ], batch size: 351, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:48:26,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.894e+02 3.773e+02 4.928e+02 8.046e+02, threshold=7.546e+02, percent-clipped=12.0 2023-06-21 13:49:09,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=978978.0, ans=0.125 2023-06-21 13:49:22,177 INFO [train.py:996] (1/4) Epoch 6, batch 10700, loss[loss=0.2053, simple_loss=0.2668, pruned_loss=0.07192, over 21362.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3083, pruned_loss=0.08154, over 4262490.53 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:49:33,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-21 13:49:44,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=979098.0, ans=0.07 2023-06-21 13:51:05,748 INFO [train.py:996] (1/4) Epoch 6, batch 10750, loss[loss=0.2711, simple_loss=0.363, pruned_loss=0.08958, over 21757.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3199, pruned_loss=0.08708, over 4268574.66 frames. 
], batch size: 298, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:51:33,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979398.0, ans=0.125 2023-06-21 13:51:35,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=8.0 2023-06-21 13:51:40,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=979458.0, ans=0.125 2023-06-21 13:51:41,069 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:51:42,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.198e+02 3.607e+02 4.478e+02 7.932e+02, threshold=7.214e+02, percent-clipped=1.0 2023-06-21 13:51:43,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-21 13:52:01,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=979518.0, ans=0.125 2023-06-21 13:52:21,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=979578.0, ans=0.2 2023-06-21 13:52:23,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=979578.0, ans=0.0 2023-06-21 13:52:43,754 INFO [train.py:996] (1/4) Epoch 6, batch 10800, loss[loss=0.2765, simple_loss=0.3517, pruned_loss=0.1006, over 21847.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3256, pruned_loss=0.08852, over 4275863.74 frames. ], batch size: 124, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:52:44,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=979638.0, ans=0.0 2023-06-21 13:53:18,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979758.0, ans=0.0 2023-06-21 13:53:30,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=979758.0, ans=0.125 2023-06-21 13:53:40,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-21 13:54:14,989 INFO [train.py:996] (1/4) Epoch 6, batch 10850, loss[loss=0.2062, simple_loss=0.2777, pruned_loss=0.06735, over 21658.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3252, pruned_loss=0.08828, over 4272364.98 frames. 
], batch size: 247, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:54:33,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=979998.0, ans=0.0 2023-06-21 13:54:33,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=979998.0, ans=0.125 2023-06-21 13:55:03,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=980058.0, ans=0.125 2023-06-21 13:55:06,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 2.788e+02 3.255e+02 3.917e+02 5.822e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 13:55:47,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=980178.0, ans=0.04949747468305833 2023-06-21 13:55:51,255 INFO [train.py:996] (1/4) Epoch 6, batch 10900, loss[loss=0.2213, simple_loss=0.2985, pruned_loss=0.07211, over 21283.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3169, pruned_loss=0.08564, over 4269104.87 frames. ], batch size: 176, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:56:38,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=980358.0, ans=0.125 2023-06-21 13:57:25,448 INFO [train.py:996] (1/4) Epoch 6, batch 10950, loss[loss=0.2387, simple_loss=0.3039, pruned_loss=0.08678, over 21581.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.313, pruned_loss=0.08414, over 4261605.45 frames. ], batch size: 263, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 13:57:44,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-21 13:58:13,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-06-21 13:58:15,826 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.774e+02 3.261e+02 3.678e+02 5.101e+02, threshold=6.522e+02, percent-clipped=0.0 2023-06-21 13:58:29,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=980718.0, ans=0.125 2023-06-21 13:58:59,448 INFO [train.py:996] (1/4) Epoch 6, batch 11000, loss[loss=0.2633, simple_loss=0.3192, pruned_loss=0.1037, over 21638.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3124, pruned_loss=0.08534, over 4270260.92 frames. 
], batch size: 263, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 13:59:18,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=980898.0, ans=0.09899494936611666 2023-06-21 13:59:51,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=980958.0, ans=0.0 2023-06-21 13:59:54,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=980958.0, ans=0.07 2023-06-21 13:59:57,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=980958.0, ans=0.125 2023-06-21 14:00:06,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=981018.0, ans=0.125 2023-06-21 14:00:11,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=981018.0, ans=0.125 2023-06-21 14:00:36,195 INFO [train.py:996] (1/4) Epoch 6, batch 11050, loss[loss=0.24, simple_loss=0.2995, pruned_loss=0.09021, over 21626.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3125, pruned_loss=0.08736, over 4268589.50 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:00:40,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-21 14:01:28,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.910e+02 3.183e+02 3.721e+02 5.949e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 14:01:42,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=981318.0, ans=0.0 2023-06-21 14:01:45,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=981318.0, ans=0.025 2023-06-21 14:02:10,618 INFO [train.py:996] (1/4) Epoch 6, batch 11100, loss[loss=0.2455, simple_loss=0.3006, pruned_loss=0.09521, over 21208.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3115, pruned_loss=0.08775, over 4267977.23 frames. ], batch size: 144, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:02:33,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-21 14:03:01,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-21 14:03:32,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-21 14:03:48,514 INFO [train.py:996] (1/4) Epoch 6, batch 11150, loss[loss=0.2954, simple_loss=0.3709, pruned_loss=0.1099, over 21404.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3084, pruned_loss=0.08684, over 4273840.74 frames. ], batch size: 507, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:03:51,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.75 vs. 
limit=22.5 2023-06-21 14:04:01,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=981738.0, ans=0.2 2023-06-21 14:04:37,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=981858.0, ans=0.125 2023-06-21 14:04:41,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.660e+02 3.131e+02 3.688e+02 5.663e+02, threshold=6.262e+02, percent-clipped=0.0 2023-06-21 14:04:55,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=981918.0, ans=0.2 2023-06-21 14:05:08,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=981978.0, ans=0.125 2023-06-21 14:05:23,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=12.0 2023-06-21 14:05:24,955 INFO [train.py:996] (1/4) Epoch 6, batch 11200, loss[loss=0.2129, simple_loss=0.2782, pruned_loss=0.07381, over 21832.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3065, pruned_loss=0.08623, over 4264572.29 frames. ], batch size: 372, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 14:06:18,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=982158.0, ans=0.125 2023-06-21 14:06:21,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=982218.0, ans=0.0 2023-06-21 14:06:25,661 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:06:27,193 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:06:57,263 INFO [train.py:996] (1/4) Epoch 6, batch 11250, loss[loss=0.2263, simple_loss=0.3018, pruned_loss=0.07535, over 21858.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.306, pruned_loss=0.08642, over 4259476.44 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:07:46,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.620e+02 2.906e+02 3.338e+02 5.205e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 14:08:24,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=982578.0, ans=0.125 2023-06-21 14:08:26,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-21 14:08:28,287 INFO [train.py:996] (1/4) Epoch 6, batch 11300, loss[loss=0.2137, simple_loss=0.2887, pruned_loss=0.0694, over 21806.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3058, pruned_loss=0.08534, over 4257850.37 frames. ], batch size: 282, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:08:29,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. 
limit=6.0 2023-06-21 14:08:30,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982638.0, ans=0.1 2023-06-21 14:09:02,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=982698.0, ans=0.125 2023-06-21 14:09:43,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-21 14:09:53,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=982878.0, ans=0.0 2023-06-21 14:10:03,369 INFO [train.py:996] (1/4) Epoch 6, batch 11350, loss[loss=0.2569, simple_loss=0.3217, pruned_loss=0.09611, over 21258.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3082, pruned_loss=0.08486, over 4267511.63 frames. ], batch size: 159, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:10:20,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=982938.0, ans=0.04949747468305833 2023-06-21 14:10:32,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982998.0, ans=0.1 2023-06-21 14:10:40,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=982998.0, ans=0.125 2023-06-21 14:10:44,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=983058.0, ans=0.0 2023-06-21 14:10:45,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=983058.0, ans=10.0 2023-06-21 14:10:53,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.788e+02 3.178e+02 3.739e+02 7.652e+02, threshold=6.355e+02, percent-clipped=2.0 2023-06-21 14:11:28,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=983178.0, ans=0.0 2023-06-21 14:11:30,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-21 14:11:35,961 INFO [train.py:996] (1/4) Epoch 6, batch 11400, loss[loss=0.265, simple_loss=0.3319, pruned_loss=0.09907, over 19818.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3134, pruned_loss=0.08765, over 4265708.91 frames. ], batch size: 702, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:12:06,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=15.0 2023-06-21 14:12:28,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=983358.0, ans=0.125 2023-06-21 14:12:28,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=983358.0, ans=0.2 2023-06-21 14:12:34,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=983418.0, ans=0.1 2023-06-21 14:12:51,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983418.0, ans=0.1 2023-06-21 14:13:18,812 INFO [train.py:996] (1/4) Epoch 6, batch 11450, loss[loss=0.2304, simple_loss=0.2933, pruned_loss=0.08374, over 20203.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3141, pruned_loss=0.0862, over 4267743.68 frames. ], batch size: 707, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:13:43,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=983598.0, ans=0.125 2023-06-21 14:13:52,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=983658.0, ans=0.0 2023-06-21 14:14:03,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.362e+02 2.838e+02 3.448e+02 4.254e+02 7.137e+02, threshold=6.896e+02, percent-clipped=4.0 2023-06-21 14:14:14,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=983718.0, ans=0.125 2023-06-21 14:14:32,521 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-21 14:14:36,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=983778.0, ans=0.125 2023-06-21 14:14:46,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=983778.0, ans=0.125 2023-06-21 14:14:55,038 INFO [train.py:996] (1/4) Epoch 6, batch 11500, loss[loss=0.221, simple_loss=0.3074, pruned_loss=0.06736, over 21470.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3175, pruned_loss=0.08754, over 4269295.06 frames. ], batch size: 211, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:15:09,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=983838.0, ans=0.0 2023-06-21 14:15:56,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=984018.0, ans=0.05 2023-06-21 14:16:37,022 INFO [train.py:996] (1/4) Epoch 6, batch 11550, loss[loss=0.3585, simple_loss=0.4654, pruned_loss=0.1258, over 21209.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3228, pruned_loss=0.08728, over 4273047.55 frames. 
], batch size: 548, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:16:45,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=984138.0, ans=0.125 2023-06-21 14:17:23,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.955e+02 3.350e+02 4.139e+02 7.597e+02, threshold=6.701e+02, percent-clipped=2.0 2023-06-21 14:18:09,013 INFO [train.py:996] (1/4) Epoch 6, batch 11600, loss[loss=0.2821, simple_loss=0.3612, pruned_loss=0.1015, over 21318.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3375, pruned_loss=0.08929, over 4268125.01 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:19:35,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=984678.0, ans=0.125 2023-06-21 14:19:45,203 INFO [train.py:996] (1/4) Epoch 6, batch 11650, loss[loss=0.2648, simple_loss=0.3491, pruned_loss=0.09023, over 21742.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3436, pruned_loss=0.08964, over 4265253.08 frames. ], batch size: 351, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:20:07,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=984798.0, ans=0.125 2023-06-21 14:20:09,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-21 14:20:28,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=984858.0, ans=0.125 2023-06-21 14:20:29,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 3.000e+02 3.616e+02 4.303e+02 7.688e+02, threshold=7.232e+02, percent-clipped=3.0 2023-06-21 14:21:15,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=984978.0, ans=0.125 2023-06-21 14:21:21,188 INFO [train.py:996] (1/4) Epoch 6, batch 11700, loss[loss=0.2491, simple_loss=0.2965, pruned_loss=0.1009, over 21496.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3344, pruned_loss=0.08919, over 4258794.50 frames. ], batch size: 195, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:21:51,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=985098.0, ans=0.2 2023-06-21 14:21:54,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=985098.0, ans=0.125 2023-06-21 14:22:00,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-21 14:22:56,809 INFO [train.py:996] (1/4) Epoch 6, batch 11750, loss[loss=0.2766, simple_loss=0.3341, pruned_loss=0.1095, over 21900.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3252, pruned_loss=0.0887, over 4261514.27 frames. 
], batch size: 317, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:23:10,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=985338.0, ans=0.07 2023-06-21 14:23:57,408 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.950e+02 3.559e+02 4.361e+02 6.685e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 14:24:33,551 INFO [train.py:996] (1/4) Epoch 6, batch 11800, loss[loss=0.2509, simple_loss=0.3126, pruned_loss=0.09458, over 21323.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3277, pruned_loss=0.09132, over 4272390.50 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:24:46,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=985638.0, ans=0.07 2023-06-21 14:24:48,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-21 14:26:14,999 INFO [train.py:996] (1/4) Epoch 6, batch 11850, loss[loss=0.2587, simple_loss=0.3356, pruned_loss=0.0909, over 21869.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3293, pruned_loss=0.09021, over 4282027.75 frames. ], batch size: 107, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:26:53,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=986058.0, ans=0.125 2023-06-21 14:26:53,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986058.0, ans=0.125 2023-06-21 14:27:09,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=986058.0, ans=0.0 2023-06-21 14:27:10,025 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.768e+02 3.132e+02 3.956e+02 6.532e+02, threshold=6.263e+02, percent-clipped=0.0 2023-06-21 14:27:12,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=986058.0, ans=0.0 2023-06-21 14:27:33,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=986178.0, ans=0.2 2023-06-21 14:27:37,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=986178.0, ans=0.2 2023-06-21 14:27:39,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=986178.0, ans=0.125 2023-06-21 14:27:50,721 INFO [train.py:996] (1/4) Epoch 6, batch 11900, loss[loss=0.2736, simple_loss=0.3558, pruned_loss=0.09573, over 21573.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3298, pruned_loss=0.08825, over 4279965.16 frames. ], batch size: 441, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:28:14,867 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:28:45,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.52 vs. 
limit=15.0 2023-06-21 14:28:52,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=986418.0, ans=0.125 2023-06-21 14:29:21,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=986478.0, ans=0.125 2023-06-21 14:29:26,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=986538.0, ans=15.0 2023-06-21 14:29:27,037 INFO [train.py:996] (1/4) Epoch 6, batch 11950, loss[loss=0.2162, simple_loss=0.3083, pruned_loss=0.06204, over 21809.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3303, pruned_loss=0.08452, over 4271660.02 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:30:23,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.615e+02 3.254e+02 4.193e+02 8.163e+02, threshold=6.508e+02, percent-clipped=5.0 2023-06-21 14:31:01,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=986778.0, ans=0.125 2023-06-21 14:31:03,744 INFO [train.py:996] (1/4) Epoch 6, batch 12000, loss[loss=0.205, simple_loss=0.2649, pruned_loss=0.07252, over 21349.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3253, pruned_loss=0.0833, over 4270877.05 frames. ], batch size: 551, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:31:03,745 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 14:31:23,355 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2642, simple_loss=0.3586, pruned_loss=0.08492, over 1796401.00 frames. 2023-06-21 14:31:23,356 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 14:32:09,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=986958.0, ans=0.2 2023-06-21 14:33:01,527 INFO [train.py:996] (1/4) Epoch 6, batch 12050, loss[loss=0.2573, simple_loss=0.3086, pruned_loss=0.103, over 21409.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3215, pruned_loss=0.08545, over 4276848.15 frames. ], batch size: 177, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:33:53,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.010e+02 3.446e+02 4.017e+02 8.146e+02, threshold=6.892e+02, percent-clipped=4.0 2023-06-21 14:34:06,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-21 14:34:10,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=987318.0, ans=0.0 2023-06-21 14:34:31,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=987378.0, ans=0.2 2023-06-21 14:34:33,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-21 14:34:43,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=987438.0, ans=15.0 2023-06-21 14:34:43,741 INFO [train.py:996] (1/4) Epoch 6, batch 12100, loss[loss=0.2744, simple_loss=0.3491, pruned_loss=0.09988, over 21875.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3266, pruned_loss=0.09021, over 4279243.81 frames. 
], batch size: 316, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:35:11,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=987498.0, ans=0.0 2023-06-21 14:35:16,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=987498.0, ans=0.2 2023-06-21 14:35:16,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=987498.0, ans=0.0 2023-06-21 14:35:33,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=987558.0, ans=0.2 2023-06-21 14:35:39,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-21 14:36:16,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=987678.0, ans=0.125 2023-06-21 14:36:27,298 INFO [train.py:996] (1/4) Epoch 6, batch 12150, loss[loss=0.2356, simple_loss=0.3286, pruned_loss=0.0713, over 21645.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3301, pruned_loss=0.09016, over 4278308.21 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:36:46,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=987798.0, ans=0.0 2023-06-21 14:37:19,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.020e+02 3.614e+02 4.015e+02 8.551e+02, threshold=7.228e+02, percent-clipped=5.0 2023-06-21 14:37:45,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=987978.0, ans=0.125 2023-06-21 14:37:45,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=987978.0, ans=0.125 2023-06-21 14:38:01,606 INFO [train.py:996] (1/4) Epoch 6, batch 12200, loss[loss=0.2093, simple_loss=0.2772, pruned_loss=0.07075, over 21835.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3275, pruned_loss=0.0893, over 4273793.79 frames. ], batch size: 318, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:39:18,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=988218.0, ans=0.0 2023-06-21 14:39:18,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=988218.0, ans=0.2 2023-06-21 14:39:22,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=988278.0, ans=0.1 2023-06-21 14:39:36,219 INFO [train.py:996] (1/4) Epoch 6, batch 12250, loss[loss=0.1649, simple_loss=0.2411, pruned_loss=0.04432, over 21396.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3178, pruned_loss=0.08526, over 4267416.03 frames. 
], batch size: 131, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:40:22,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 2.454e+02 2.966e+02 3.954e+02 7.953e+02, threshold=5.931e+02, percent-clipped=3.0 2023-06-21 14:40:56,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=988578.0, ans=0.2 2023-06-21 14:41:09,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=988638.0, ans=0.1 2023-06-21 14:41:10,041 INFO [train.py:996] (1/4) Epoch 6, batch 12300, loss[loss=0.2886, simple_loss=0.3751, pruned_loss=0.101, over 21541.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3084, pruned_loss=0.07888, over 4270499.44 frames. ], batch size: 471, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:41:19,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=988638.0, ans=0.125 2023-06-21 14:41:27,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-21 14:41:48,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=988758.0, ans=0.125 2023-06-21 14:41:59,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=988818.0, ans=0.2 2023-06-21 14:42:39,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=988878.0, ans=0.0 2023-06-21 14:42:40,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=988878.0, ans=0.04949747468305833 2023-06-21 14:42:44,766 INFO [train.py:996] (1/4) Epoch 6, batch 12350, loss[loss=0.2732, simple_loss=0.3345, pruned_loss=0.1059, over 21223.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3125, pruned_loss=0.07904, over 4275171.44 frames. ], batch size: 143, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:43:02,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-21 14:43:36,231 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.741e+02 3.281e+02 4.325e+02 6.278e+02, threshold=6.562e+02, percent-clipped=1.0 2023-06-21 14:44:18,042 INFO [train.py:996] (1/4) Epoch 6, batch 12400, loss[loss=0.2572, simple_loss=0.3153, pruned_loss=0.0995, over 21823.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3158, pruned_loss=0.08412, over 4282796.50 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:44:30,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=989238.0, ans=0.0 2023-06-21 14:45:32,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=989418.0, ans=0.0 2023-06-21 14:45:36,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=989478.0, ans=0.0 2023-06-21 14:45:52,839 INFO [train.py:996] (1/4) Epoch 6, batch 12450, loss[loss=0.3314, simple_loss=0.3841, pruned_loss=0.1394, over 21459.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3199, pruned_loss=0.08763, over 4287672.53 frames. 
], batch size: 471, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:45:57,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=989538.0, ans=0.0 2023-06-21 14:46:55,548 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.846e+02 3.218e+02 3.959e+02 6.466e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 14:47:16,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=989778.0, ans=0.1 2023-06-21 14:47:35,191 INFO [train.py:996] (1/4) Epoch 6, batch 12500, loss[loss=0.29, simple_loss=0.3719, pruned_loss=0.1041, over 21431.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3308, pruned_loss=0.08925, over 4284091.69 frames. ], batch size: 211, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:47:48,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=989838.0, ans=0.125 2023-06-21 14:48:35,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=990018.0, ans=0.0 2023-06-21 14:49:11,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-21 14:49:19,716 INFO [train.py:996] (1/4) Epoch 6, batch 12550, loss[loss=0.2341, simple_loss=0.3241, pruned_loss=0.07204, over 21794.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3338, pruned_loss=0.09093, over 4286119.87 frames. ], batch size: 282, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:49:21,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=990138.0, ans=0.1 2023-06-21 14:49:55,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=990198.0, ans=0.0 2023-06-21 14:49:57,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=990198.0, ans=0.125 2023-06-21 14:49:59,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5 2023-06-21 14:50:02,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-21 14:50:04,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-21 14:50:11,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=990258.0, ans=0.125 2023-06-21 14:50:12,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 2.946e+02 3.555e+02 3.995e+02 6.725e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-21 14:50:16,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=990318.0, ans=0.125 2023-06-21 14:50:55,495 INFO [train.py:996] (1/4) Epoch 6, batch 12600, loss[loss=0.2263, simple_loss=0.3029, pruned_loss=0.07488, over 21527.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3341, pruned_loss=0.08947, over 4275164.74 frames. 
], batch size: 195, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:51:10,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=990438.0, ans=0.125 2023-06-21 14:51:32,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=990558.0, ans=0.025 2023-06-21 14:52:05,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990678.0, ans=0.1 2023-06-21 14:52:17,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=990678.0, ans=0.125 2023-06-21 14:52:25,058 INFO [train.py:996] (1/4) Epoch 6, batch 12650, loss[loss=0.2877, simple_loss=0.3352, pruned_loss=0.1201, over 21778.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.326, pruned_loss=0.08638, over 4272513.37 frames. ], batch size: 508, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:52:54,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=990798.0, ans=0.2 2023-06-21 14:53:16,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.529e+02 3.007e+02 3.447e+02 6.549e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-21 14:53:34,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=990918.0, ans=0.04949747468305833 2023-06-21 14:53:34,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=990918.0, ans=0.125 2023-06-21 14:53:54,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-21 14:54:11,991 INFO [train.py:996] (1/4) Epoch 6, batch 12700, loss[loss=0.2565, simple_loss=0.3272, pruned_loss=0.09292, over 21489.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3264, pruned_loss=0.08938, over 4278236.73 frames. ], batch size: 194, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:55:20,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=991218.0, ans=0.2 2023-06-21 14:55:48,669 INFO [train.py:996] (1/4) Epoch 6, batch 12750, loss[loss=0.262, simple_loss=0.3319, pruned_loss=0.09607, over 20709.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.328, pruned_loss=0.09038, over 4277472.03 frames. 
], batch size: 607, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:56:04,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=991398.0, ans=0.125 2023-06-21 14:56:27,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=991458.0, ans=0.0 2023-06-21 14:56:36,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.913e+02 3.338e+02 4.032e+02 7.736e+02, threshold=6.676e+02, percent-clipped=3.0 2023-06-21 14:57:17,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991578.0, ans=0.125 2023-06-21 14:57:17,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=991578.0, ans=0.07 2023-06-21 14:57:24,144 INFO [train.py:996] (1/4) Epoch 6, batch 12800, loss[loss=0.2541, simple_loss=0.3274, pruned_loss=0.09037, over 21872.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3266, pruned_loss=0.09006, over 4277962.55 frames. ], batch size: 371, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:57:41,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=991698.0, ans=0.125 2023-06-21 14:57:43,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=991698.0, ans=0.05 2023-06-21 14:58:05,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991758.0, ans=0.1 2023-06-21 14:58:51,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-21 14:58:59,838 INFO [train.py:996] (1/4) Epoch 6, batch 12850, loss[loss=0.2686, simple_loss=0.3592, pruned_loss=0.089, over 21653.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3307, pruned_loss=0.09264, over 4271599.24 frames. ], batch size: 414, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:59:22,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-21 14:59:30,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=991998.0, ans=0.2 2023-06-21 14:59:31,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=991998.0, ans=0.0 2023-06-21 14:59:45,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=992058.0, ans=0.5 2023-06-21 14:59:52,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992058.0, ans=0.1 2023-06-21 14:59:53,148 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.801e+02 3.143e+02 3.622e+02 6.427e+02, threshold=6.286e+02, percent-clipped=0.0 2023-06-21 15:00:14,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. 
limit=15.0 2023-06-21 15:00:23,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=992178.0, ans=0.2 2023-06-21 15:00:36,423 INFO [train.py:996] (1/4) Epoch 6, batch 12900, loss[loss=0.2171, simple_loss=0.3076, pruned_loss=0.06327, over 21748.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3275, pruned_loss=0.08825, over 4272682.34 frames. ], batch size: 352, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:00:37,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-21 15:00:37,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-21 15:00:38,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-06-21 15:02:12,322 INFO [train.py:996] (1/4) Epoch 6, batch 12950, loss[loss=0.2286, simple_loss=0.3112, pruned_loss=0.07299, over 21868.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3234, pruned_loss=0.08526, over 4275295.41 frames. ], batch size: 372, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:02:14,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-21 15:02:41,094 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:03:15,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.940e+02 3.602e+02 4.409e+02 7.106e+02, threshold=7.204e+02, percent-clipped=2.0 2023-06-21 15:03:19,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=992718.0, ans=0.2 2023-06-21 15:03:23,743 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:03:46,794 INFO [train.py:996] (1/4) Epoch 6, batch 13000, loss[loss=0.2443, simple_loss=0.3195, pruned_loss=0.08455, over 21448.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3232, pruned_loss=0.0861, over 4281186.24 frames. ], batch size: 507, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:04:02,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=992838.0, ans=0.125 2023-06-21 15:04:18,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=992898.0, ans=0.2 2023-06-21 15:05:21,367 INFO [train.py:996] (1/4) Epoch 6, batch 13050, loss[loss=0.2239, simple_loss=0.2975, pruned_loss=0.07516, over 21774.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3185, pruned_loss=0.08376, over 4282399.42 frames. ], batch size: 247, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:05:50,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=993198.0, ans=0.2 2023-06-21 15:06:03,701 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-06-21 15:06:04,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=993258.0, ans=0.09899494936611666 2023-06-21 15:06:23,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.762e+02 3.169e+02 4.003e+02 6.766e+02, threshold=6.339e+02, percent-clipped=0.0 2023-06-21 15:06:47,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-21 15:07:00,667 INFO [train.py:996] (1/4) Epoch 6, batch 13100, loss[loss=0.2583, simple_loss=0.3342, pruned_loss=0.09116, over 21719.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3195, pruned_loss=0.08444, over 4292090.86 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:07:39,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-21 15:08:43,019 INFO [train.py:996] (1/4) Epoch 6, batch 13150, loss[loss=0.1924, simple_loss=0.2705, pruned_loss=0.05717, over 21626.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3257, pruned_loss=0.08829, over 4289222.67 frames. ], batch size: 247, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:09:11,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=993798.0, ans=0.125 2023-06-21 15:09:22,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=993858.0, ans=0.125 2023-06-21 15:09:36,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=993858.0, ans=0.0 2023-06-21 15:09:37,804 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.111e+02 3.957e+02 5.293e+02 1.278e+03, threshold=7.913e+02, percent-clipped=9.0 2023-06-21 15:10:27,532 INFO [train.py:996] (1/4) Epoch 6, batch 13200, loss[loss=0.2678, simple_loss=0.3304, pruned_loss=0.1026, over 21707.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3234, pruned_loss=0.08795, over 4283746.01 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:10:39,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=994038.0, ans=0.2 2023-06-21 15:10:47,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=994098.0, ans=0.0 2023-06-21 15:10:57,530 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-21 15:11:14,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994158.0, ans=0.1 2023-06-21 15:11:33,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=994218.0, ans=0.125 2023-06-21 15:11:43,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=994278.0, ans=0.2 2023-06-21 15:12:03,180 INFO [train.py:996] (1/4) Epoch 6, batch 13250, loss[loss=0.2437, simple_loss=0.2965, pruned_loss=0.09542, over 21554.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3231, pruned_loss=0.09057, over 4280491.74 frames. 
], batch size: 230, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:12:20,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-21 15:12:21,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=994398.0, ans=0.125 2023-06-21 15:12:51,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.898e+02 3.245e+02 3.819e+02 5.517e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-21 15:13:02,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=994518.0, ans=0.125 2023-06-21 15:13:33,150 INFO [train.py:996] (1/4) Epoch 6, batch 13300, loss[loss=0.2471, simple_loss=0.3332, pruned_loss=0.08046, over 21705.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3255, pruned_loss=0.08918, over 4283503.18 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:13:46,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=994638.0, ans=0.125 2023-06-21 15:13:54,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=994698.0, ans=22.5 2023-06-21 15:15:05,596 INFO [train.py:996] (1/4) Epoch 6, batch 13350, loss[loss=0.2943, simple_loss=0.3584, pruned_loss=0.1151, over 21822.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3301, pruned_loss=0.09113, over 4283157.05 frames. ], batch size: 118, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:15:27,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-21 15:15:57,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=995058.0, ans=0.125 2023-06-21 15:16:01,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.076e+02 3.846e+02 4.574e+02 8.350e+02, threshold=7.691e+02, percent-clipped=3.0 2023-06-21 15:16:09,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-21 15:16:21,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=995118.0, ans=0.125 2023-06-21 15:16:27,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-21 15:16:44,125 INFO [train.py:996] (1/4) Epoch 6, batch 13400, loss[loss=0.2784, simple_loss=0.3339, pruned_loss=0.1114, over 21454.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3313, pruned_loss=0.09307, over 4282320.93 frames. 
], batch size: 194, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:17:54,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=995418.0, ans=0.125 2023-06-21 15:17:56,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=995418.0, ans=0.125 2023-06-21 15:18:02,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995418.0, ans=0.1 2023-06-21 15:18:15,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995478.0, ans=0.1 2023-06-21 15:18:25,045 INFO [train.py:996] (1/4) Epoch 6, batch 13450, loss[loss=0.3233, simple_loss=0.3758, pruned_loss=0.1354, over 21365.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.335, pruned_loss=0.09703, over 4288712.78 frames. ], batch size: 471, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:19:04,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=995658.0, ans=0.2 2023-06-21 15:19:20,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=15.0 2023-06-21 15:19:29,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.161e+02 3.421e+02 3.980e+02 7.603e+02, threshold=6.841e+02, percent-clipped=0.0 2023-06-21 15:19:44,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=995778.0, ans=0.125 2023-06-21 15:20:06,004 INFO [train.py:996] (1/4) Epoch 6, batch 13500, loss[loss=0.19, simple_loss=0.2474, pruned_loss=0.06626, over 21279.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3245, pruned_loss=0.09339, over 4291711.38 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:20:23,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=995838.0, ans=0.125 2023-06-21 15:21:13,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.02 vs. limit=22.5 2023-06-21 15:21:43,570 INFO [train.py:996] (1/4) Epoch 6, batch 13550, loss[loss=0.2165, simple_loss=0.3068, pruned_loss=0.06313, over 21407.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3282, pruned_loss=0.09206, over 4284904.38 frames. 
], batch size: 131, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:21:44,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=996138.0, ans=0.0 2023-06-21 15:21:53,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=996138.0, ans=0.0 2023-06-21 15:21:58,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=996138.0, ans=0.125 2023-06-21 15:22:17,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=996198.0, ans=0.125 2023-06-21 15:22:25,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=996198.0, ans=0.125 2023-06-21 15:22:27,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=996258.0, ans=0.125 2023-06-21 15:22:44,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.030e+02 3.606e+02 4.387e+02 7.560e+02, threshold=7.212e+02, percent-clipped=4.0 2023-06-21 15:23:18,605 INFO [train.py:996] (1/4) Epoch 6, batch 13600, loss[loss=0.2193, simple_loss=0.2906, pruned_loss=0.07406, over 21290.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3312, pruned_loss=0.09336, over 4283686.88 frames. ], batch size: 159, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:23:27,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=996438.0, ans=0.02 2023-06-21 15:23:34,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=996438.0, ans=0.2 2023-06-21 15:23:50,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=996498.0, ans=0.125 2023-06-21 15:23:58,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=996498.0, ans=0.0 2023-06-21 15:24:01,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=996558.0, ans=0.125 2023-06-21 15:24:06,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996558.0, ans=0.1 2023-06-21 15:24:11,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.71 vs. limit=15.0 2023-06-21 15:24:23,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-21 15:24:49,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-21 15:24:58,351 INFO [train.py:996] (1/4) Epoch 6, batch 13650, loss[loss=0.1926, simple_loss=0.2518, pruned_loss=0.06672, over 21457.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3257, pruned_loss=0.08896, over 4277772.44 frames. 
], batch size: 212, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:25:19,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=996738.0, ans=0.2 2023-06-21 15:25:34,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=996798.0, ans=0.0 2023-06-21 15:25:51,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=996918.0, ans=0.125 2023-06-21 15:25:54,149 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.919e+02 3.475e+02 4.506e+02 7.169e+02, threshold=6.950e+02, percent-clipped=0.0 2023-06-21 15:26:18,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=996978.0, ans=0.125 2023-06-21 15:26:32,647 INFO [train.py:996] (1/4) Epoch 6, batch 13700, loss[loss=0.3471, simple_loss=0.4031, pruned_loss=0.1455, over 21492.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3182, pruned_loss=0.08738, over 4265737.13 frames. ], batch size: 508, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:26:57,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=997098.0, ans=0.2 2023-06-21 15:27:02,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-21 15:28:01,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=997278.0, ans=0.1 2023-06-21 15:28:15,469 INFO [train.py:996] (1/4) Epoch 6, batch 13750, loss[loss=0.291, simple_loss=0.3598, pruned_loss=0.111, over 21458.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.317, pruned_loss=0.08713, over 4263933.60 frames. ], batch size: 508, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:28:44,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997398.0, ans=0.125 2023-06-21 15:29:15,006 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.229e+02 4.012e+02 5.672e+02 9.491e+02, threshold=8.024e+02, percent-clipped=9.0 2023-06-21 15:29:54,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=997578.0, ans=0.125 2023-06-21 15:29:58,585 INFO [train.py:996] (1/4) Epoch 6, batch 13800, loss[loss=0.2745, simple_loss=0.3789, pruned_loss=0.08498, over 21884.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3219, pruned_loss=0.0857, over 4264938.72 frames. ], batch size: 317, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:30:13,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-21 15:30:22,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 15:30:29,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=22.5 2023-06-21 15:30:40,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=997758.0, ans=0.125 2023-06-21 15:30:57,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=997818.0, ans=0.2 2023-06-21 15:31:35,475 INFO [train.py:996] (1/4) Epoch 6, batch 13850, loss[loss=0.3104, simple_loss=0.383, pruned_loss=0.1188, over 21712.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.329, pruned_loss=0.08767, over 4265505.78 frames. ], batch size: 351, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:31:37,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-21 15:31:46,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=997938.0, ans=0.125 2023-06-21 15:32:02,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=997998.0, ans=0.125 2023-06-21 15:32:08,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997998.0, ans=0.125 2023-06-21 15:32:28,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=998058.0, ans=0.2 2023-06-21 15:32:43,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.953e+02 3.447e+02 4.211e+02 7.666e+02, threshold=6.893e+02, percent-clipped=0.0 2023-06-21 15:32:54,655 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:33:10,826 INFO [train.py:996] (1/4) Epoch 6, batch 13900, loss[loss=0.2419, simple_loss=0.3094, pruned_loss=0.08718, over 21481.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3318, pruned_loss=0.09028, over 4268685.62 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:34:02,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=998358.0, ans=0.125 2023-06-21 15:34:10,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=998418.0, ans=0.125 2023-06-21 15:34:25,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=998478.0, ans=0.09899494936611666 2023-06-21 15:34:34,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998478.0, ans=0.1 2023-06-21 15:34:39,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=998478.0, ans=0.125 2023-06-21 15:34:41,887 INFO [train.py:996] (1/4) Epoch 6, batch 13950, loss[loss=0.2345, simple_loss=0.302, pruned_loss=0.08353, over 21514.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3319, pruned_loss=0.0925, over 4280643.40 frames. ], batch size: 131, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:35:07,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.65 vs. 
limit=12.0 2023-06-21 15:35:43,664 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.088e+02 3.493e+02 4.359e+02 6.535e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-21 15:35:57,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=998778.0, ans=0.125 2023-06-21 15:36:10,579 INFO [train.py:996] (1/4) Epoch 6, batch 14000, loss[loss=0.2461, simple_loss=0.3225, pruned_loss=0.08483, over 21474.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3267, pruned_loss=0.09009, over 4272137.32 frames. ], batch size: 471, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:36:40,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=998898.0, ans=0.2 2023-06-21 15:36:49,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=998958.0, ans=0.125 2023-06-21 15:37:21,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=999018.0, ans=0.2 2023-06-21 15:37:32,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999078.0, ans=0.1 2023-06-21 15:37:41,084 INFO [train.py:996] (1/4) Epoch 6, batch 14050, loss[loss=0.2264, simple_loss=0.2849, pruned_loss=0.08395, over 21626.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3219, pruned_loss=0.08567, over 4274412.24 frames. ], batch size: 231, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:37:45,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=999138.0, ans=0.125 2023-06-21 15:37:56,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=999138.0, ans=0.1 2023-06-21 15:38:11,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=999198.0, ans=0.0 2023-06-21 15:38:13,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=999198.0, ans=0.125 2023-06-21 15:38:48,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.785e+02 3.183e+02 4.255e+02 6.746e+02, threshold=6.366e+02, percent-clipped=0.0 2023-06-21 15:39:16,525 INFO [train.py:996] (1/4) Epoch 6, batch 14100, loss[loss=0.2536, simple_loss=0.3353, pruned_loss=0.0859, over 20672.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3167, pruned_loss=0.08542, over 4262090.70 frames. ], batch size: 607, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:39:18,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=999438.0, ans=0.125 2023-06-21 15:40:31,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=999618.0, ans=0.125 2023-06-21 15:40:49,635 INFO [train.py:996] (1/4) Epoch 6, batch 14150, loss[loss=0.2248, simple_loss=0.3039, pruned_loss=0.07287, over 15875.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3202, pruned_loss=0.08694, over 4248565.69 frames. 
], batch size: 62, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:41:47,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.804e+02 3.332e+02 4.334e+02 8.014e+02, threshold=6.664e+02, percent-clipped=2.0 2023-06-21 15:41:53,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=999918.0, ans=0.0 2023-06-21 15:42:07,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=999978.0, ans=0.0 2023-06-21 15:42:15,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=999978.0, ans=0.0 2023-06-21 15:42:23,659 INFO [train.py:996] (1/4) Epoch 6, batch 14200, loss[loss=0.2328, simple_loss=0.301, pruned_loss=0.08229, over 21676.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3175, pruned_loss=0.08509, over 4247502.76 frames. ], batch size: 298, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:42:41,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1000098.0, ans=0.025 2023-06-21 15:43:53,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1000278.0, ans=0.0 2023-06-21 15:43:58,945 INFO [train.py:996] (1/4) Epoch 6, batch 14250, loss[loss=0.223, simple_loss=0.2957, pruned_loss=0.0751, over 21643.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3132, pruned_loss=0.08588, over 4257095.93 frames. ], batch size: 263, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:44:59,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.649e+02 3.041e+02 3.616e+02 7.648e+02, threshold=6.082e+02, percent-clipped=1.0 2023-06-21 15:44:59,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1000518.0, ans=0.125 2023-06-21 15:45:35,966 INFO [train.py:996] (1/4) Epoch 6, batch 14300, loss[loss=0.2257, simple_loss=0.3015, pruned_loss=0.07496, over 21162.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3182, pruned_loss=0.0871, over 4246971.24 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:45:38,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1000638.0, ans=22.5 2023-06-21 15:45:50,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1000698.0, ans=0.04949747468305833 2023-06-21 15:46:05,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1000698.0, ans=0.125 2023-06-21 15:46:26,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-21 15:46:53,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1000818.0, ans=0.2 2023-06-21 15:46:58,150 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:47:11,420 INFO [train.py:996] (1/4) Epoch 6, batch 14350, loss[loss=0.2325, simple_loss=0.3193, pruned_loss=0.07287, over 21415.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3237, pruned_loss=0.08877, over 4250292.58 frames. 
], batch size: 548, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:47:13,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-21 15:47:19,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1000938.0, ans=0.2 2023-06-21 15:47:57,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-21 15:48:18,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.231e+02 3.841e+02 4.769e+02 8.361e+02, threshold=7.683e+02, percent-clipped=10.0 2023-06-21 15:48:45,991 INFO [train.py:996] (1/4) Epoch 6, batch 14400, loss[loss=0.2417, simple_loss=0.3043, pruned_loss=0.08956, over 21807.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3232, pruned_loss=0.0888, over 4251505.83 frames. ], batch size: 441, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 15:49:16,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1001298.0, ans=0.2 2023-06-21 15:49:56,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001418.0, ans=0.1 2023-06-21 15:50:10,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1001478.0, ans=0.0 2023-06-21 15:50:20,540 INFO [train.py:996] (1/4) Epoch 6, batch 14450, loss[loss=0.2217, simple_loss=0.286, pruned_loss=0.07873, over 21450.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3173, pruned_loss=0.08838, over 4254635.15 frames. ], batch size: 389, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:50:34,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001598.0, ans=0.1 2023-06-21 15:51:29,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.916e+02 3.251e+02 4.168e+02 6.765e+02, threshold=6.503e+02, percent-clipped=0.0 2023-06-21 15:51:48,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1001778.0, ans=0.125 2023-06-21 15:51:55,519 INFO [train.py:996] (1/4) Epoch 6, batch 14500, loss[loss=0.2353, simple_loss=0.3128, pruned_loss=0.07892, over 21234.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.313, pruned_loss=0.08758, over 4257823.90 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:53:31,825 INFO [train.py:996] (1/4) Epoch 6, batch 14550, loss[loss=0.3098, simple_loss=0.3692, pruned_loss=0.1252, over 21752.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3175, pruned_loss=0.08925, over 4266016.43 frames. ], batch size: 298, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:54:18,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1002258.0, ans=0.025 2023-06-21 15:54:29,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1002258.0, ans=0.125 2023-06-21 15:54:39,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. 
limit=22.5 2023-06-21 15:54:41,349 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 3.127e+02 3.885e+02 5.234e+02 1.064e+03, threshold=7.771e+02, percent-clipped=10.0 2023-06-21 15:54:43,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1002318.0, ans=0.2 2023-06-21 15:54:47,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1002318.0, ans=0.125 2023-06-21 15:55:01,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-21 15:55:04,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1002378.0, ans=0.0 2023-06-21 15:55:07,179 INFO [train.py:996] (1/4) Epoch 6, batch 14600, loss[loss=0.2618, simple_loss=0.3364, pruned_loss=0.09362, over 21246.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3249, pruned_loss=0.09266, over 4268881.66 frames. ], batch size: 176, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:55:13,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1002438.0, ans=0.125 2023-06-21 15:55:22,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1002438.0, ans=0.125 2023-06-21 15:55:56,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-21 15:56:41,649 INFO [train.py:996] (1/4) Epoch 6, batch 14650, loss[loss=0.2525, simple_loss=0.3465, pruned_loss=0.07921, over 21754.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.326, pruned_loss=0.09158, over 4262710.14 frames. ], batch size: 351, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:56:48,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1002738.0, ans=0.0 2023-06-21 15:56:55,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1002738.0, ans=0.0 2023-06-21 15:57:14,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002798.0, ans=0.1 2023-06-21 15:57:24,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-21 15:57:46,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1002918.0, ans=0.2 2023-06-21 15:57:47,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1002918.0, ans=0.125 2023-06-21 15:57:50,599 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.901e+02 3.727e+02 5.131e+02 9.036e+02, threshold=7.453e+02, percent-clipped=4.0 2023-06-21 15:58:21,850 INFO [train.py:996] (1/4) Epoch 6, batch 14700, loss[loss=0.2637, simple_loss=0.3622, pruned_loss=0.08263, over 21631.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3179, pruned_loss=0.085, over 4260999.86 frames. 
], batch size: 389, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:58:22,248 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:58:23,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1003038.0, ans=0.0 2023-06-21 15:59:08,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1003158.0, ans=0.0 2023-06-21 15:59:14,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003158.0, ans=0.125 2023-06-21 15:59:24,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003218.0, ans=0.125 2023-06-21 15:59:37,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1003278.0, ans=0.1 2023-06-21 15:59:50,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1003278.0, ans=0.125 2023-06-21 15:59:57,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1003338.0, ans=0.125 2023-06-21 15:59:59,095 INFO [train.py:996] (1/4) Epoch 6, batch 14750, loss[loss=0.2605, simple_loss=0.3315, pruned_loss=0.09472, over 21818.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3236, pruned_loss=0.08775, over 4262195.43 frames. ], batch size: 282, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 16:00:10,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1003338.0, ans=0.0 2023-06-21 16:00:49,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1003458.0, ans=0.125 2023-06-21 16:01:04,193 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 3.041e+02 3.594e+02 4.539e+02 8.460e+02, threshold=7.189e+02, percent-clipped=3.0 2023-06-21 16:01:16,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1003518.0, ans=0.125 2023-06-21 16:01:22,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1003578.0, ans=0.0 2023-06-21 16:01:28,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1003578.0, ans=0.125 2023-06-21 16:01:39,219 INFO [train.py:996] (1/4) Epoch 6, batch 14800, loss[loss=0.264, simple_loss=0.3302, pruned_loss=0.09896, over 21620.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3374, pruned_loss=0.09498, over 4256672.64 frames. 
], batch size: 247, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 16:01:54,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1003638.0, ans=0.0 2023-06-21 16:02:19,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1003758.0, ans=0.125 2023-06-21 16:02:49,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1003818.0, ans=0.0 2023-06-21 16:03:14,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1003938.0, ans=0.125 2023-06-21 16:03:20,682 INFO [train.py:996] (1/4) Epoch 6, batch 14850, loss[loss=0.3106, simple_loss=0.3849, pruned_loss=0.1181, over 21611.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3297, pruned_loss=0.09406, over 4266310.25 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 32.0 2023-06-21 16:04:11,261 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:04:11,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-21 16:04:27,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004118.0, ans=0.125 2023-06-21 16:04:28,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.136e+02 3.769e+02 4.672e+02 7.258e+02, threshold=7.538e+02, percent-clipped=1.0 2023-06-21 16:05:03,047 INFO [train.py:996] (1/4) Epoch 6, batch 14900, loss[loss=0.2273, simple_loss=0.2877, pruned_loss=0.08349, over 21640.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3305, pruned_loss=0.09377, over 4262202.86 frames. ], batch size: 112, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:05:03,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1004238.0, ans=0.125 2023-06-21 16:06:40,449 INFO [train.py:996] (1/4) Epoch 6, batch 14950, loss[loss=0.262, simple_loss=0.3451, pruned_loss=0.08945, over 21796.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3316, pruned_loss=0.09357, over 4260533.88 frames. ], batch size: 118, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:07:41,313 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 2.931e+02 3.491e+02 4.464e+02 7.538e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-21 16:08:07,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-21 16:08:10,577 INFO [train.py:996] (1/4) Epoch 6, batch 15000, loss[loss=0.2518, simple_loss=0.3276, pruned_loss=0.08799, over 21783.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3331, pruned_loss=0.09529, over 4259745.07 frames. 
], batch size: 332, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:08:10,578 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 16:08:22,049 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5014, 1.5789, 2.5817, 2.6752], device='cuda:1') 2023-06-21 16:08:22,078 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4975, 2.9348, 2.8406, 2.6891], device='cuda:1') 2023-06-21 16:08:27,117 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.26, simple_loss=0.3558, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-21 16:08:27,118 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 16:08:42,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004838.0, ans=0.1 2023-06-21 16:08:43,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1004838.0, ans=0.125 2023-06-21 16:09:01,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1004898.0, ans=0.2 2023-06-21 16:09:55,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1005078.0, ans=0.125 2023-06-21 16:10:04,147 INFO [train.py:996] (1/4) Epoch 6, batch 15050, loss[loss=0.2211, simple_loss=0.3054, pruned_loss=0.0684, over 21637.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3342, pruned_loss=0.09597, over 4261374.34 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:10:16,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005138.0, ans=0.125 2023-06-21 16:10:26,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1005138.0, ans=0.0 2023-06-21 16:11:16,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 2.976e+02 3.378e+02 4.052e+02 9.524e+02, threshold=6.756e+02, percent-clipped=3.0 2023-06-21 16:11:44,040 INFO [train.py:996] (1/4) Epoch 6, batch 15100, loss[loss=0.2463, simple_loss=0.3258, pruned_loss=0.08344, over 21320.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3365, pruned_loss=0.09583, over 4262085.60 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:12:34,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-21 16:12:52,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1005618.0, ans=0.125 2023-06-21 16:13:23,666 INFO [train.py:996] (1/4) Epoch 6, batch 15150, loss[loss=0.2545, simple_loss=0.315, pruned_loss=0.09698, over 21806.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3326, pruned_loss=0.09576, over 4255792.23 frames. ], batch size: 98, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:14:12,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-21 16:14:24,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=15.0 2023-06-21 16:14:25,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 2.946e+02 3.329e+02 3.848e+02 7.712e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-21 16:14:42,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1005978.0, ans=0.1 2023-06-21 16:14:57,078 INFO [train.py:996] (1/4) Epoch 6, batch 15200, loss[loss=0.1922, simple_loss=0.2799, pruned_loss=0.05228, over 21388.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3238, pruned_loss=0.0913, over 4255916.60 frames. ], batch size: 211, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:15:36,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1006158.0, ans=0.125 2023-06-21 16:15:51,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006218.0, ans=0.1 2023-06-21 16:15:51,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1006218.0, ans=0.1 2023-06-21 16:16:04,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1006278.0, ans=0.125 2023-06-21 16:16:16,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.97 vs. limit=15.0 2023-06-21 16:16:30,313 INFO [train.py:996] (1/4) Epoch 6, batch 15250, loss[loss=0.2313, simple_loss=0.2986, pruned_loss=0.08203, over 21706.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3183, pruned_loss=0.08986, over 4258102.87 frames. ], batch size: 124, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:16:33,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006338.0, ans=0.125 2023-06-21 16:17:32,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.045e+02 3.599e+02 4.449e+02 6.735e+02, threshold=7.197e+02, percent-clipped=2.0 2023-06-21 16:18:15,099 INFO [train.py:996] (1/4) Epoch 6, batch 15300, loss[loss=0.2505, simple_loss=0.315, pruned_loss=0.09304, over 21325.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3217, pruned_loss=0.09296, over 4264937.75 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:18:59,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1006818.0, ans=0.5 2023-06-21 16:19:03,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1006818.0, ans=0.125 2023-06-21 16:19:16,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1006818.0, ans=0.125 2023-06-21 16:19:44,621 INFO [train.py:996] (1/4) Epoch 6, batch 15350, loss[loss=0.285, simple_loss=0.3573, pruned_loss=0.1063, over 21500.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3271, pruned_loss=0.09514, over 4262910.13 frames. 
], batch size: 131, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:20:14,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1006998.0, ans=0.1 2023-06-21 16:20:40,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.864e+02 3.312e+02 3.850e+02 5.534e+02, threshold=6.625e+02, percent-clipped=0.0 2023-06-21 16:20:50,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1007118.0, ans=0.125 2023-06-21 16:20:54,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1007178.0, ans=0.125 2023-06-21 16:21:12,484 INFO [train.py:996] (1/4) Epoch 6, batch 15400, loss[loss=0.2212, simple_loss=0.302, pruned_loss=0.07018, over 15420.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3273, pruned_loss=0.09269, over 4262945.94 frames. ], batch size: 60, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:21:40,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1007298.0, ans=0.1 2023-06-21 16:22:10,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1007418.0, ans=0.125 2023-06-21 16:22:23,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1007418.0, ans=0.0 2023-06-21 16:22:28,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-21 16:22:29,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1007478.0, ans=0.0 2023-06-21 16:22:29,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1007478.0, ans=0.0 2023-06-21 16:22:51,011 INFO [train.py:996] (1/4) Epoch 6, batch 15450, loss[loss=0.2454, simple_loss=0.3125, pruned_loss=0.08919, over 21826.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3236, pruned_loss=0.09134, over 4270645.38 frames. ], batch size: 441, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:23:16,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-21 16:23:17,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1007598.0, ans=0.2 2023-06-21 16:23:35,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-21 16:23:52,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1007718.0, ans=0.125 2023-06-21 16:23:53,555 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.820e+02 3.206e+02 3.882e+02 5.798e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-21 16:24:26,138 INFO [train.py:996] (1/4) Epoch 6, batch 15500, loss[loss=0.2212, simple_loss=0.3211, pruned_loss=0.06069, over 20724.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.327, pruned_loss=0.09159, over 4262699.48 frames. 
], batch size: 607, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:25:01,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007958.0, ans=0.1 2023-06-21 16:25:03,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1007958.0, ans=0.0 2023-06-21 16:25:46,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1008078.0, ans=0.0 2023-06-21 16:26:05,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-21 16:26:05,926 INFO [train.py:996] (1/4) Epoch 6, batch 15550, loss[loss=0.1981, simple_loss=0.2864, pruned_loss=0.05494, over 21583.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3246, pruned_loss=0.08891, over 4267289.82 frames. ], batch size: 230, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:26:33,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1008198.0, ans=0.09899494936611666 2023-06-21 16:26:45,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1008258.0, ans=0.0 2023-06-21 16:27:08,309 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.789e+02 3.164e+02 3.648e+02 6.720e+02, threshold=6.328e+02, percent-clipped=2.0 2023-06-21 16:27:24,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-21 16:27:39,799 INFO [train.py:996] (1/4) Epoch 6, batch 15600, loss[loss=0.2269, simple_loss=0.2904, pruned_loss=0.08168, over 21726.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3192, pruned_loss=0.08769, over 4274330.75 frames. ], batch size: 334, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:27:58,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-21 16:28:29,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1008618.0, ans=0.125 2023-06-21 16:29:07,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1008678.0, ans=0.125 2023-06-21 16:29:13,691 INFO [train.py:996] (1/4) Epoch 6, batch 15650, loss[loss=0.2137, simple_loss=0.2787, pruned_loss=0.07438, over 21176.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3188, pruned_loss=0.08752, over 4274508.96 frames. ], batch size: 548, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:29:41,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-21 16:30:15,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 3.062e+02 3.560e+02 4.421e+02 6.753e+02, threshold=7.119e+02, percent-clipped=3.0 2023-06-21 16:30:47,542 INFO [train.py:996] (1/4) Epoch 6, batch 15700, loss[loss=0.1934, simple_loss=0.2565, pruned_loss=0.06518, over 15300.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3146, pruned_loss=0.08667, over 4265648.20 frames. 
], batch size: 60, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:31:04,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-21 16:31:51,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1009218.0, ans=0.125 2023-06-21 16:32:18,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=8.0 2023-06-21 16:32:21,335 INFO [train.py:996] (1/4) Epoch 6, batch 15750, loss[loss=0.2531, simple_loss=0.3075, pruned_loss=0.09938, over 21806.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3104, pruned_loss=0.08636, over 4262589.86 frames. ], batch size: 98, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:32:26,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1009338.0, ans=0.1 2023-06-21 16:32:35,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1009398.0, ans=0.0 2023-06-21 16:33:00,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1009458.0, ans=0.125 2023-06-21 16:33:24,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.861e+02 3.404e+02 4.012e+02 5.531e+02, threshold=6.808e+02, percent-clipped=0.0 2023-06-21 16:33:25,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-21 16:33:54,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009638.0, ans=0.1 2023-06-21 16:33:55,156 INFO [train.py:996] (1/4) Epoch 6, batch 15800, loss[loss=0.2221, simple_loss=0.2754, pruned_loss=0.08443, over 21473.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3065, pruned_loss=0.08615, over 4261816.17 frames. ], batch size: 230, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:34:04,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1009638.0, ans=15.0 2023-06-21 16:34:57,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1009818.0, ans=0.125 2023-06-21 16:35:29,280 INFO [train.py:996] (1/4) Epoch 6, batch 15850, loss[loss=0.2732, simple_loss=0.3349, pruned_loss=0.1057, over 21361.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3089, pruned_loss=0.08804, over 4259570.24 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:35:47,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1009998.0, ans=0.0 2023-06-21 16:35:54,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1009998.0, ans=0.0 2023-06-21 16:36:24,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. 
limit=22.5 2023-06-21 16:36:32,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.973e+02 3.336e+02 4.018e+02 6.867e+02, threshold=6.671e+02, percent-clipped=1.0 2023-06-21 16:36:34,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1010118.0, ans=0.0 2023-06-21 16:37:02,612 INFO [train.py:996] (1/4) Epoch 6, batch 15900, loss[loss=0.2648, simple_loss=0.3094, pruned_loss=0.1101, over 21331.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3054, pruned_loss=0.08731, over 4258379.00 frames. ], batch size: 473, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:37:10,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1010238.0, ans=0.07 2023-06-21 16:37:46,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1010358.0, ans=0.125 2023-06-21 16:37:50,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1010418.0, ans=0.125 2023-06-21 16:38:05,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1010418.0, ans=0.05 2023-06-21 16:38:15,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1010478.0, ans=0.125 2023-06-21 16:38:20,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-21 16:38:36,322 INFO [train.py:996] (1/4) Epoch 6, batch 15950, loss[loss=0.1895, simple_loss=0.2914, pruned_loss=0.04373, over 21762.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3058, pruned_loss=0.08533, over 4250083.61 frames. ], batch size: 351, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:38:42,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1010538.0, ans=0.125 2023-06-21 16:38:50,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010598.0, ans=0.1 2023-06-21 16:39:15,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.74 vs. limit=15.0 2023-06-21 16:39:21,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-21 16:39:40,130 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.630e+02 3.032e+02 3.635e+02 5.664e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-21 16:40:06,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010778.0, ans=0.1 2023-06-21 16:40:10,505 INFO [train.py:996] (1/4) Epoch 6, batch 16000, loss[loss=0.2618, simple_loss=0.3469, pruned_loss=0.08832, over 21647.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3089, pruned_loss=0.08423, over 4253859.56 frames. 
], batch size: 389, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:40:15,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1010838.0, ans=0.0 2023-06-21 16:40:15,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-21 16:40:21,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1010838.0, ans=0.0 2023-06-21 16:40:36,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1010898.0, ans=0.0 2023-06-21 16:41:01,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1011018.0, ans=0.0 2023-06-21 16:41:26,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-21 16:41:40,562 INFO [train.py:996] (1/4) Epoch 6, batch 16050, loss[loss=0.2585, simple_loss=0.3467, pruned_loss=0.08516, over 21795.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3103, pruned_loss=0.0823, over 4261099.57 frames. ], batch size: 282, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:41:46,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1011138.0, ans=0.125 2023-06-21 16:42:12,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1011258.0, ans=0.125 2023-06-21 16:42:42,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1011318.0, ans=0.1 2023-06-21 16:42:43,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1011318.0, ans=0.05 2023-06-21 16:42:44,670 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.881e+02 3.788e+02 4.822e+02 9.882e+02, threshold=7.576e+02, percent-clipped=9.0 2023-06-21 16:43:13,228 INFO [train.py:996] (1/4) Epoch 6, batch 16100, loss[loss=0.2863, simple_loss=0.3503, pruned_loss=0.1111, over 21603.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3177, pruned_loss=0.08441, over 4266173.92 frames. ], batch size: 471, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:43:15,093 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:43:18,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1011438.0, ans=0.125 2023-06-21 16:43:27,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=22.5 2023-06-21 16:43:29,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-21 16:44:08,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-21 16:44:20,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. 
limit=15.0 2023-06-21 16:44:42,430 INFO [train.py:996] (1/4) Epoch 6, batch 16150, loss[loss=0.3006, simple_loss=0.3507, pruned_loss=0.1253, over 21796.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.319, pruned_loss=0.08642, over 4275569.57 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:44:47,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.11 vs. limit=15.0 2023-06-21 16:44:55,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1011738.0, ans=0.125 2023-06-21 16:45:07,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-21 16:45:47,591 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.047e+02 3.537e+02 4.143e+02 9.363e+02, threshold=7.074e+02, percent-clipped=2.0 2023-06-21 16:46:16,807 INFO [train.py:996] (1/4) Epoch 6, batch 16200, loss[loss=0.2653, simple_loss=0.3331, pruned_loss=0.09873, over 21674.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.323, pruned_loss=0.08816, over 4277991.20 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:46:24,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1012038.0, ans=0.1 2023-06-21 16:47:51,838 INFO [train.py:996] (1/4) Epoch 6, batch 16250, loss[loss=0.2474, simple_loss=0.3211, pruned_loss=0.08681, over 21422.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3217, pruned_loss=0.08829, over 4270873.16 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:49:01,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.762e+02 3.334e+02 4.108e+02 7.386e+02, threshold=6.668e+02, percent-clipped=1.0 2023-06-21 16:49:11,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1012578.0, ans=0.1 2023-06-21 16:49:26,006 INFO [train.py:996] (1/4) Epoch 6, batch 16300, loss[loss=0.202, simple_loss=0.2715, pruned_loss=0.06629, over 21233.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3152, pruned_loss=0.08364, over 4272590.64 frames. ], batch size: 159, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:50:02,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=22.5 2023-06-21 16:50:23,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1012818.0, ans=0.0 2023-06-21 16:50:24,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1012818.0, ans=0.125 2023-06-21 16:50:56,868 INFO [train.py:996] (1/4) Epoch 6, batch 16350, loss[loss=0.318, simple_loss=0.3784, pruned_loss=0.1288, over 21606.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3154, pruned_loss=0.08487, over 4262524.24 frames. 
], batch size: 415, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:51:00,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1012938.0, ans=0.125 2023-06-21 16:51:01,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1012938.0, ans=0.0 2023-06-21 16:51:20,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012998.0, ans=0.1 2023-06-21 16:52:11,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.615e+02 3.247e+02 3.873e+02 7.213e+02, threshold=6.493e+02, percent-clipped=3.0 2023-06-21 16:52:30,804 INFO [train.py:996] (1/4) Epoch 6, batch 16400, loss[loss=0.2795, simple_loss=0.3364, pruned_loss=0.1113, over 21909.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3175, pruned_loss=0.08616, over 4269540.20 frames. ], batch size: 118, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:52:32,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1013238.0, ans=0.0 2023-06-21 16:52:59,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1013298.0, ans=0.0 2023-06-21 16:53:25,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1013358.0, ans=0.0 2023-06-21 16:53:40,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1013418.0, ans=0.125 2023-06-21 16:54:04,532 INFO [train.py:996] (1/4) Epoch 6, batch 16450, loss[loss=0.3264, simple_loss=0.3629, pruned_loss=0.1449, over 21767.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3197, pruned_loss=0.08846, over 4268237.53 frames. ], batch size: 508, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:55:11,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1013718.0, ans=0.0 2023-06-21 16:55:19,951 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.857e+02 3.262e+02 3.717e+02 6.839e+02, threshold=6.523e+02, percent-clipped=2.0 2023-06-21 16:55:39,129 INFO [train.py:996] (1/4) Epoch 6, batch 16500, loss[loss=0.1929, simple_loss=0.2453, pruned_loss=0.07025, over 21908.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3193, pruned_loss=0.08899, over 4278456.89 frames. ], batch size: 107, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:55:56,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1013838.0, ans=0.0 2023-06-21 16:57:17,893 INFO [train.py:996] (1/4) Epoch 6, batch 16550, loss[loss=0.2549, simple_loss=0.3195, pruned_loss=0.09515, over 21441.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3163, pruned_loss=0.08603, over 4275508.66 frames. 
], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:57:59,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1014198.0, ans=0.0 2023-06-21 16:58:14,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1014258.0, ans=10.0 2023-06-21 16:58:29,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.954e+02 3.435e+02 4.498e+02 9.143e+02, threshold=6.870e+02, percent-clipped=8.0 2023-06-21 16:58:43,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014378.0, ans=0.1 2023-06-21 16:58:46,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-06-21 16:58:47,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1014378.0, ans=0.0 2023-06-21 16:58:54,060 INFO [train.py:996] (1/4) Epoch 6, batch 16600, loss[loss=0.3455, simple_loss=0.4318, pruned_loss=0.1296, over 21705.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3264, pruned_loss=0.08991, over 4279270.12 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:00:34,745 INFO [train.py:996] (1/4) Epoch 6, batch 16650, loss[loss=0.254, simple_loss=0.3351, pruned_loss=0.08643, over 20693.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3372, pruned_loss=0.09343, over 4274550.67 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:00:35,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1014738.0, ans=6.0 2023-06-21 17:01:28,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1014858.0, ans=0.0 2023-06-21 17:01:52,122 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.123e+02 3.593e+02 4.644e+02 7.930e+02, threshold=7.186e+02, percent-clipped=2.0 2023-06-21 17:01:55,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1014978.0, ans=10.0 2023-06-21 17:01:55,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=15.0 2023-06-21 17:02:21,297 INFO [train.py:996] (1/4) Epoch 6, batch 16700, loss[loss=0.2831, simple_loss=0.3848, pruned_loss=0.0907, over 20691.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3411, pruned_loss=0.09441, over 4262628.87 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:02:28,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-21 17:03:02,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015158.0, ans=0.1 2023-06-21 17:03:36,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-21 17:03:58,847 INFO [train.py:996] (1/4) Epoch 6, batch 16750, loss[loss=0.326, simple_loss=0.4077, pruned_loss=0.1222, over 21583.00 frames. 
], tot_loss[loss=0.266, simple_loss=0.3408, pruned_loss=0.09562, over 4260282.65 frames. ], batch size: 414, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:04:22,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1015398.0, ans=0.07 2023-06-21 17:04:30,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1015398.0, ans=0.02 2023-06-21 17:04:39,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1015458.0, ans=0.035 2023-06-21 17:04:51,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1015458.0, ans=0.125 2023-06-21 17:05:03,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015518.0, ans=0.125 2023-06-21 17:05:12,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.506e+02 4.253e+02 6.038e+02 1.079e+03, threshold=8.506e+02, percent-clipped=10.0 2023-06-21 17:05:34,885 INFO [train.py:996] (1/4) Epoch 6, batch 16800, loss[loss=0.3387, simple_loss=0.395, pruned_loss=0.1412, over 21615.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3454, pruned_loss=0.096, over 4265000.24 frames. ], batch size: 471, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:05:58,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1015698.0, ans=0.05 2023-06-21 17:06:39,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015818.0, ans=0.125 2023-06-21 17:07:08,792 INFO [train.py:996] (1/4) Epoch 6, batch 16850, loss[loss=0.2293, simple_loss=0.3024, pruned_loss=0.07807, over 21892.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3405, pruned_loss=0.09476, over 4268781.98 frames. ], batch size: 414, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:07:19,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1015938.0, ans=0.0 2023-06-21 17:08:05,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1016058.0, ans=0.07 2023-06-21 17:08:21,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.931e+02 3.423e+02 4.482e+02 7.655e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-21 17:08:43,847 INFO [train.py:996] (1/4) Epoch 6, batch 16900, loss[loss=0.2325, simple_loss=0.2995, pruned_loss=0.08276, over 21566.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3336, pruned_loss=0.09266, over 4279367.51 frames. ], batch size: 389, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:09:27,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. 
limit=15.0 2023-06-21 17:09:28,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016358.0, ans=0.1 2023-06-21 17:09:29,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1016358.0, ans=0.125 2023-06-21 17:09:37,232 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:09:50,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1016418.0, ans=0.5 2023-06-21 17:10:03,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-21 17:10:16,671 INFO [train.py:996] (1/4) Epoch 6, batch 16950, loss[loss=0.2135, simple_loss=0.2827, pruned_loss=0.07211, over 21925.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3258, pruned_loss=0.09063, over 4285468.23 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:10:40,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=12.0 2023-06-21 17:10:56,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1016598.0, ans=0.125 2023-06-21 17:11:09,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1016658.0, ans=0.0 2023-06-21 17:11:26,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.733e+02 3.000e+02 3.564e+02 5.984e+02, threshold=6.000e+02, percent-clipped=0.0 2023-06-21 17:11:38,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1016778.0, ans=0.125 2023-06-21 17:11:50,083 INFO [train.py:996] (1/4) Epoch 6, batch 17000, loss[loss=0.2123, simple_loss=0.276, pruned_loss=0.07434, over 21235.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3208, pruned_loss=0.09005, over 4285344.36 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:12:23,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1016898.0, ans=0.04949747468305833 2023-06-21 17:12:49,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1017018.0, ans=0.125 2023-06-21 17:13:16,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1017078.0, ans=0.0 2023-06-21 17:13:20,260 INFO [train.py:996] (1/4) Epoch 6, batch 17050, loss[loss=0.2661, simple_loss=0.3589, pruned_loss=0.08661, over 21837.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3296, pruned_loss=0.09358, over 4289596.27 frames. 
], batch size: 371, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:13:20,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1017138.0, ans=0.2 2023-06-21 17:13:46,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1017138.0, ans=0.125 2023-06-21 17:14:30,021 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 3.084e+02 3.616e+02 4.433e+02 7.180e+02, threshold=7.232e+02, percent-clipped=5.0 2023-06-21 17:14:36,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1017378.0, ans=0.2 2023-06-21 17:14:44,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-21 17:14:49,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1017378.0, ans=0.2 2023-06-21 17:14:52,389 INFO [train.py:996] (1/4) Epoch 6, batch 17100, loss[loss=0.2394, simple_loss=0.3095, pruned_loss=0.08469, over 21453.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.329, pruned_loss=0.094, over 4289481.60 frames. ], batch size: 131, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:14:54,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1017438.0, ans=10.0 2023-06-21 17:14:58,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1017438.0, ans=0.0 2023-06-21 17:15:00,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1017438.0, ans=0.125 2023-06-21 17:15:12,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017438.0, ans=0.125 2023-06-21 17:15:34,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017498.0, ans=0.125 2023-06-21 17:15:36,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017558.0, ans=0.1 2023-06-21 17:16:20,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1017678.0, ans=0.0 2023-06-21 17:16:23,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=12.0 2023-06-21 17:16:24,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1017678.0, ans=0.125 2023-06-21 17:16:26,697 INFO [train.py:996] (1/4) Epoch 6, batch 17150, loss[loss=0.212, simple_loss=0.2888, pruned_loss=0.0676, over 21669.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3246, pruned_loss=0.09335, over 4284449.76 frames. ], batch size: 230, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:17:13,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1017858.0, ans=0.1 2023-06-21 17:17:25,098 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. 
limit=15.0 2023-06-21 17:17:39,113 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 2.784e+02 3.036e+02 3.616e+02 5.334e+02, threshold=6.072e+02, percent-clipped=0.0 2023-06-21 17:17:41,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1017978.0, ans=0.0 2023-06-21 17:18:05,520 INFO [train.py:996] (1/4) Epoch 6, batch 17200, loss[loss=0.2361, simple_loss=0.3067, pruned_loss=0.08273, over 21813.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3234, pruned_loss=0.09257, over 4284855.94 frames. ], batch size: 247, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:18:33,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-21 17:19:40,270 INFO [train.py:996] (1/4) Epoch 6, batch 17250, loss[loss=0.2655, simple_loss=0.3488, pruned_loss=0.09112, over 21348.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3278, pruned_loss=0.09499, over 4282857.19 frames. ], batch size: 176, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:19:51,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=12.0 2023-06-21 17:20:54,739 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.506e+02 3.363e+02 4.058e+02 5.457e+02 1.011e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 17:21:04,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-21 17:21:05,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1018578.0, ans=0.0 2023-06-21 17:21:09,815 INFO [train.py:996] (1/4) Epoch 6, batch 17300, loss[loss=0.2635, simple_loss=0.3346, pruned_loss=0.09614, over 21742.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3365, pruned_loss=0.09787, over 4282317.93 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:22:15,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1018818.0, ans=0.1 2023-06-21 17:22:17,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2023-06-21 17:22:40,268 INFO [train.py:996] (1/4) Epoch 6, batch 17350, loss[loss=0.2222, simple_loss=0.3111, pruned_loss=0.06667, over 21804.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3365, pruned_loss=0.09697, over 4282734.05 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:22:49,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1018938.0, ans=0.125 2023-06-21 17:23:36,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1019058.0, ans=0.0 2023-06-21 17:23:45,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.93 vs. 
limit=12.0 2023-06-21 17:23:56,313 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.958e+02 3.315e+02 3.844e+02 7.686e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 17:24:11,635 INFO [train.py:996] (1/4) Epoch 6, batch 17400, loss[loss=0.205, simple_loss=0.2531, pruned_loss=0.07842, over 21227.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3315, pruned_loss=0.09341, over 4275883.75 frames. ], batch size: 143, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:24:25,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1019238.0, ans=0.0 2023-06-21 17:24:39,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1019298.0, ans=0.2 2023-06-21 17:25:22,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1019418.0, ans=0.2 2023-06-21 17:25:42,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-21 17:25:46,424 INFO [train.py:996] (1/4) Epoch 6, batch 17450, loss[loss=0.2268, simple_loss=0.2733, pruned_loss=0.09017, over 20008.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3268, pruned_loss=0.09057, over 4267860.30 frames. ], batch size: 704, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:25:48,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1019538.0, ans=0.0 2023-06-21 17:25:53,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-21 17:25:57,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1019538.0, ans=0.0 2023-06-21 17:26:06,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1019598.0, ans=0.0 2023-06-21 17:26:47,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1019718.0, ans=0.07 2023-06-21 17:26:48,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-21 17:26:49,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1019718.0, ans=0.2 2023-06-21 17:27:05,007 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.736e+02 3.255e+02 3.942e+02 6.172e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 17:27:06,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1019778.0, ans=0.125 2023-06-21 17:27:10,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1019778.0, ans=0.125 2023-06-21 17:27:18,933 INFO [train.py:996] (1/4) Epoch 6, batch 17500, loss[loss=0.2528, simple_loss=0.3182, pruned_loss=0.09367, over 21913.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3218, pruned_loss=0.08831, over 4271317.88 frames. 
], batch size: 316, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:27:56,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1019898.0, ans=0.2 2023-06-21 17:27:58,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1019898.0, ans=0.125 2023-06-21 17:28:11,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1019958.0, ans=0.125 2023-06-21 17:28:14,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1019958.0, ans=0.125 2023-06-21 17:28:50,796 INFO [train.py:996] (1/4) Epoch 6, batch 17550, loss[loss=0.252, simple_loss=0.3378, pruned_loss=0.08316, over 21387.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3214, pruned_loss=0.08658, over 4259334.18 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:28:54,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1020138.0, ans=0.125 2023-06-21 17:29:54,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1020318.0, ans=0.1 2023-06-21 17:30:11,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.780e+02 3.218e+02 4.154e+02 6.196e+02, threshold=6.435e+02, percent-clipped=0.0 2023-06-21 17:30:24,475 INFO [train.py:996] (1/4) Epoch 6, batch 17600, loss[loss=0.2563, simple_loss=0.3355, pruned_loss=0.08851, over 21467.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3248, pruned_loss=0.08654, over 4244453.23 frames. ], batch size: 194, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:30:39,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1020438.0, ans=0.125 2023-06-21 17:31:26,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1020618.0, ans=0.2 2023-06-21 17:31:59,656 INFO [train.py:996] (1/4) Epoch 6, batch 17650, loss[loss=0.2383, simple_loss=0.3163, pruned_loss=0.0801, over 21558.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3234, pruned_loss=0.0872, over 4251760.04 frames. 
], batch size: 441, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:32:15,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1020738.0, ans=0.0 2023-06-21 17:32:23,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1020738.0, ans=0.2 2023-06-21 17:32:36,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1020798.0, ans=0.1 2023-06-21 17:32:36,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1020798.0, ans=0.125 2023-06-21 17:32:41,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1020798.0, ans=0.05 2023-06-21 17:32:59,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1020918.0, ans=0.0 2023-06-21 17:33:20,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.977e+02 3.448e+02 4.059e+02 7.958e+02, threshold=6.896e+02, percent-clipped=7.0 2023-06-21 17:33:47,957 INFO [train.py:996] (1/4) Epoch 6, batch 17700, loss[loss=0.2544, simple_loss=0.3195, pruned_loss=0.09462, over 19974.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3181, pruned_loss=0.08467, over 4244064.10 frames. ], batch size: 703, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:34:30,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1021158.0, ans=0.2 2023-06-21 17:34:36,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1021218.0, ans=0.2 2023-06-21 17:34:53,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1021278.0, ans=0.125 2023-06-21 17:35:18,342 INFO [train.py:996] (1/4) Epoch 6, batch 17750, loss[loss=0.2997, simple_loss=0.3732, pruned_loss=0.113, over 21834.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3246, pruned_loss=0.0878, over 4249866.26 frames. ], batch size: 124, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:35:33,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1021398.0, ans=0.0 2023-06-21 17:36:36,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.843e+02 3.336e+02 3.898e+02 5.169e+02, threshold=6.672e+02, percent-clipped=0.0 2023-06-21 17:36:49,081 INFO [train.py:996] (1/4) Epoch 6, batch 17800, loss[loss=0.2582, simple_loss=0.332, pruned_loss=0.09223, over 21453.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3243, pruned_loss=0.08714, over 4259815.75 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:36:54,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1021638.0, ans=0.2 2023-06-21 17:37:04,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1021698.0, ans=0.0 2023-06-21 17:37:38,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. 
limit=15.0 2023-06-21 17:38:10,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1021878.0, ans=0.125 2023-06-21 17:38:12,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1021878.0, ans=0.125 2023-06-21 17:38:19,584 INFO [train.py:996] (1/4) Epoch 6, batch 17850, loss[loss=0.2608, simple_loss=0.3275, pruned_loss=0.0971, over 21473.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3251, pruned_loss=0.08787, over 4261959.41 frames. ], batch size: 211, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:38:35,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1021938.0, ans=0.125 2023-06-21 17:38:45,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-21 17:38:48,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-21 17:39:37,562 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 3.033e+02 3.417e+02 4.325e+02 8.227e+02, threshold=6.834e+02, percent-clipped=5.0 2023-06-21 17:39:54,785 INFO [train.py:996] (1/4) Epoch 6, batch 17900, loss[loss=0.2922, simple_loss=0.3792, pruned_loss=0.1026, over 21716.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3309, pruned_loss=0.09042, over 4267469.97 frames. ], batch size: 441, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:40:08,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1022298.0, ans=0.0 2023-06-21 17:40:24,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1022358.0, ans=0.125 2023-06-21 17:41:29,382 INFO [train.py:996] (1/4) Epoch 6, batch 17950, loss[loss=0.2547, simple_loss=0.3476, pruned_loss=0.08095, over 21496.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3295, pruned_loss=0.08627, over 4271485.68 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:42:07,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-21 17:42:11,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1022658.0, ans=0.1 2023-06-21 17:42:45,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.768e+02 3.134e+02 4.074e+02 6.684e+02, threshold=6.268e+02, percent-clipped=0.0 2023-06-21 17:42:57,279 INFO [train.py:996] (1/4) Epoch 6, batch 18000, loss[loss=0.2349, simple_loss=0.2922, pruned_loss=0.0888, over 21811.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3222, pruned_loss=0.08484, over 4271449.93 frames. ], batch size: 98, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:42:57,280 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 17:43:11,835 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.4141, 2.7478, 1.6921, 1.8789], device='cuda:1') 2023-06-21 17:43:13,460 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2661, simple_loss=0.365, pruned_loss=0.08355, over 1796401.00 frames. 
2023-06-21 17:43:13,461 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 17:43:34,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=22.5 2023-06-21 17:43:57,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1022958.0, ans=0.05 2023-06-21 17:44:15,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1023018.0, ans=0.125 2023-06-21 17:44:17,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1023018.0, ans=0.2 2023-06-21 17:44:23,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1023018.0, ans=0.125 2023-06-21 17:44:26,901 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-21 17:44:38,368 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:44:38,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1023078.0, ans=0.125 2023-06-21 17:44:42,379 INFO [train.py:996] (1/4) Epoch 6, batch 18050, loss[loss=0.281, simple_loss=0.3485, pruned_loss=0.1067, over 21789.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3165, pruned_loss=0.08464, over 4262547.94 frames. ], batch size: 124, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:45:21,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1023258.0, ans=0.125 2023-06-21 17:45:38,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-21 17:46:01,044 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.145e+02 3.852e+02 4.625e+02 8.498e+02, threshold=7.705e+02, percent-clipped=3.0 2023-06-21 17:46:18,333 INFO [train.py:996] (1/4) Epoch 6, batch 18100, loss[loss=0.3095, simple_loss=0.38, pruned_loss=0.1195, over 21506.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3227, pruned_loss=0.08723, over 4263843.46 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:47:09,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1023558.0, ans=0.0 2023-06-21 17:47:26,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-21 17:47:52,806 INFO [train.py:996] (1/4) Epoch 6, batch 18150, loss[loss=0.2485, simple_loss=0.3119, pruned_loss=0.09256, over 21482.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3238, pruned_loss=0.08634, over 4269046.79 frames. ], batch size: 195, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:47:56,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.00 vs. 
limit=22.5 2023-06-21 17:48:48,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1023918.0, ans=0.125 2023-06-21 17:48:54,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1023918.0, ans=0.125 2023-06-21 17:49:04,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.867e+02 3.393e+02 4.008e+02 7.433e+02, threshold=6.785e+02, percent-clipped=0.0 2023-06-21 17:49:07,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1023978.0, ans=0.95 2023-06-21 17:49:16,094 INFO [train.py:996] (1/4) Epoch 6, batch 18200, loss[loss=0.2057, simple_loss=0.2751, pruned_loss=0.0682, over 21827.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3168, pruned_loss=0.08595, over 4262694.49 frames. ], batch size: 112, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:49:31,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1024038.0, ans=0.125 2023-06-21 17:50:05,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1024158.0, ans=0.125 2023-06-21 17:50:18,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1024218.0, ans=0.125 2023-06-21 17:50:20,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1024218.0, ans=0.125 2023-06-21 17:50:44,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1024278.0, ans=0.025 2023-06-21 17:50:47,202 INFO [train.py:996] (1/4) Epoch 6, batch 18250, loss[loss=0.1933, simple_loss=0.2607, pruned_loss=0.06292, over 21873.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3092, pruned_loss=0.08341, over 4267064.71 frames. ], batch size: 98, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:50:48,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=15.0 2023-06-21 17:51:20,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. 
limit=12.0 2023-06-21 17:51:31,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1024458.0, ans=0.0 2023-06-21 17:51:36,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1024458.0, ans=0.0 2023-06-21 17:51:49,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1024518.0, ans=0.0 2023-06-21 17:52:04,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.691e+02 3.171e+02 4.057e+02 6.181e+02, threshold=6.342e+02, percent-clipped=0.0 2023-06-21 17:52:09,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1024578.0, ans=0.2 2023-06-21 17:52:12,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1024578.0, ans=0.125 2023-06-21 17:52:15,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1024638.0, ans=0.1 2023-06-21 17:52:16,382 INFO [train.py:996] (1/4) Epoch 6, batch 18300, loss[loss=0.2638, simple_loss=0.3508, pruned_loss=0.08837, over 21795.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3091, pruned_loss=0.08372, over 4265778.64 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:52:24,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1024638.0, ans=0.1 2023-06-21 17:53:03,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1024758.0, ans=0.125 2023-06-21 17:53:24,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1024818.0, ans=0.125 2023-06-21 17:53:26,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1024818.0, ans=0.125 2023-06-21 17:53:32,449 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:53:49,934 INFO [train.py:996] (1/4) Epoch 6, batch 18350, loss[loss=0.2447, simple_loss=0.3574, pruned_loss=0.06597, over 21666.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3153, pruned_loss=0.08402, over 4248417.43 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 8.0 2023-06-21 17:53:56,395 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:53:59,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1024938.0, ans=0.125 2023-06-21 17:54:03,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1024998.0, ans=0.125 2023-06-21 17:54:27,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. 
limit=10.0 2023-06-21 17:54:46,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1025058.0, ans=10.0 2023-06-21 17:55:05,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1025118.0, ans=0.125 2023-06-21 17:55:13,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.969e+02 3.614e+02 4.589e+02 8.811e+02, threshold=7.228e+02, percent-clipped=4.0 2023-06-21 17:55:15,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1025178.0, ans=0.0 2023-06-21 17:55:24,014 INFO [train.py:996] (1/4) Epoch 6, batch 18400, loss[loss=0.217, simple_loss=0.2816, pruned_loss=0.07625, over 21493.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.31, pruned_loss=0.08279, over 4240472.17 frames. ], batch size: 212, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:55:31,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1025238.0, ans=0.1 2023-06-21 17:55:40,383 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.28 vs. limit=10.0 2023-06-21 17:56:01,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1025298.0, ans=0.125 2023-06-21 17:56:24,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=8.0 2023-06-21 17:56:42,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.34 vs. limit=15.0 2023-06-21 17:56:56,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1025538.0, ans=0.125 2023-06-21 17:56:56,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1025538.0, ans=0.2 2023-06-21 17:56:57,261 INFO [train.py:996] (1/4) Epoch 6, batch 18450, loss[loss=0.2781, simple_loss=0.3987, pruned_loss=0.07873, over 19891.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3074, pruned_loss=0.07849, over 4247094.26 frames. ], batch size: 702, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:57:06,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1025538.0, ans=0.125 2023-06-21 17:58:20,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.633e+02 3.038e+02 3.698e+02 5.788e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-21 17:58:30,576 INFO [train.py:996] (1/4) Epoch 6, batch 18500, loss[loss=0.2811, simple_loss=0.3558, pruned_loss=0.1032, over 21477.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3037, pruned_loss=0.07747, over 4227706.15 frames. 
], batch size: 508, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:58:51,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1025898.0, ans=0.125 2023-06-21 17:59:19,754 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:59:33,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1026018.0, ans=0.125 2023-06-21 17:59:48,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1026078.0, ans=0.125 2023-06-21 17:59:58,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1026078.0, ans=0.125 2023-06-21 18:00:02,399 INFO [train.py:996] (1/4) Epoch 6, batch 18550, loss[loss=0.2155, simple_loss=0.2816, pruned_loss=0.07465, over 21719.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2999, pruned_loss=0.07643, over 4220641.95 frames. ], batch size: 316, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:00:04,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1026138.0, ans=0.2 2023-06-21 18:00:10,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1026138.0, ans=0.0 2023-06-21 18:00:47,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1026258.0, ans=0.0 2023-06-21 18:00:54,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-21 18:01:04,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-21 18:01:08,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1026318.0, ans=0.0 2023-06-21 18:01:20,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026378.0, ans=0.1 2023-06-21 18:01:26,095 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.898e+02 3.436e+02 4.284e+02 7.618e+02, threshold=6.872e+02, percent-clipped=4.0 2023-06-21 18:01:34,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1026378.0, ans=0.2 2023-06-21 18:01:36,709 INFO [train.py:996] (1/4) Epoch 6, batch 18600, loss[loss=0.1908, simple_loss=0.2638, pruned_loss=0.05895, over 21336.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2998, pruned_loss=0.07726, over 4220358.52 frames. ], batch size: 131, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:01:43,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1026438.0, ans=0.0 2023-06-21 18:02:26,150 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:02:52,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.78 vs. 
limit=15.0 2023-06-21 18:02:52,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1026678.0, ans=0.0 2023-06-21 18:03:05,876 INFO [train.py:996] (1/4) Epoch 6, batch 18650, loss[loss=0.2325, simple_loss=0.2906, pruned_loss=0.08725, over 21782.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2992, pruned_loss=0.07722, over 4227905.56 frames. ], batch size: 107, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:03:31,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-21 18:03:35,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-21 18:03:36,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026798.0, ans=0.1 2023-06-21 18:04:28,177 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.661e+02 3.021e+02 3.518e+02 6.281e+02, threshold=6.043e+02, percent-clipped=0.0 2023-06-21 18:04:38,438 INFO [train.py:996] (1/4) Epoch 6, batch 18700, loss[loss=0.2356, simple_loss=0.2879, pruned_loss=0.09165, over 21136.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.2972, pruned_loss=0.07949, over 4235082.04 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:04:40,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1027038.0, ans=0.0 2023-06-21 18:05:03,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1027098.0, ans=0.0 2023-06-21 18:05:20,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-21 18:05:32,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-21 18:05:44,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027218.0, ans=0.1 2023-06-21 18:06:07,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1027278.0, ans=0.125 2023-06-21 18:06:08,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1027278.0, ans=0.0 2023-06-21 18:06:11,374 INFO [train.py:996] (1/4) Epoch 6, batch 18750, loss[loss=0.2292, simple_loss=0.2987, pruned_loss=0.07982, over 21294.00 frames. ], tot_loss[loss=0.231, simple_loss=0.2985, pruned_loss=0.08182, over 4248540.05 frames. 
], batch size: 159, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:06:16,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1027338.0, ans=0.0 2023-06-21 18:06:23,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1027338.0, ans=0.2 2023-06-21 18:06:23,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1027338.0, ans=0.2 2023-06-21 18:07:34,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 2.900e+02 3.294e+02 3.931e+02 7.649e+02, threshold=6.589e+02, percent-clipped=4.0 2023-06-21 18:07:34,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1027578.0, ans=0.125 2023-06-21 18:07:45,352 INFO [train.py:996] (1/4) Epoch 6, batch 18800, loss[loss=0.1761, simple_loss=0.2456, pruned_loss=0.05326, over 16547.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3051, pruned_loss=0.08279, over 4240407.46 frames. ], batch size: 60, lr: 5.05e-03, grad_scale: 32.0 2023-06-21 18:08:14,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1027698.0, ans=0.0 2023-06-21 18:08:50,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1027818.0, ans=0.125 2023-06-21 18:09:00,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1027818.0, ans=0.125 2023-06-21 18:09:02,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1027878.0, ans=0.1 2023-06-21 18:09:18,547 INFO [train.py:996] (1/4) Epoch 6, batch 18850, loss[loss=0.2652, simple_loss=0.3116, pruned_loss=0.1094, over 20345.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3004, pruned_loss=0.07845, over 4244316.03 frames. ], batch size: 703, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:09:32,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027938.0, ans=0.1 2023-06-21 18:10:09,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1028058.0, ans=0.0 2023-06-21 18:10:41,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.569e+02 2.914e+02 3.331e+02 4.866e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 18:10:51,727 INFO [train.py:996] (1/4) Epoch 6, batch 18900, loss[loss=0.1858, simple_loss=0.2549, pruned_loss=0.05833, over 20742.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2984, pruned_loss=0.07913, over 4241290.15 frames. ], batch size: 608, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:11:06,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.43 vs. 
limit=15.0 2023-06-21 18:11:07,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1028238.0, ans=0.05 2023-06-21 18:11:20,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1028298.0, ans=0.125 2023-06-21 18:11:36,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1028358.0, ans=0.0 2023-06-21 18:12:14,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1028478.0, ans=0.125 2023-06-21 18:12:20,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1028478.0, ans=0.0 2023-06-21 18:12:23,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1028538.0, ans=0.2 2023-06-21 18:12:24,801 INFO [train.py:996] (1/4) Epoch 6, batch 18950, loss[loss=0.2672, simple_loss=0.3304, pruned_loss=0.1021, over 21668.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2996, pruned_loss=0.08131, over 4250706.82 frames. ], batch size: 263, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:13:20,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1028658.0, ans=0.1 2023-06-21 18:13:48,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.005e+02 3.685e+02 4.757e+02 8.623e+02, threshold=7.371e+02, percent-clipped=7.0 2023-06-21 18:14:04,046 INFO [train.py:996] (1/4) Epoch 6, batch 19000, loss[loss=0.2409, simple_loss=0.3219, pruned_loss=0.07994, over 20740.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3112, pruned_loss=0.08282, over 4262684.01 frames. ], batch size: 609, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:15:36,829 INFO [train.py:996] (1/4) Epoch 6, batch 19050, loss[loss=0.2487, simple_loss=0.3059, pruned_loss=0.09578, over 21316.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3166, pruned_loss=0.0874, over 4272174.50 frames. ], batch size: 176, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:16:20,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-21 18:16:26,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1029258.0, ans=0.125 2023-06-21 18:16:44,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1029318.0, ans=0.125 2023-06-21 18:16:46,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1029318.0, ans=0.0 2023-06-21 18:16:52,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1029378.0, ans=0.125 2023-06-21 18:16:54,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.879e+02 3.312e+02 3.801e+02 5.598e+02, threshold=6.624e+02, percent-clipped=0.0 2023-06-21 18:17:10,666 INFO [train.py:996] (1/4) Epoch 6, batch 19100, loss[loss=0.2361, simple_loss=0.2911, pruned_loss=0.09057, over 21223.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3137, pruned_loss=0.08769, over 4273836.20 frames. 
], batch size: 143, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:17:28,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-21 18:17:43,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1029498.0, ans=0.0 2023-06-21 18:18:07,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1029558.0, ans=0.0 2023-06-21 18:18:51,226 INFO [train.py:996] (1/4) Epoch 6, batch 19150, loss[loss=0.2528, simple_loss=0.3488, pruned_loss=0.07846, over 21782.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3135, pruned_loss=0.08716, over 4262864.64 frames. ], batch size: 282, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:19:05,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-21 18:19:17,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-21 18:19:46,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1029918.0, ans=0.0 2023-06-21 18:19:46,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1029918.0, ans=0.1 2023-06-21 18:20:01,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-21 18:20:17,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.009e+02 3.594e+02 4.563e+02 7.134e+02, threshold=7.188e+02, percent-clipped=1.0 2023-06-21 18:20:25,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030038.0, ans=0.1 2023-06-21 18:20:31,689 INFO [train.py:996] (1/4) Epoch 6, batch 19200, loss[loss=0.385, simple_loss=0.4577, pruned_loss=0.1561, over 21496.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3267, pruned_loss=0.08909, over 4268932.78 frames. ], batch size: 507, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:20:58,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030098.0, ans=0.1 2023-06-21 18:21:53,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1030278.0, ans=0.2 2023-06-21 18:22:00,350 INFO [train.py:996] (1/4) Epoch 6, batch 19250, loss[loss=0.1951, simple_loss=0.2888, pruned_loss=0.05073, over 21860.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3252, pruned_loss=0.08408, over 4263085.64 frames. ], batch size: 316, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:22:16,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1030338.0, ans=0.2 2023-06-21 18:22:22,732 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.08 vs. 
limit=15.0 2023-06-21 18:22:45,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1030458.0, ans=0.5 2023-06-21 18:23:00,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1030518.0, ans=0.0 2023-06-21 18:23:14,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1030578.0, ans=0.125 2023-06-21 18:23:25,536 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.595e+02 3.039e+02 3.485e+02 4.997e+02, threshold=6.078e+02, percent-clipped=0.0 2023-06-21 18:23:37,842 INFO [train.py:996] (1/4) Epoch 6, batch 19300, loss[loss=0.2666, simple_loss=0.3245, pruned_loss=0.1044, over 21436.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3219, pruned_loss=0.08373, over 4271729.46 frames. ], batch size: 144, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:23:38,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1030638.0, ans=0.07 2023-06-21 18:24:09,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-21 18:24:42,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1030818.0, ans=0.125 2023-06-21 18:24:56,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1030878.0, ans=0.125 2023-06-21 18:25:05,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1030878.0, ans=0.125 2023-06-21 18:25:07,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-21 18:25:14,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030878.0, ans=0.1 2023-06-21 18:25:17,123 INFO [train.py:996] (1/4) Epoch 6, batch 19350, loss[loss=0.2823, simple_loss=0.3899, pruned_loss=0.08733, over 19813.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3153, pruned_loss=0.07917, over 4270320.40 frames. ], batch size: 703, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:25:28,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-21 18:25:32,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1030998.0, ans=0.09899494936611666 2023-06-21 18:26:14,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1031118.0, ans=0.0 2023-06-21 18:26:15,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=6.0 2023-06-21 18:26:19,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1031118.0, ans=0.0 2023-06-21 18:26:38,756 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.597e+02 3.186e+02 4.013e+02 6.947e+02, threshold=6.372e+02, percent-clipped=2.0 2023-06-21 18:26:50,821 INFO [train.py:996] (1/4) Epoch 6, batch 19400, loss[loss=0.2309, simple_loss=0.3058, pruned_loss=0.07802, over 21906.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3112, pruned_loss=0.07773, over 4273790.10 frames. ], batch size: 333, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:26:51,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1031238.0, ans=0.125 2023-06-21 18:26:55,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.47 vs. limit=22.5 2023-06-21 18:27:57,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1031418.0, ans=0.125 2023-06-21 18:28:22,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1031538.0, ans=0.0 2023-06-21 18:28:24,286 INFO [train.py:996] (1/4) Epoch 6, batch 19450, loss[loss=0.2417, simple_loss=0.2895, pruned_loss=0.09697, over 21578.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3101, pruned_loss=0.08037, over 4284367.53 frames. ], batch size: 247, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:28:42,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1031598.0, ans=0.2 2023-06-21 18:29:08,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-21 18:29:19,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-21 18:29:20,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1031718.0, ans=0.125 2023-06-21 18:29:22,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-21 18:29:51,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.868e+02 3.154e+02 3.556e+02 5.974e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-21 18:29:58,638 INFO [train.py:996] (1/4) Epoch 6, batch 19500, loss[loss=0.2144, simple_loss=0.2759, pruned_loss=0.07648, over 21510.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3071, pruned_loss=0.08239, over 4272902.35 frames. 
], batch size: 230, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:30:00,523 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:30:43,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1031958.0, ans=0.125 2023-06-21 18:31:00,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1032018.0, ans=0.125 2023-06-21 18:31:34,624 INFO [train.py:996] (1/4) Epoch 6, batch 19550, loss[loss=0.2127, simple_loss=0.3018, pruned_loss=0.06179, over 21628.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3063, pruned_loss=0.08137, over 4275287.88 frames. ], batch size: 230, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:31:52,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1032198.0, ans=0.125 2023-06-21 18:32:20,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1032258.0, ans=0.125 2023-06-21 18:32:25,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1032318.0, ans=0.04949747468305833 2023-06-21 18:32:54,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1032378.0, ans=0.0 2023-06-21 18:32:55,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.95 vs. limit=6.0 2023-06-21 18:33:00,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.941e+02 3.432e+02 4.314e+02 8.392e+02, threshold=6.865e+02, percent-clipped=2.0 2023-06-21 18:33:07,588 INFO [train.py:996] (1/4) Epoch 6, batch 19600, loss[loss=0.235, simple_loss=0.2956, pruned_loss=0.08721, over 21522.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3072, pruned_loss=0.08199, over 4284626.05 frames. ], batch size: 211, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:33:19,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1032438.0, ans=0.125 2023-06-21 18:33:24,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1032498.0, ans=0.0 2023-06-21 18:34:11,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-21 18:34:23,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-21 18:34:30,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1032678.0, ans=0.125 2023-06-21 18:34:31,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-21 18:34:42,388 INFO [train.py:996] (1/4) Epoch 6, batch 19650, loss[loss=0.2307, simple_loss=0.3027, pruned_loss=0.07933, over 21899.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3133, pruned_loss=0.08729, over 4290679.21 frames. 
], batch size: 316, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:35:31,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1032858.0, ans=0.1 2023-06-21 18:35:31,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1032858.0, ans=0.05 2023-06-21 18:36:09,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1032978.0, ans=0.125 2023-06-21 18:36:10,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.307e+02 3.868e+02 4.631e+02 9.125e+02, threshold=7.736e+02, percent-clipped=5.0 2023-06-21 18:36:16,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032978.0, ans=0.1 2023-06-21 18:36:23,457 INFO [train.py:996] (1/4) Epoch 6, batch 19700, loss[loss=0.2284, simple_loss=0.3376, pruned_loss=0.05955, over 20793.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3183, pruned_loss=0.08788, over 4285602.64 frames. ], batch size: 608, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:37:09,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1033158.0, ans=0.125 2023-06-21 18:37:23,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1033218.0, ans=0.2 2023-06-21 18:37:58,189 INFO [train.py:996] (1/4) Epoch 6, batch 19750, loss[loss=0.2663, simple_loss=0.3453, pruned_loss=0.09366, over 21648.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3287, pruned_loss=0.09034, over 4282005.89 frames. ], batch size: 263, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:38:18,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-21 18:38:23,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033398.0, ans=0.125 2023-06-21 18:38:41,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1033458.0, ans=0.125 2023-06-21 18:38:56,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1033518.0, ans=0.0 2023-06-21 18:39:12,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1033578.0, ans=0.0 2023-06-21 18:39:24,364 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.146e+02 3.443e+02 4.036e+02 6.788e+02, threshold=6.886e+02, percent-clipped=0.0 2023-06-21 18:39:31,801 INFO [train.py:996] (1/4) Epoch 6, batch 19800, loss[loss=0.1848, simple_loss=0.2571, pruned_loss=0.05623, over 21551.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3277, pruned_loss=0.09097, over 4288688.44 frames. 
], batch size: 212, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:39:32,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1033638.0, ans=0.0 2023-06-21 18:39:35,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1033638.0, ans=0.2 2023-06-21 18:39:47,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1033638.0, ans=0.0 2023-06-21 18:40:10,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1033758.0, ans=0.05 2023-06-21 18:40:16,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1033758.0, ans=0.125 2023-06-21 18:41:11,658 INFO [train.py:996] (1/4) Epoch 6, batch 19850, loss[loss=0.1796, simple_loss=0.2479, pruned_loss=0.05563, over 21302.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3198, pruned_loss=0.08511, over 4288808.80 frames. ], batch size: 131, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:41:31,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1033998.0, ans=0.0 2023-06-21 18:41:33,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1033998.0, ans=0.125 2023-06-21 18:42:28,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-06-21 18:42:30,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.29 vs. limit=10.0 2023-06-21 18:42:33,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.771e+02 3.274e+02 3.883e+02 8.276e+02, threshold=6.549e+02, percent-clipped=3.0 2023-06-21 18:42:44,454 INFO [train.py:996] (1/4) Epoch 6, batch 19900, loss[loss=0.2492, simple_loss=0.314, pruned_loss=0.09221, over 21796.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3183, pruned_loss=0.08198, over 4288176.61 frames. ], batch size: 371, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:43:14,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-21 18:43:26,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1034358.0, ans=0.2 2023-06-21 18:43:30,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1034358.0, ans=0.125 2023-06-21 18:43:44,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1034418.0, ans=0.125 2023-06-21 18:44:09,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1034478.0, ans=0.125 2023-06-21 18:44:19,351 INFO [train.py:996] (1/4) Epoch 6, batch 19950, loss[loss=0.2667, simple_loss=0.3122, pruned_loss=0.1106, over 21845.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3138, pruned_loss=0.08212, over 4282084.54 frames. 
], batch size: 107, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:44:43,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1034598.0, ans=0.125 2023-06-21 18:44:49,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1034598.0, ans=0.125 2023-06-21 18:45:22,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1034718.0, ans=0.125 2023-06-21 18:45:32,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1034778.0, ans=0.125 2023-06-21 18:45:47,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.877e+02 3.269e+02 3.817e+02 6.552e+02, threshold=6.538e+02, percent-clipped=1.0 2023-06-21 18:45:53,428 INFO [train.py:996] (1/4) Epoch 6, batch 20000, loss[loss=0.2228, simple_loss=0.3081, pruned_loss=0.06882, over 21809.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3153, pruned_loss=0.08316, over 4289476.00 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:45:54,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-21 18:46:19,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-21 18:46:20,219 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:46:30,357 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:46:45,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1034958.0, ans=0.125 2023-06-21 18:46:46,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1035018.0, ans=0.125 2023-06-21 18:47:10,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-21 18:47:26,471 INFO [train.py:996] (1/4) Epoch 6, batch 20050, loss[loss=0.2448, simple_loss=0.3137, pruned_loss=0.08798, over 21903.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.316, pruned_loss=0.0853, over 4290262.48 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:48:19,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1035318.0, ans=0.2 2023-06-21 18:48:54,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 2.772e+02 3.158e+02 3.739e+02 6.638e+02, threshold=6.316e+02, percent-clipped=1.0 2023-06-21 18:48:58,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1035378.0, ans=0.0 2023-06-21 18:49:00,939 INFO [train.py:996] (1/4) Epoch 6, batch 20100, loss[loss=0.2408, simple_loss=0.3307, pruned_loss=0.07551, over 21686.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3177, pruned_loss=0.08691, over 4296462.25 frames. 
], batch size: 263, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:49:15,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1035438.0, ans=0.0 2023-06-21 18:49:18,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-21 18:49:23,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. limit=6.0 2023-06-21 18:50:23,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1035678.0, ans=0.2 2023-06-21 18:50:38,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035678.0, ans=0.125 2023-06-21 18:50:45,517 INFO [train.py:996] (1/4) Epoch 6, batch 20150, loss[loss=0.2725, simple_loss=0.351, pruned_loss=0.09702, over 21793.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3275, pruned_loss=0.0905, over 4294755.08 frames. ], batch size: 118, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:50:53,849 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:51:31,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-21 18:52:00,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1035918.0, ans=0.125 2023-06-21 18:52:17,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.447e+02 4.089e+02 4.717e+02 8.481e+02, threshold=8.179e+02, percent-clipped=7.0 2023-06-21 18:52:22,523 INFO [train.py:996] (1/4) Epoch 6, batch 20200, loss[loss=0.2734, simple_loss=0.3659, pruned_loss=0.09043, over 21807.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3335, pruned_loss=0.09338, over 4281913.47 frames. ], batch size: 316, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:52:27,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1036038.0, ans=0.125 2023-06-21 18:53:02,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1036158.0, ans=0.0 2023-06-21 18:54:02,157 INFO [train.py:996] (1/4) Epoch 6, batch 20250, loss[loss=0.2571, simple_loss=0.3263, pruned_loss=0.09398, over 21756.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3335, pruned_loss=0.09197, over 4280589.80 frames. ], batch size: 112, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:54:16,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1036398.0, ans=0.0 2023-06-21 18:55:22,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.562e+02 2.955e+02 3.343e+02 6.024e+02, threshold=5.910e+02, percent-clipped=0.0 2023-06-21 18:55:30,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1036638.0, ans=0.125 2023-06-21 18:55:31,046 INFO [train.py:996] (1/4) Epoch 6, batch 20300, loss[loss=0.2243, simple_loss=0.3083, pruned_loss=0.07015, over 21626.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3312, pruned_loss=0.08864, over 4266961.91 frames. 
], batch size: 263, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:55:34,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1036638.0, ans=0.0 2023-06-21 18:55:43,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1036638.0, ans=0.125 2023-06-21 18:56:45,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1036878.0, ans=0.125 2023-06-21 18:56:59,740 INFO [train.py:996] (1/4) Epoch 6, batch 20350, loss[loss=0.2651, simple_loss=0.3301, pruned_loss=0.1001, over 21790.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3314, pruned_loss=0.08866, over 4263812.56 frames. ], batch size: 247, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:57:00,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1036938.0, ans=0.0 2023-06-21 18:57:10,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036938.0, ans=0.1 2023-06-21 18:57:51,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-21 18:58:18,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1037178.0, ans=0.125 2023-06-21 18:58:27,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037178.0, ans=0.125 2023-06-21 18:58:34,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.934e+02 3.391e+02 4.276e+02 6.956e+02, threshold=6.781e+02, percent-clipped=4.0 2023-06-21 18:58:37,439 INFO [train.py:996] (1/4) Epoch 6, batch 20400, loss[loss=0.295, simple_loss=0.3602, pruned_loss=0.1149, over 21783.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3333, pruned_loss=0.09115, over 4256912.64 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:58:46,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037238.0, ans=0.1 2023-06-21 18:59:22,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1037358.0, ans=0.0 2023-06-21 18:59:45,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0 2023-06-21 19:00:03,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037478.0, ans=0.1 2023-06-21 19:00:05,781 INFO [train.py:996] (1/4) Epoch 6, batch 20450, loss[loss=0.2683, simple_loss=0.3344, pruned_loss=0.1011, over 21822.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3345, pruned_loss=0.09388, over 4254946.27 frames. ], batch size: 124, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:00:33,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. 
limit=15.0 2023-06-21 19:00:43,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1037658.0, ans=0.125 2023-06-21 19:00:45,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037658.0, ans=0.1 2023-06-21 19:00:48,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1037658.0, ans=0.0 2023-06-21 19:01:36,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.567e+02 4.308e+02 5.221e+02 9.242e+02, threshold=8.616e+02, percent-clipped=7.0 2023-06-21 19:01:39,735 INFO [train.py:996] (1/4) Epoch 6, batch 20500, loss[loss=0.26, simple_loss=0.3189, pruned_loss=0.1006, over 21897.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3292, pruned_loss=0.09395, over 4253764.40 frames. ], batch size: 107, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:01:40,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1037838.0, ans=0.125 2023-06-21 19:02:05,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037898.0, ans=0.1 2023-06-21 19:02:10,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-21 19:02:26,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1037958.0, ans=0.0 2023-06-21 19:02:36,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1038018.0, ans=0.0 2023-06-21 19:02:43,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1038018.0, ans=0.0 2023-06-21 19:02:53,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 19:03:03,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1038078.0, ans=0.5 2023-06-21 19:03:14,145 INFO [train.py:996] (1/4) Epoch 6, batch 20550, loss[loss=0.2142, simple_loss=0.2867, pruned_loss=0.07087, over 21386.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3218, pruned_loss=0.09222, over 4253295.15 frames. ], batch size: 131, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:04:06,112 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:04:45,835 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.022e+02 3.769e+02 4.529e+02 7.328e+02, threshold=7.538e+02, percent-clipped=0.0 2023-06-21 19:04:48,965 INFO [train.py:996] (1/4) Epoch 6, batch 20600, loss[loss=0.2779, simple_loss=0.3275, pruned_loss=0.1141, over 21224.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3242, pruned_loss=0.09055, over 4251506.01 frames. 
], batch size: 159, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:05:49,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1038618.0, ans=15.0 2023-06-21 19:05:53,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:06:21,547 INFO [train.py:996] (1/4) Epoch 6, batch 20650, loss[loss=0.1849, simple_loss=0.2518, pruned_loss=0.05897, over 21532.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3203, pruned_loss=0.09065, over 4253766.56 frames. ], batch size: 263, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:06:50,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1038798.0, ans=0.125 2023-06-21 19:06:50,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1038798.0, ans=0.0 2023-06-21 19:07:55,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.764e+02 3.114e+02 3.548e+02 5.059e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-21 19:07:57,288 INFO [train.py:996] (1/4) Epoch 6, batch 20700, loss[loss=0.2444, simple_loss=0.3254, pruned_loss=0.08173, over 21824.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.314, pruned_loss=0.08736, over 4236713.28 frames. ], batch size: 371, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:08:17,973 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-21 19:08:26,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1039098.0, ans=0.1 2023-06-21 19:08:59,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1039218.0, ans=0.035 2023-06-21 19:09:20,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1039278.0, ans=0.0 2023-06-21 19:09:37,894 INFO [train.py:996] (1/4) Epoch 6, batch 20750, loss[loss=0.2694, simple_loss=0.397, pruned_loss=0.07088, over 20769.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3163, pruned_loss=0.08579, over 4239329.50 frames. ], batch size: 607, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:09:46,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.50 vs. limit=22.5 2023-06-21 19:10:09,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1039398.0, ans=0.125 2023-06-21 19:10:50,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1039518.0, ans=0.0 2023-06-21 19:11:07,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1039578.0, ans=0.125 2023-06-21 19:11:11,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.291e+02 4.399e+02 5.919e+02 1.160e+03, threshold=8.798e+02, percent-clipped=22.0 2023-06-21 19:11:12,891 INFO [train.py:996] (1/4) Epoch 6, batch 20800, loss[loss=0.2371, simple_loss=0.2934, pruned_loss=0.0904, over 21730.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3209, pruned_loss=0.08785, over 4238822.48 frames. 
], batch size: 124, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:11:28,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1039698.0, ans=0.0 2023-06-21 19:11:38,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1039698.0, ans=0.0 2023-06-21 19:11:46,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1039758.0, ans=0.2 2023-06-21 19:12:00,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=22.5 2023-06-21 19:12:02,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1039758.0, ans=0.125 2023-06-21 19:12:19,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-21 19:12:45,676 INFO [train.py:996] (1/4) Epoch 6, batch 20850, loss[loss=0.1889, simple_loss=0.2581, pruned_loss=0.05987, over 21580.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3117, pruned_loss=0.0853, over 4244171.53 frames. ], batch size: 230, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:12:45,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1039938.0, ans=0.1 2023-06-21 19:12:53,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-21 19:13:06,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-21 19:13:25,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1040058.0, ans=0.2 2023-06-21 19:13:27,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-21 19:13:43,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1040118.0, ans=0.05 2023-06-21 19:14:17,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.782e+02 3.461e+02 4.341e+02 9.177e+02, threshold=6.922e+02, percent-clipped=2.0 2023-06-21 19:14:18,808 INFO [train.py:996] (1/4) Epoch 6, batch 20900, loss[loss=0.2266, simple_loss=0.2968, pruned_loss=0.07816, over 21259.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3147, pruned_loss=0.08676, over 4246210.54 frames. ], batch size: 159, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:14:28,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1040238.0, ans=0.125 2023-06-21 19:14:36,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1040298.0, ans=0.125 2023-06-21 19:15:51,576 INFO [train.py:996] (1/4) Epoch 6, batch 20950, loss[loss=0.1901, simple_loss=0.2648, pruned_loss=0.05771, over 21331.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3104, pruned_loss=0.08292, over 4251412.26 frames. 
], batch size: 194, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:16:07,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1040598.0, ans=0.125 2023-06-21 19:16:18,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1040598.0, ans=0.125 2023-06-21 19:17:17,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.529e+02 2.877e+02 3.319e+02 6.338e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-21 19:17:19,463 INFO [train.py:996] (1/4) Epoch 6, batch 21000, loss[loss=0.233, simple_loss=0.3008, pruned_loss=0.08258, over 21883.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3088, pruned_loss=0.08345, over 4261985.77 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:17:19,463 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 19:17:28,613 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8009, 3.1428, 3.2670, 3.1159], device='cuda:1') 2023-06-21 19:17:35,752 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2688, simple_loss=0.3681, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-21 19:17:35,752 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 19:17:36,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1040838.0, ans=0.125 2023-06-21 19:17:41,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-21 19:18:12,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 19:18:32,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1041018.0, ans=0.0 2023-06-21 19:18:35,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1041018.0, ans=0.125 2023-06-21 19:18:55,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1041078.0, ans=0.125 2023-06-21 19:19:08,162 INFO [train.py:996] (1/4) Epoch 6, batch 21050, loss[loss=0.2262, simple_loss=0.2866, pruned_loss=0.08287, over 21248.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.306, pruned_loss=0.08364, over 4256525.67 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:19:18,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1041138.0, ans=0.04949747468305833 2023-06-21 19:19:26,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.08 vs. 
limit=10.0 2023-06-21 19:19:30,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1041198.0, ans=0.0 2023-06-21 19:19:41,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1041258.0, ans=0.0 2023-06-21 19:20:05,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041318.0, ans=0.1 2023-06-21 19:20:34,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.795e+02 3.116e+02 3.832e+02 6.783e+02, threshold=6.232e+02, percent-clipped=3.0 2023-06-21 19:20:36,441 INFO [train.py:996] (1/4) Epoch 6, batch 21100, loss[loss=0.2424, simple_loss=0.3011, pruned_loss=0.0919, over 21513.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3016, pruned_loss=0.0824, over 4262961.96 frames. ], batch size: 414, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:21:25,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1041558.0, ans=0.1 2023-06-21 19:22:01,002 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:22:10,048 INFO [train.py:996] (1/4) Epoch 6, batch 21150, loss[loss=0.245, simple_loss=0.3565, pruned_loss=0.06674, over 19768.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2975, pruned_loss=0.0825, over 4267489.57 frames. ], batch size: 703, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:23:18,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1041918.0, ans=0.0 2023-06-21 19:23:41,596 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.856e+02 3.274e+02 4.026e+02 6.885e+02, threshold=6.548e+02, percent-clipped=2.0 2023-06-21 19:23:43,128 INFO [train.py:996] (1/4) Epoch 6, batch 21200, loss[loss=0.2245, simple_loss=0.2822, pruned_loss=0.0834, over 21348.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2929, pruned_loss=0.08107, over 4253499.46 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 32.0 2023-06-21 19:25:05,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1042278.0, ans=0.125 2023-06-21 19:25:06,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1042278.0, ans=0.125 2023-06-21 19:25:09,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1042278.0, ans=0.2 2023-06-21 19:25:12,305 INFO [train.py:996] (1/4) Epoch 6, batch 21250, loss[loss=0.2211, simple_loss=0.2854, pruned_loss=0.07841, over 21161.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2921, pruned_loss=0.08181, over 4253103.91 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:26:22,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1042578.0, ans=0.125 2023-06-21 19:26:22,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=15.0 2023-06-21 19:26:41,154 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.008e+02 3.484e+02 4.132e+02 8.300e+02, threshold=6.969e+02, percent-clipped=3.0 2023-06-21 19:26:41,174 INFO [train.py:996] (1/4) Epoch 6, batch 21300, loss[loss=0.2649, simple_loss=0.3315, pruned_loss=0.09912, over 21850.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.2992, pruned_loss=0.08428, over 4250463.85 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:26:55,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1042698.0, ans=0.125 2023-06-21 19:27:46,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1042818.0, ans=0.2 2023-06-21 19:28:14,537 INFO [train.py:996] (1/4) Epoch 6, batch 21350, loss[loss=0.2092, simple_loss=0.279, pruned_loss=0.06968, over 21144.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3034, pruned_loss=0.08448, over 4256673.93 frames. ], batch size: 143, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:28:24,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1042938.0, ans=0.1 2023-06-21 19:29:11,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1043058.0, ans=0.125 2023-06-21 19:29:35,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1043178.0, ans=0.2 2023-06-21 19:29:49,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.778e+02 3.087e+02 3.779e+02 5.135e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 19:29:49,131 INFO [train.py:996] (1/4) Epoch 6, batch 21400, loss[loss=0.3503, simple_loss=0.3989, pruned_loss=0.1508, over 21338.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3068, pruned_loss=0.08343, over 4261943.67 frames. ], batch size: 507, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:29:50,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1043238.0, ans=0.0 2023-06-21 19:31:05,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1043418.0, ans=0.2 2023-06-21 19:31:17,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-21 19:31:22,531 INFO [train.py:996] (1/4) Epoch 6, batch 21450, loss[loss=0.2551, simple_loss=0.3162, pruned_loss=0.09698, over 21462.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3109, pruned_loss=0.08544, over 4263219.82 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:31:24,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1043538.0, ans=0.0 2023-06-21 19:31:25,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1043538.0, ans=0.0 2023-06-21 19:31:31,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043538.0, ans=0.1 2023-06-21 19:32:55,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. 
limit=15.0 2023-06-21 19:32:55,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.827e+02 3.165e+02 3.718e+02 5.694e+02, threshold=6.329e+02, percent-clipped=0.0 2023-06-21 19:32:55,686 INFO [train.py:996] (1/4) Epoch 6, batch 21500, loss[loss=0.2238, simple_loss=0.2889, pruned_loss=0.07937, over 21723.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.309, pruned_loss=0.08715, over 4271770.33 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:33:49,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1043958.0, ans=0.2 2023-06-21 19:34:02,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-21 19:34:05,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1044018.0, ans=0.125 2023-06-21 19:34:29,719 INFO [train.py:996] (1/4) Epoch 6, batch 21550, loss[loss=0.2107, simple_loss=0.2766, pruned_loss=0.07247, over 21611.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3026, pruned_loss=0.0845, over 4265240.19 frames. ], batch size: 298, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:34:33,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1044138.0, ans=0.125 2023-06-21 19:35:13,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-21 19:36:03,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1044438.0, ans=0.125 2023-06-21 19:36:04,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.859e+02 3.362e+02 4.302e+02 8.120e+02, threshold=6.725e+02, percent-clipped=2.0 2023-06-21 19:36:04,931 INFO [train.py:996] (1/4) Epoch 6, batch 21600, loss[loss=0.2283, simple_loss=0.2935, pruned_loss=0.08155, over 21246.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.2996, pruned_loss=0.08309, over 4258083.36 frames. ], batch size: 159, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:36:48,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1044558.0, ans=0.2 2023-06-21 19:36:50,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1044558.0, ans=0.125 2023-06-21 19:36:59,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044558.0, ans=0.1 2023-06-21 19:37:09,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-21 19:37:39,253 INFO [train.py:996] (1/4) Epoch 6, batch 21650, loss[loss=0.1879, simple_loss=0.2742, pruned_loss=0.05081, over 21309.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3042, pruned_loss=0.08108, over 4252705.87 frames. 
], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:38:03,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1044798.0, ans=0.125 2023-06-21 19:38:25,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1044858.0, ans=0.125 2023-06-21 19:38:26,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1044858.0, ans=0.1 2023-06-21 19:38:41,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1044858.0, ans=10.0 2023-06-21 19:38:41,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1044858.0, ans=0.0 2023-06-21 19:38:50,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-21 19:38:54,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1044918.0, ans=0.125 2023-06-21 19:38:55,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-21 19:39:13,275 INFO [train.py:996] (1/4) Epoch 6, batch 21700, loss[loss=0.2364, simple_loss=0.2973, pruned_loss=0.08774, over 21557.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3041, pruned_loss=0.07987, over 4257908.99 frames. ], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:39:14,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.795e+02 3.316e+02 4.085e+02 7.380e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-21 19:40:00,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1045158.0, ans=0.0 2023-06-21 19:40:01,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1045158.0, ans=0.0 2023-06-21 19:40:18,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1045218.0, ans=0.125 2023-06-21 19:40:34,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1045278.0, ans=0.125 2023-06-21 19:40:45,985 INFO [train.py:996] (1/4) Epoch 6, batch 21750, loss[loss=0.2125, simple_loss=0.2676, pruned_loss=0.07868, over 21227.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.301, pruned_loss=0.08017, over 4254823.45 frames. ], batch size: 551, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:40:47,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1045338.0, ans=0.0 2023-06-21 19:40:53,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1045338.0, ans=0.0 2023-06-21 19:41:55,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1045518.0, ans=0.125 2023-06-21 19:42:19,841 INFO [train.py:996] (1/4) Epoch 6, batch 21800, loss[loss=0.2528, simple_loss=0.3511, pruned_loss=0.07729, over 19911.00 frames. 
], tot_loss[loss=0.2303, simple_loss=0.2988, pruned_loss=0.08089, over 4238174.29 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:42:21,225 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.706e+02 3.025e+02 3.711e+02 5.699e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-21 19:42:48,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1045698.0, ans=0.125 2023-06-21 19:43:38,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-21 19:43:51,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-21 19:43:53,867 INFO [train.py:996] (1/4) Epoch 6, batch 21850, loss[loss=0.235, simple_loss=0.3063, pruned_loss=0.08185, over 21472.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3028, pruned_loss=0.08148, over 4238828.61 frames. ], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:44:17,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1045998.0, ans=0.125 2023-06-21 19:44:19,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. limit=10.0 2023-06-21 19:44:20,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1045998.0, ans=0.125 2023-06-21 19:44:21,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1045998.0, ans=0.0 2023-06-21 19:45:15,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-21 19:45:26,629 INFO [train.py:996] (1/4) Epoch 6, batch 21900, loss[loss=0.2343, simple_loss=0.2939, pruned_loss=0.08742, over 21775.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3044, pruned_loss=0.0825, over 4249192.42 frames. ], batch size: 333, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:45:28,140 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.966e+02 3.405e+02 4.071e+02 6.584e+02, threshold=6.811e+02, percent-clipped=2.0 2023-06-21 19:45:39,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1046238.0, ans=0.2 2023-06-21 19:45:52,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046298.0, ans=0.1 2023-06-21 19:46:09,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-21 19:46:12,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-21 19:46:30,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-21 19:46:31,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-21 19:47:00,485 INFO [train.py:996] (1/4) Epoch 6, batch 21950, loss[loss=0.1745, simple_loss=0.2524, pruned_loss=0.04832, over 21500.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2997, pruned_loss=0.08209, over 4261016.51 frames. ], batch size: 212, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:47:43,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1046658.0, ans=0.125 2023-06-21 19:48:25,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1046778.0, ans=0.125 2023-06-21 19:48:34,342 INFO [train.py:996] (1/4) Epoch 6, batch 22000, loss[loss=0.3285, simple_loss=0.3627, pruned_loss=0.1472, over 21391.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2946, pruned_loss=0.07943, over 4260413.07 frames. ], batch size: 507, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:48:40,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.414e+02 2.927e+02 3.631e+02 6.492e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 19:48:42,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1046838.0, ans=0.0 2023-06-21 19:49:45,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1047018.0, ans=0.2 2023-06-21 19:50:14,050 INFO [train.py:996] (1/4) Epoch 6, batch 22050, loss[loss=0.2295, simple_loss=0.3103, pruned_loss=0.07436, over 21254.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.2992, pruned_loss=0.08118, over 4259497.24 frames. ], batch size: 159, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:50:46,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1047198.0, ans=0.125 2023-06-21 19:51:01,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1047258.0, ans=0.125 2023-06-21 19:51:13,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1047318.0, ans=0.125 2023-06-21 19:51:48,059 INFO [train.py:996] (1/4) Epoch 6, batch 22100, loss[loss=0.2474, simple_loss=0.3151, pruned_loss=0.08983, over 21438.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3103, pruned_loss=0.08677, over 4269227.26 frames. ], batch size: 211, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:51:51,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 3.410e+02 3.908e+02 4.704e+02 7.568e+02, threshold=7.817e+02, percent-clipped=7.0 2023-06-21 19:52:11,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-21 19:52:26,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-21 19:53:22,040 INFO [train.py:996] (1/4) Epoch 6, batch 22150, loss[loss=0.2831, simple_loss=0.3346, pruned_loss=0.1158, over 21747.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3137, pruned_loss=0.08855, over 4278936.07 frames. 
], batch size: 508, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:53:24,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1047738.0, ans=0.2 2023-06-21 19:53:50,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-21 19:54:01,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1047858.0, ans=0.125 2023-06-21 19:54:01,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1047858.0, ans=0.2 2023-06-21 19:54:06,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047858.0, ans=0.1 2023-06-21 19:54:51,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-21 19:54:53,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1047978.0, ans=0.0 2023-06-21 19:55:00,791 INFO [train.py:996] (1/4) Epoch 6, batch 22200, loss[loss=0.2715, simple_loss=0.3602, pruned_loss=0.09145, over 21900.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3153, pruned_loss=0.08916, over 4282894.62 frames. ], batch size: 316, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:55:08,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 3.160e+02 3.693e+02 4.495e+02 7.335e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-21 19:55:16,349 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:55:19,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-21 19:55:26,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1048098.0, ans=0.1 2023-06-21 19:55:26,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1048098.0, ans=0.1 2023-06-21 19:55:35,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1048158.0, ans=0.125 2023-06-21 19:55:52,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1048218.0, ans=0.2 2023-06-21 19:56:06,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-21 19:56:34,370 INFO [train.py:996] (1/4) Epoch 6, batch 22250, loss[loss=0.3159, simple_loss=0.3756, pruned_loss=0.1281, over 21397.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3232, pruned_loss=0.09111, over 4288117.72 frames. 
], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:56:43,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1048338.0, ans=0.0 2023-06-21 19:57:19,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1048518.0, ans=0.0 2023-06-21 19:57:57,993 INFO [train.py:996] (1/4) Epoch 6, batch 22300, loss[loss=0.2693, simple_loss=0.3266, pruned_loss=0.106, over 21321.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3246, pruned_loss=0.09228, over 4274576.72 frames. ], batch size: 176, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:58:04,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1048638.0, ans=0.125 2023-06-21 19:58:05,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.056e+02 3.498e+02 3.964e+02 6.113e+02, threshold=6.996e+02, percent-clipped=0.0 2023-06-21 19:58:07,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1048638.0, ans=0.2 2023-06-21 19:58:11,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1048638.0, ans=0.0 2023-06-21 19:58:17,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1048698.0, ans=0.125 2023-06-21 19:58:17,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1048698.0, ans=0.0 2023-06-21 19:58:41,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1048758.0, ans=0.125 2023-06-21 19:58:54,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-21 19:59:27,786 INFO [train.py:996] (1/4) Epoch 6, batch 22350, loss[loss=0.1913, simple_loss=0.2684, pruned_loss=0.05714, over 21542.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3229, pruned_loss=0.09263, over 4279006.02 frames. ], batch size: 212, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:59:44,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1048998.0, ans=0.2 2023-06-21 20:00:04,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1049058.0, ans=0.125 2023-06-21 20:01:00,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1049238.0, ans=0.09899494936611666 2023-06-21 20:01:01,231 INFO [train.py:996] (1/4) Epoch 6, batch 22400, loss[loss=0.2136, simple_loss=0.2887, pruned_loss=0.06928, over 21559.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3175, pruned_loss=0.08832, over 4288155.58 frames. ], batch size: 230, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:01:04,241 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.868e+02 3.552e+02 4.171e+02 5.869e+02, threshold=7.104e+02, percent-clipped=0.0 2023-06-21 20:01:14,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.54 vs. 
limit=12.0 2023-06-21 20:01:39,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1049358.0, ans=0.125 2023-06-21 20:02:04,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1049418.0, ans=0.07 2023-06-21 20:02:34,799 INFO [train.py:996] (1/4) Epoch 6, batch 22450, loss[loss=0.1895, simple_loss=0.2484, pruned_loss=0.0653, over 21199.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3127, pruned_loss=0.08795, over 4272761.71 frames. ], batch size: 548, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:02:41,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1049538.0, ans=0.2 2023-06-21 20:02:44,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=15.0 2023-06-21 20:03:35,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1049718.0, ans=0.125 2023-06-21 20:04:08,832 INFO [train.py:996] (1/4) Epoch 6, batch 22500, loss[loss=0.2352, simple_loss=0.2961, pruned_loss=0.08715, over 21779.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3072, pruned_loss=0.08712, over 4273002.50 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:04:10,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1049838.0, ans=0.125 2023-06-21 20:04:11,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.833e+02 3.380e+02 4.088e+02 7.765e+02, threshold=6.760e+02, percent-clipped=2.0 2023-06-21 20:04:25,803 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:04:30,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1049898.0, ans=0.07 2023-06-21 20:04:33,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1049898.0, ans=0.125 2023-06-21 20:04:37,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1049958.0, ans=0.125 2023-06-21 20:05:01,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1050018.0, ans=0.125 2023-06-21 20:05:06,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1050018.0, ans=0.0 2023-06-21 20:05:12,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1050018.0, ans=0.125 2023-06-21 20:05:42,955 INFO [train.py:996] (1/4) Epoch 6, batch 22550, loss[loss=0.2237, simple_loss=0.2956, pruned_loss=0.07594, over 21312.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3116, pruned_loss=0.08738, over 4278232.70 frames. ], batch size: 176, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:05:46,945 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.47 vs. 
limit=15.0 2023-06-21 20:06:05,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1050198.0, ans=0.0 2023-06-21 20:06:43,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1050318.0, ans=0.04949747468305833 2023-06-21 20:06:59,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1050318.0, ans=0.125 2023-06-21 20:07:14,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1050378.0, ans=0.2 2023-06-21 20:07:18,455 INFO [train.py:996] (1/4) Epoch 6, batch 22600, loss[loss=0.1704, simple_loss=0.2108, pruned_loss=0.06499, over 16557.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3149, pruned_loss=0.08858, over 4280891.29 frames. ], batch size: 61, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:07:21,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.122e+02 3.804e+02 4.633e+02 7.875e+02, threshold=7.609e+02, percent-clipped=4.0 2023-06-21 20:07:23,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1050438.0, ans=0.125 2023-06-21 20:07:32,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-21 20:08:08,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050558.0, ans=0.1 2023-06-21 20:08:14,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1050558.0, ans=0.125 2023-06-21 20:08:47,118 INFO [train.py:996] (1/4) Epoch 6, batch 22650, loss[loss=0.2194, simple_loss=0.2862, pruned_loss=0.07627, over 21686.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3109, pruned_loss=0.08785, over 4275904.71 frames. ], batch size: 112, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:08:56,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-21 20:09:59,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1050918.0, ans=0.125 2023-06-21 20:10:19,381 INFO [train.py:996] (1/4) Epoch 6, batch 22700, loss[loss=0.2126, simple_loss=0.2771, pruned_loss=0.07408, over 21812.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3048, pruned_loss=0.08678, over 4281077.48 frames. ], batch size: 118, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:10:23,800 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.096e+02 3.667e+02 4.331e+02 7.482e+02, threshold=7.334e+02, percent-clipped=0.0 2023-06-21 20:10:26,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1051038.0, ans=0.2 2023-06-21 20:10:37,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051098.0, ans=0.125 2023-06-21 20:11:06,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.26 vs. 
limit=15.0 2023-06-21 20:11:37,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1051278.0, ans=0.0 2023-06-21 20:11:44,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-21 20:11:53,615 INFO [train.py:996] (1/4) Epoch 6, batch 22750, loss[loss=0.293, simple_loss=0.3557, pruned_loss=0.1151, over 21670.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3061, pruned_loss=0.0887, over 4273334.09 frames. ], batch size: 351, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:12:07,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051398.0, ans=0.1 2023-06-21 20:12:50,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-21 20:13:00,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1051518.0, ans=0.2 2023-06-21 20:13:26,410 INFO [train.py:996] (1/4) Epoch 6, batch 22800, loss[loss=0.2328, simple_loss=0.3161, pruned_loss=0.07472, over 21794.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3109, pruned_loss=0.09188, over 4279119.66 frames. ], batch size: 112, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:13:28,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1051638.0, ans=0.125 2023-06-21 20:13:30,786 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.968e+02 3.368e+02 3.965e+02 6.508e+02, threshold=6.737e+02, percent-clipped=0.0 2023-06-21 20:14:15,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1051758.0, ans=0.0 2023-06-21 20:14:47,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1051878.0, ans=0.2 2023-06-21 20:14:59,377 INFO [train.py:996] (1/4) Epoch 6, batch 22850, loss[loss=0.2498, simple_loss=0.3023, pruned_loss=0.09862, over 21859.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3054, pruned_loss=0.09041, over 4279027.51 frames. ], batch size: 373, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:15:33,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-21 20:16:00,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-21 20:16:09,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1052118.0, ans=0.0 2023-06-21 20:16:30,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-21 20:16:34,405 INFO [train.py:996] (1/4) Epoch 6, batch 22900, loss[loss=0.2304, simple_loss=0.3316, pruned_loss=0.06463, over 21802.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3106, pruned_loss=0.08953, over 4271440.80 frames. 
], batch size: 282, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:16:39,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.845e+02 3.273e+02 3.939e+02 6.144e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-21 20:17:49,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1052418.0, ans=0.1 2023-06-21 20:18:15,285 INFO [train.py:996] (1/4) Epoch 6, batch 22950, loss[loss=0.2624, simple_loss=0.3107, pruned_loss=0.1071, over 21890.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3282, pruned_loss=0.08938, over 4274565.85 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:18:46,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1052598.0, ans=0.125 2023-06-21 20:19:20,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1052718.0, ans=0.125 2023-06-21 20:19:37,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1052778.0, ans=0.125 2023-06-21 20:19:49,043 INFO [train.py:996] (1/4) Epoch 6, batch 23000, loss[loss=0.2142, simple_loss=0.2805, pruned_loss=0.07396, over 21213.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3256, pruned_loss=0.08676, over 4269550.37 frames. ], batch size: 608, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:19:53,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.747e+02 3.155e+02 3.821e+02 7.452e+02, threshold=6.310e+02, percent-clipped=2.0 2023-06-21 20:20:00,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1052838.0, ans=0.125 2023-06-21 20:20:22,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-21 20:20:49,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1053018.0, ans=0.125 2023-06-21 20:21:06,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-21 20:21:29,272 INFO [train.py:996] (1/4) Epoch 6, batch 23050, loss[loss=0.2646, simple_loss=0.3291, pruned_loss=0.1001, over 21781.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3263, pruned_loss=0.08853, over 4269534.64 frames. ], batch size: 247, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:22:51,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1053378.0, ans=0.0 2023-06-21 20:23:00,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1053378.0, ans=0.125 2023-06-21 20:23:02,818 INFO [train.py:996] (1/4) Epoch 6, batch 23100, loss[loss=0.2344, simple_loss=0.2956, pruned_loss=0.08665, over 21531.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3202, pruned_loss=0.08807, over 4269455.69 frames. 
], batch size: 414, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:23:07,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.234e+02 3.747e+02 4.482e+02 8.068e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-21 20:23:22,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1053498.0, ans=0.125 2023-06-21 20:24:35,495 INFO [train.py:996] (1/4) Epoch 6, batch 23150, loss[loss=0.2301, simple_loss=0.2869, pruned_loss=0.08662, over 22019.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3139, pruned_loss=0.08766, over 4265756.10 frames. ], batch size: 103, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:25:42,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1053978.0, ans=0.0 2023-06-21 20:25:58,097 INFO [train.py:996] (1/4) Epoch 6, batch 23200, loss[loss=0.246, simple_loss=0.3197, pruned_loss=0.08613, over 21904.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3133, pruned_loss=0.08868, over 4277575.43 frames. ], batch size: 124, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:26:13,452 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.774e+02 3.196e+02 3.706e+02 6.362e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-21 20:26:22,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1054098.0, ans=0.125 2023-06-21 20:26:35,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1054158.0, ans=0.0 2023-06-21 20:27:30,819 INFO [train.py:996] (1/4) Epoch 6, batch 23250, loss[loss=0.2754, simple_loss=0.3233, pruned_loss=0.1137, over 21786.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3124, pruned_loss=0.08933, over 4286652.43 frames. ], batch size: 508, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:28:04,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1054458.0, ans=0.0 2023-06-21 20:28:27,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1054518.0, ans=0.125 2023-06-21 20:29:05,991 INFO [train.py:996] (1/4) Epoch 6, batch 23300, loss[loss=0.2443, simple_loss=0.3502, pruned_loss=0.06922, over 21626.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3214, pruned_loss=0.09101, over 4287642.18 frames. ], batch size: 230, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:29:09,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054638.0, ans=0.1 2023-06-21 20:29:12,270 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.961e+02 3.509e+02 4.048e+02 6.618e+02, threshold=7.018e+02, percent-clipped=1.0 2023-06-21 20:30:40,391 INFO [train.py:996] (1/4) Epoch 6, batch 23350, loss[loss=0.2002, simple_loss=0.2593, pruned_loss=0.07059, over 21918.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.325, pruned_loss=0.08928, over 4290586.71 frames. 
], batch size: 107, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:31:31,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1055118.0, ans=0.2 2023-06-21 20:31:48,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1055118.0, ans=0.02 2023-06-21 20:32:13,198 INFO [train.py:996] (1/4) Epoch 6, batch 23400, loss[loss=0.1948, simple_loss=0.266, pruned_loss=0.06184, over 21303.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3175, pruned_loss=0.08474, over 4283395.69 frames. ], batch size: 176, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:32:13,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1055238.0, ans=0.2 2023-06-21 20:32:18,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.966e+02 3.517e+02 4.346e+02 6.933e+02, threshold=7.034e+02, percent-clipped=0.0 2023-06-21 20:32:49,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-21 20:33:47,366 INFO [train.py:996] (1/4) Epoch 6, batch 23450, loss[loss=0.2309, simple_loss=0.2996, pruned_loss=0.08109, over 20772.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3191, pruned_loss=0.08781, over 4284180.11 frames. ], batch size: 608, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:33:48,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-21 20:33:49,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1055538.0, ans=0.0 2023-06-21 20:33:51,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1055538.0, ans=0.0 2023-06-21 20:33:56,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1055538.0, ans=0.125 2023-06-21 20:34:51,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1055718.0, ans=0.2 2023-06-21 20:34:52,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1055718.0, ans=0.2 2023-06-21 20:34:56,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 20:35:20,273 INFO [train.py:996] (1/4) Epoch 6, batch 23500, loss[loss=0.2284, simple_loss=0.3077, pruned_loss=0.07453, over 21813.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3196, pruned_loss=0.09008, over 4288802.27 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:35:21,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.05 vs. 
limit=10.0 2023-06-21 20:35:27,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 2.940e+02 3.315e+02 3.870e+02 5.953e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 20:36:21,591 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:36:53,697 INFO [train.py:996] (1/4) Epoch 6, batch 23550, loss[loss=0.1928, simple_loss=0.2456, pruned_loss=0.07006, over 21295.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3127, pruned_loss=0.08915, over 4287030.96 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:37:03,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1056138.0, ans=0.125 2023-06-21 20:37:28,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1056258.0, ans=0.125 2023-06-21 20:37:29,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-21 20:38:14,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1056378.0, ans=0.125 2023-06-21 20:38:27,629 INFO [train.py:996] (1/4) Epoch 6, batch 23600, loss[loss=0.2303, simple_loss=0.3063, pruned_loss=0.07715, over 21478.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3146, pruned_loss=0.08885, over 4277176.83 frames. ], batch size: 211, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:38:34,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.807e+02 3.254e+02 4.113e+02 6.430e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 20:38:46,021 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:38:46,053 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:39:06,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-21 20:39:18,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1056558.0, ans=0.125 2023-06-21 20:39:21,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1056558.0, ans=0.125 2023-06-21 20:39:52,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1056678.0, ans=0.125 2023-06-21 20:40:01,261 INFO [train.py:996] (1/4) Epoch 6, batch 23650, loss[loss=0.2538, simple_loss=0.3338, pruned_loss=0.08696, over 21296.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3156, pruned_loss=0.08708, over 4280921.41 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:40:02,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. 
limit=6.0 2023-06-21 20:41:08,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1056918.0, ans=0.0 2023-06-21 20:41:19,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1056978.0, ans=0.1 2023-06-21 20:41:39,621 INFO [train.py:996] (1/4) Epoch 6, batch 23700, loss[loss=0.2587, simple_loss=0.3304, pruned_loss=0.09347, over 21253.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3179, pruned_loss=0.08686, over 4285081.10 frames. ], batch size: 143, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:41:51,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.889e+02 3.360e+02 4.132e+02 7.517e+02, threshold=6.720e+02, percent-clipped=1.0 2023-06-21 20:42:35,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=12.0 2023-06-21 20:42:51,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1057218.0, ans=0.125 2023-06-21 20:43:20,234 INFO [train.py:996] (1/4) Epoch 6, batch 23750, loss[loss=0.2266, simple_loss=0.3252, pruned_loss=0.06399, over 21685.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3212, pruned_loss=0.08833, over 4288949.97 frames. ], batch size: 441, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:44:03,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1057458.0, ans=0.125 2023-06-21 20:44:19,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057518.0, ans=0.1 2023-06-21 20:44:32,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1057578.0, ans=0.2 2023-06-21 20:44:55,715 INFO [train.py:996] (1/4) Epoch 6, batch 23800, loss[loss=0.2785, simple_loss=0.3737, pruned_loss=0.0916, over 21757.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3183, pruned_loss=0.08569, over 4284975.56 frames. ], batch size: 351, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:45:03,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.613e+02 2.976e+02 3.389e+02 5.789e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 20:45:49,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-06-21 20:46:30,952 INFO [train.py:996] (1/4) Epoch 6, batch 23850, loss[loss=0.2654, simple_loss=0.3394, pruned_loss=0.09569, over 21977.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3282, pruned_loss=0.08845, over 4284834.01 frames. 
], batch size: 317, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:46:37,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1057938.0, ans=0.125 2023-06-21 20:46:43,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1057938.0, ans=0.125 2023-06-21 20:46:59,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1058058.0, ans=0.125 2023-06-21 20:47:26,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1058118.0, ans=0.0 2023-06-21 20:47:59,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1058238.0, ans=10.0 2023-06-21 20:48:00,279 INFO [train.py:996] (1/4) Epoch 6, batch 23900, loss[loss=0.2675, simple_loss=0.3416, pruned_loss=0.09675, over 21673.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3365, pruned_loss=0.09116, over 4282357.32 frames. ], batch size: 332, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:48:07,685 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.320e+02 3.834e+02 4.673e+02 6.802e+02, threshold=7.669e+02, percent-clipped=5.0 2023-06-21 20:48:31,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1058358.0, ans=0.125 2023-06-21 20:48:52,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1058418.0, ans=0.0 2023-06-21 20:49:22,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-21 20:49:29,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1058478.0, ans=0.125 2023-06-21 20:49:33,461 INFO [train.py:996] (1/4) Epoch 6, batch 23950, loss[loss=0.2256, simple_loss=0.2882, pruned_loss=0.08146, over 20133.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.33, pruned_loss=0.09119, over 4280803.24 frames. ], batch size: 702, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:51:04,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1058778.0, ans=0.0 2023-06-21 20:51:08,211 INFO [train.py:996] (1/4) Epoch 6, batch 24000, loss[loss=0.2927, simple_loss=0.3556, pruned_loss=0.1149, over 20711.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3327, pruned_loss=0.09519, over 4287920.56 frames. ], batch size: 607, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:51:08,212 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 20:51:24,741 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2687, simple_loss=0.3663, pruned_loss=0.08552, over 1796401.00 frames. 2023-06-21 20:51:24,742 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 20:51:32,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.640e+02 3.190e+02 3.718e+02 4.654e+02 6.990e+02, threshold=7.435e+02, percent-clipped=0.0 2023-06-21 20:51:33,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-06-21 20:51:48,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1058898.0, ans=0.125 2023-06-21 20:52:12,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1058958.0, ans=0.125 2023-06-21 20:52:58,908 INFO [train.py:996] (1/4) Epoch 6, batch 24050, loss[loss=0.2647, simple_loss=0.3403, pruned_loss=0.0945, over 21749.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3336, pruned_loss=0.09528, over 4285210.85 frames. ], batch size: 332, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:53:35,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059198.0, ans=0.125 2023-06-21 20:53:46,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1059258.0, ans=0.125 2023-06-21 20:54:06,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1059318.0, ans=0.2 2023-06-21 20:54:33,341 INFO [train.py:996] (1/4) Epoch 6, batch 24100, loss[loss=0.3165, simple_loss=0.3794, pruned_loss=0.1268, over 21328.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3329, pruned_loss=0.09282, over 4278696.48 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:54:38,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1059438.0, ans=0.2 2023-06-21 20:54:40,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.753e+02 3.093e+02 3.531e+02 5.265e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-21 20:55:40,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.68 vs. limit=15.0 2023-06-21 20:55:54,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1059678.0, ans=0.125 2023-06-21 20:55:56,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1059678.0, ans=0.0 2023-06-21 20:56:05,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-21 20:56:07,333 INFO [train.py:996] (1/4) Epoch 6, batch 24150, loss[loss=0.2564, simple_loss=0.3222, pruned_loss=0.09531, over 21925.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3307, pruned_loss=0.09367, over 4284239.71 frames. ], batch size: 316, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:56:44,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059798.0, ans=0.1 2023-06-21 20:57:02,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1059858.0, ans=0.125 2023-06-21 20:57:13,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1059918.0, ans=0.125 2023-06-21 20:57:14,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. 
limit=8.0 2023-06-21 20:57:51,181 INFO [train.py:996] (1/4) Epoch 6, batch 24200, loss[loss=0.3576, simple_loss=0.4231, pruned_loss=0.146, over 21481.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3333, pruned_loss=0.09574, over 4285613.03 frames. ], batch size: 508, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:58:05,320 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.135e+02 3.604e+02 4.507e+02 8.443e+02, threshold=7.208e+02, percent-clipped=5.0 2023-06-21 20:58:08,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1060038.0, ans=0.125 2023-06-21 20:58:10,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1060098.0, ans=0.0 2023-06-21 20:58:11,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1060098.0, ans=0.2 2023-06-21 20:58:33,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060158.0, ans=0.125 2023-06-21 20:58:56,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-21 20:59:30,804 INFO [train.py:996] (1/4) Epoch 6, batch 24250, loss[loss=0.2008, simple_loss=0.3234, pruned_loss=0.03913, over 20773.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3297, pruned_loss=0.08813, over 4286466.44 frames. ], batch size: 607, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:59:53,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-21 21:00:30,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1060518.0, ans=0.09899494936611666 2023-06-21 21:00:47,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1060578.0, ans=0.125 2023-06-21 21:01:01,418 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:01:03,895 INFO [train.py:996] (1/4) Epoch 6, batch 24300, loss[loss=0.1862, simple_loss=0.2599, pruned_loss=0.05625, over 21336.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3216, pruned_loss=0.08254, over 4275000.06 frames. ], batch size: 159, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:01:12,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.484e+02 3.071e+02 3.742e+02 5.232e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-21 21:01:55,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1060818.0, ans=0.5 2023-06-21 21:02:37,548 INFO [train.py:996] (1/4) Epoch 6, batch 24350, loss[loss=0.2369, simple_loss=0.3089, pruned_loss=0.08243, over 21807.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.319, pruned_loss=0.0828, over 4281546.71 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:02:38,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. 
limit=15.0 2023-06-21 21:02:42,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1060938.0, ans=0.125 2023-06-21 21:02:42,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1060938.0, ans=0.0 2023-06-21 21:02:50,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1060938.0, ans=0.125 2023-06-21 21:03:14,199 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:03:20,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1061058.0, ans=10.0 2023-06-21 21:03:22,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-21 21:03:23,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1061058.0, ans=0.07 2023-06-21 21:03:41,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1061118.0, ans=0.125 2023-06-21 21:03:55,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1061178.0, ans=0.0 2023-06-21 21:04:16,763 INFO [train.py:996] (1/4) Epoch 6, batch 24400, loss[loss=0.241, simple_loss=0.319, pruned_loss=0.08154, over 21780.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3234, pruned_loss=0.08679, over 4284679.74 frames. ], batch size: 333, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 21:04:25,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.143e+02 3.570e+02 4.226e+02 5.954e+02, threshold=7.140e+02, percent-clipped=0.0 2023-06-21 21:04:29,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061238.0, ans=0.125 2023-06-21 21:04:34,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1061298.0, ans=0.125 2023-06-21 21:05:22,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1061418.0, ans=0.125 2023-06-21 21:05:30,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1061478.0, ans=0.5 2023-06-21 21:05:51,477 INFO [train.py:996] (1/4) Epoch 6, batch 24450, loss[loss=0.2853, simple_loss=0.3797, pruned_loss=0.09544, over 21463.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3248, pruned_loss=0.08751, over 4279318.64 frames. 
], batch size: 471, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:05:55,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1061538.0, ans=0.0 2023-06-21 21:06:02,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1061538.0, ans=0.0 2023-06-21 21:06:49,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061718.0, ans=0.125 2023-06-21 21:06:52,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1061718.0, ans=0.1 2023-06-21 21:07:24,540 INFO [train.py:996] (1/4) Epoch 6, batch 24500, loss[loss=0.2439, simple_loss=0.3064, pruned_loss=0.0907, over 21217.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3254, pruned_loss=0.08822, over 4280642.66 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:07:33,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.842e+02 3.184e+02 3.780e+02 5.341e+02, threshold=6.369e+02, percent-clipped=0.0 2023-06-21 21:07:42,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061898.0, ans=0.125 2023-06-21 21:08:09,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1061958.0, ans=0.0 2023-06-21 21:08:30,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1062018.0, ans=0.07 2023-06-21 21:08:39,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1062018.0, ans=0.035 2023-06-21 21:08:58,973 INFO [train.py:996] (1/4) Epoch 6, batch 24550, loss[loss=0.274, simple_loss=0.3496, pruned_loss=0.09921, over 21575.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3281, pruned_loss=0.09124, over 4282558.12 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:09,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1062318.0, ans=0.1 2023-06-21 21:10:20,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1062378.0, ans=0.2 2023-06-21 21:10:33,716 INFO [train.py:996] (1/4) Epoch 6, batch 24600, loss[loss=0.2103, simple_loss=0.2729, pruned_loss=0.07381, over 21841.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3248, pruned_loss=0.09263, over 4270533.13 frames. ], batch size: 98, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:44,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.960e+02 3.461e+02 4.086e+02 6.859e+02, threshold=6.922e+02, percent-clipped=1.0 2023-06-21 21:10:50,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1062438.0, ans=0.125 2023-06-21 21:11:47,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1062618.0, ans=0.035 2023-06-21 21:12:08,342 INFO [train.py:996] (1/4) Epoch 6, batch 24650, loss[loss=0.19, simple_loss=0.247, pruned_loss=0.06651, over 21279.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3171, pruned_loss=0.09132, over 4276248.17 frames. 
], batch size: 176, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:12:38,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1062798.0, ans=0.2 2023-06-21 21:12:57,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062858.0, ans=0.1 2023-06-21 21:13:37,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1062978.0, ans=0.0 2023-06-21 21:13:40,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1063038.0, ans=0.125 2023-06-21 21:13:41,567 INFO [train.py:996] (1/4) Epoch 6, batch 24700, loss[loss=0.2081, simple_loss=0.2805, pruned_loss=0.06785, over 21572.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3143, pruned_loss=0.08845, over 4277275.86 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:13:51,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 2.793e+02 3.149e+02 3.525e+02 6.939e+02, threshold=6.298e+02, percent-clipped=1.0 2023-06-21 21:14:07,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1063098.0, ans=0.035 2023-06-21 21:14:08,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-21 21:14:09,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1063098.0, ans=0.125 2023-06-21 21:15:06,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063278.0, ans=0.1 2023-06-21 21:15:15,545 INFO [train.py:996] (1/4) Epoch 6, batch 24750, loss[loss=0.2504, simple_loss=0.3142, pruned_loss=0.09332, over 14492.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3068, pruned_loss=0.0851, over 4265435.07 frames. ], batch size: 60, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:15:24,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1063338.0, ans=0.0 2023-06-21 21:15:37,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1063398.0, ans=0.125 2023-06-21 21:15:44,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1063398.0, ans=15.0 2023-06-21 21:16:35,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-21 21:16:38,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=12.0 2023-06-21 21:16:39,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1063578.0, ans=0.1 2023-06-21 21:16:49,356 INFO [train.py:996] (1/4) Epoch 6, batch 24800, loss[loss=0.2403, simple_loss=0.2921, pruned_loss=0.09425, over 20212.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3016, pruned_loss=0.08478, over 4260409.93 frames. 
], batch size: 703, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:16:52,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063638.0, ans=0.1 2023-06-21 21:17:07,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.811e+02 3.326e+02 3.870e+02 1.010e+03, threshold=6.653e+02, percent-clipped=1.0 2023-06-21 21:17:38,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1063758.0, ans=0.05 2023-06-21 21:17:44,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1063758.0, ans=0.125 2023-06-21 21:18:09,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1063878.0, ans=0.0 2023-06-21 21:18:22,770 INFO [train.py:996] (1/4) Epoch 6, batch 24850, loss[loss=0.2566, simple_loss=0.3122, pruned_loss=0.1005, over 21560.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3031, pruned_loss=0.08635, over 4272756.78 frames. ], batch size: 212, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:18:26,767 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-06-21 21:18:31,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-21 21:19:06,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1064058.0, ans=0.125 2023-06-21 21:19:09,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1064058.0, ans=0.125 2023-06-21 21:19:39,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-21 21:19:57,066 INFO [train.py:996] (1/4) Epoch 6, batch 24900, loss[loss=0.2624, simple_loss=0.3317, pruned_loss=0.09653, over 21595.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.307, pruned_loss=0.08717, over 4279067.42 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:20:01,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1064238.0, ans=0.2 2023-06-21 21:20:15,236 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.136e+02 3.665e+02 4.988e+02 9.346e+02, threshold=7.330e+02, percent-clipped=11.0 2023-06-21 21:20:42,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1064358.0, ans=0.125 2023-06-21 21:20:49,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-21 21:21:08,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1064418.0, ans=0.2 2023-06-21 21:21:16,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1064478.0, ans=0.125 2023-06-21 21:21:38,220 INFO [train.py:996] (1/4) Epoch 6, batch 24950, loss[loss=0.3128, simple_loss=0.3683, pruned_loss=0.1287, over 21787.00 frames. 
], tot_loss[loss=0.2483, simple_loss=0.3141, pruned_loss=0.09121, over 4278152.67 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:21:38,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1064538.0, ans=0.1 2023-06-21 21:22:04,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-21 21:22:30,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1064658.0, ans=0.125 2023-06-21 21:23:18,864 INFO [train.py:996] (1/4) Epoch 6, batch 25000, loss[loss=0.2241, simple_loss=0.2965, pruned_loss=0.07592, over 21462.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3203, pruned_loss=0.09296, over 4275928.05 frames. ], batch size: 389, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:23:36,910 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.934e+02 3.469e+02 4.480e+02 7.234e+02, threshold=6.939e+02, percent-clipped=0.0 2023-06-21 21:23:37,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1064898.0, ans=0.0 2023-06-21 21:23:56,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1064958.0, ans=0.0 2023-06-21 21:23:59,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1064958.0, ans=0.0 2023-06-21 21:24:19,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1065018.0, ans=0.0 2023-06-21 21:24:38,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1065078.0, ans=0.125 2023-06-21 21:24:38,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1065078.0, ans=0.0 2023-06-21 21:24:52,474 INFO [train.py:996] (1/4) Epoch 6, batch 25050, loss[loss=0.2071, simple_loss=0.2615, pruned_loss=0.07631, over 21478.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3138, pruned_loss=0.09212, over 4278899.43 frames. ], batch size: 212, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:25:00,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1065138.0, ans=0.035 2023-06-21 21:25:05,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1065138.0, ans=0.2 2023-06-21 21:26:27,038 INFO [train.py:996] (1/4) Epoch 6, batch 25100, loss[loss=0.2105, simple_loss=0.2676, pruned_loss=0.07672, over 20737.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3079, pruned_loss=0.09091, over 4276885.23 frames. ], batch size: 608, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:26:36,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-21 21:26:45,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.865e+02 3.430e+02 4.483e+02 9.616e+02, threshold=6.861e+02, percent-clipped=4.0 2023-06-21 21:26:52,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. 
limit=15.0 2023-06-21 21:27:23,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.27 vs. limit=15.0 2023-06-21 21:27:39,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1065678.0, ans=0.125 2023-06-21 21:27:42,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1065678.0, ans=0.0 2023-06-21 21:28:01,871 INFO [train.py:996] (1/4) Epoch 6, batch 25150, loss[loss=0.2317, simple_loss=0.3217, pruned_loss=0.07085, over 21804.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3112, pruned_loss=0.08792, over 4280235.33 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:28:15,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1065798.0, ans=0.09899494936611666 2023-06-21 21:29:28,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1065978.0, ans=6.0 2023-06-21 21:29:32,226 INFO [train.py:996] (1/4) Epoch 6, batch 25200, loss[loss=0.2812, simple_loss=0.3536, pruned_loss=0.1043, over 21804.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3106, pruned_loss=0.0853, over 4264740.31 frames. ], batch size: 414, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:29:46,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066038.0, ans=0.1 2023-06-21 21:29:55,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.627e+02 3.080e+02 3.902e+02 5.113e+02, threshold=6.160e+02, percent-clipped=0.0 2023-06-21 21:30:23,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1066158.0, ans=0.0 2023-06-21 21:30:35,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066218.0, ans=0.1 2023-06-21 21:31:06,402 INFO [train.py:996] (1/4) Epoch 6, batch 25250, loss[loss=0.2011, simple_loss=0.2709, pruned_loss=0.0656, over 21554.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.309, pruned_loss=0.08334, over 4263350.46 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:31:13,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1066338.0, ans=0.125 2023-06-21 21:32:05,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066518.0, ans=0.125 2023-06-21 21:32:46,513 INFO [train.py:996] (1/4) Epoch 6, batch 25300, loss[loss=0.2235, simple_loss=0.3028, pruned_loss=0.07212, over 21707.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3076, pruned_loss=0.08343, over 4243126.80 frames. 
], batch size: 351, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:33:05,216 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.867e+02 3.250e+02 3.935e+02 6.834e+02, threshold=6.501e+02, percent-clipped=3.0 2023-06-21 21:33:30,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1066758.0, ans=0.2 2023-06-21 21:33:34,705 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:33:56,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1066818.0, ans=0.0 2023-06-21 21:34:14,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1066878.0, ans=0.125 2023-06-21 21:34:21,340 INFO [train.py:996] (1/4) Epoch 6, batch 25350, loss[loss=0.2507, simple_loss=0.3445, pruned_loss=0.07849, over 21198.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3122, pruned_loss=0.08425, over 4252473.28 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:34:34,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1066938.0, ans=0.1 2023-06-21 21:34:36,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=22.5 2023-06-21 21:34:44,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1066998.0, ans=0.0 2023-06-21 21:35:49,719 INFO [train.py:996] (1/4) Epoch 6, batch 25400, loss[loss=0.2314, simple_loss=0.2945, pruned_loss=0.08418, over 21595.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3072, pruned_loss=0.08338, over 4261928.35 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:36:02,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1067238.0, ans=0.0 2023-06-21 21:36:13,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.660e+02 3.051e+02 3.605e+02 5.899e+02, threshold=6.102e+02, percent-clipped=0.0 2023-06-21 21:36:13,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1067298.0, ans=0.1 2023-06-21 21:36:17,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1067298.0, ans=0.035 2023-06-21 21:37:13,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1067478.0, ans=0.125 2023-06-21 21:37:21,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1067478.0, ans=0.09899494936611666 2023-06-21 21:37:30,717 INFO [train.py:996] (1/4) Epoch 6, batch 25450, loss[loss=0.2345, simple_loss=0.3276, pruned_loss=0.07071, over 21815.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3081, pruned_loss=0.08463, over 4242670.20 frames. 
], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:38:04,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1067658.0, ans=0.0 2023-06-21 21:38:17,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1067658.0, ans=0.125 2023-06-21 21:38:19,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1067658.0, ans=0.125 2023-06-21 21:38:22,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1067658.0, ans=0.1 2023-06-21 21:38:40,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1067718.0, ans=0.0 2023-06-21 21:38:58,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1067778.0, ans=0.2 2023-06-21 21:39:10,629 INFO [train.py:996] (1/4) Epoch 6, batch 25500, loss[loss=0.2849, simple_loss=0.3669, pruned_loss=0.1015, over 21499.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3078, pruned_loss=0.08159, over 4234824.00 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:39:15,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1067838.0, ans=10.0 2023-06-21 21:39:25,735 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.857e+02 3.431e+02 4.303e+02 7.136e+02, threshold=6.862e+02, percent-clipped=5.0 2023-06-21 21:39:27,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1067898.0, ans=0.2 2023-06-21 21:40:45,934 INFO [train.py:996] (1/4) Epoch 6, batch 25550, loss[loss=0.2343, simple_loss=0.3253, pruned_loss=0.07168, over 21669.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3142, pruned_loss=0.08143, over 4238939.65 frames. ], batch size: 263, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:40:46,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068138.0, ans=0.1 2023-06-21 21:40:58,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1068138.0, ans=0.0 2023-06-21 21:41:11,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068198.0, ans=0.1 2023-06-21 21:41:24,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1068258.0, ans=0.0 2023-06-21 21:41:37,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.14 vs. 
limit=22.5 2023-06-21 21:41:58,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1068318.0, ans=0.1 2023-06-21 21:42:10,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1068378.0, ans=0.125 2023-06-21 21:42:13,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1068378.0, ans=0.04949747468305833 2023-06-21 21:42:21,203 INFO [train.py:996] (1/4) Epoch 6, batch 25600, loss[loss=0.2708, simple_loss=0.3495, pruned_loss=0.09607, over 17220.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3197, pruned_loss=0.08246, over 4245367.42 frames. ], batch size: 60, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:42:41,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.868e+02 3.276e+02 3.835e+02 9.464e+02, threshold=6.552e+02, percent-clipped=3.0 2023-06-21 21:43:11,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-21 21:43:56,052 INFO [train.py:996] (1/4) Epoch 6, batch 25650, loss[loss=0.2308, simple_loss=0.2867, pruned_loss=0.08747, over 21592.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3206, pruned_loss=0.08469, over 4250345.27 frames. ], batch size: 415, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:43:59,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1068738.0, ans=0.2 2023-06-21 21:44:57,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1068918.0, ans=0.125 2023-06-21 21:45:22,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-21 21:45:28,621 INFO [train.py:996] (1/4) Epoch 6, batch 25700, loss[loss=0.2768, simple_loss=0.3264, pruned_loss=0.1137, over 21700.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3175, pruned_loss=0.08633, over 4256644.16 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:45:48,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 2.859e+02 3.225e+02 3.794e+02 7.100e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-21 21:45:49,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1069098.0, ans=0.025 2023-06-21 21:46:13,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069158.0, ans=0.1 2023-06-21 21:46:17,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=12.0 2023-06-21 21:47:05,081 INFO [train.py:996] (1/4) Epoch 6, batch 25750, loss[loss=0.2454, simple_loss=0.3205, pruned_loss=0.08511, over 21754.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3238, pruned_loss=0.08964, over 4257793.18 frames. 
], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:47:16,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1069338.0, ans=0.125 2023-06-21 21:47:31,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1069398.0, ans=0.025 2023-06-21 21:47:49,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1069458.0, ans=0.0 2023-06-21 21:48:02,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-21 21:48:19,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1069518.0, ans=0.125 2023-06-21 21:48:22,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1069518.0, ans=0.125 2023-06-21 21:48:33,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1069578.0, ans=0.125 2023-06-21 21:48:37,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069578.0, ans=0.1 2023-06-21 21:48:50,361 INFO [train.py:996] (1/4) Epoch 6, batch 25800, loss[loss=0.2495, simple_loss=0.3209, pruned_loss=0.08906, over 20783.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3322, pruned_loss=0.09365, over 4264042.91 frames. ], batch size: 608, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:49:08,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-21 21:49:10,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.168e+02 3.870e+02 4.969e+02 1.145e+03, threshold=7.739e+02, percent-clipped=13.0 2023-06-21 21:49:18,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1069698.0, ans=0.0 2023-06-21 21:49:26,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1069698.0, ans=0.1 2023-06-21 21:49:56,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-21 21:50:25,965 INFO [train.py:996] (1/4) Epoch 6, batch 25850, loss[loss=0.2855, simple_loss=0.3552, pruned_loss=0.1079, over 21625.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3339, pruned_loss=0.09311, over 4267803.85 frames. 
], batch size: 471, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:51:01,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069998.0, ans=0.125 2023-06-21 21:51:30,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1070118.0, ans=0.0 2023-06-21 21:51:33,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1070118.0, ans=0.2 2023-06-21 21:51:48,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.29 vs. limit=22.5 2023-06-21 21:51:54,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-21 21:52:10,924 INFO [train.py:996] (1/4) Epoch 6, batch 25900, loss[loss=0.2624, simple_loss=0.3363, pruned_loss=0.09419, over 21217.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3346, pruned_loss=0.09325, over 4273833.18 frames. ], batch size: 143, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:52:15,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1070238.0, ans=0.0 2023-06-21 21:52:17,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1070238.0, ans=0.125 2023-06-21 21:52:25,942 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.005e+02 3.553e+02 4.246e+02 7.646e+02, threshold=7.106e+02, percent-clipped=0.0 2023-06-21 21:52:32,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1070298.0, ans=0.2 2023-06-21 21:52:38,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1070298.0, ans=0.0 2023-06-21 21:53:08,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1070418.0, ans=0.125 2023-06-21 21:53:09,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1070418.0, ans=0.125 2023-06-21 21:53:09,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-21 21:53:32,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1070478.0, ans=0.0 2023-06-21 21:53:45,642 INFO [train.py:996] (1/4) Epoch 6, batch 25950, loss[loss=0.2498, simple_loss=0.3253, pruned_loss=0.08714, over 21596.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3405, pruned_loss=0.09614, over 4281616.44 frames. ], batch size: 112, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:53:49,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. 
limit=15.0 2023-06-21 21:54:20,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1070658.0, ans=22.5 2023-06-21 21:54:42,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1070718.0, ans=0.125 2023-06-21 21:54:45,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1070718.0, ans=0.0 2023-06-21 21:54:58,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1070718.0, ans=0.0 2023-06-21 21:55:08,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1070778.0, ans=0.125 2023-06-21 21:55:12,762 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:55:21,663 INFO [train.py:996] (1/4) Epoch 6, batch 26000, loss[loss=0.238, simple_loss=0.3554, pruned_loss=0.06025, over 19778.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3401, pruned_loss=0.09331, over 4278125.19 frames. ], batch size: 702, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:55:28,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1070838.0, ans=0.95 2023-06-21 21:55:41,609 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 3.120e+02 3.589e+02 4.615e+02 8.181e+02, threshold=7.178e+02, percent-clipped=1.0 2023-06-21 21:55:50,850 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:56:15,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1071018.0, ans=0.04949747468305833 2023-06-21 21:56:17,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1071018.0, ans=0.0 2023-06-21 21:56:26,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071018.0, ans=0.1 2023-06-21 21:56:32,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1071018.0, ans=0.125 2023-06-21 21:56:49,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=1071078.0, ans=8.0 2023-06-21 21:56:55,912 INFO [train.py:996] (1/4) Epoch 6, batch 26050, loss[loss=0.2499, simple_loss=0.3191, pruned_loss=0.09037, over 21863.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3404, pruned_loss=0.09492, over 4283993.42 frames. ], batch size: 118, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:57:15,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.23 vs. 
limit=22.5 2023-06-21 21:57:23,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1071198.0, ans=0.125 2023-06-21 21:57:45,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1071258.0, ans=0.125 2023-06-21 21:57:45,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1071258.0, ans=0.1 2023-06-21 21:57:47,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1071258.0, ans=0.125 2023-06-21 21:57:51,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-21 21:58:11,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1071318.0, ans=0.125 2023-06-21 21:58:29,345 INFO [train.py:996] (1/4) Epoch 6, batch 26100, loss[loss=0.2415, simple_loss=0.296, pruned_loss=0.09348, over 21351.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3344, pruned_loss=0.09464, over 4284690.99 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:58:37,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1071438.0, ans=0.0 2023-06-21 21:58:43,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1071438.0, ans=0.125 2023-06-21 21:58:47,239 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-21 21:58:49,048 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.066e+02 3.551e+02 4.321e+02 9.246e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 21:59:14,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1071558.0, ans=0.0 2023-06-21 21:59:23,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-21 21:59:42,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1071618.0, ans=0.0 2023-06-21 22:00:03,475 INFO [train.py:996] (1/4) Epoch 6, batch 26150, loss[loss=0.241, simple_loss=0.3023, pruned_loss=0.08979, over 20942.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3318, pruned_loss=0.09548, over 4288265.01 frames. ], batch size: 607, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:00:22,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1071798.0, ans=0.125 2023-06-21 22:00:34,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1071798.0, ans=0.125 2023-06-21 22:01:08,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1071918.0, ans=0.05 2023-06-21 22:01:18,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-06-21 22:01:20,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1071918.0, ans=0.1 2023-06-21 22:01:28,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-21 22:01:43,202 INFO [train.py:996] (1/4) Epoch 6, batch 26200, loss[loss=0.2399, simple_loss=0.3473, pruned_loss=0.0663, over 21726.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.332, pruned_loss=0.09279, over 4290111.80 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:01:58,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.878e+02 3.123e+02 3.619e+02 5.924e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-21 22:02:40,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1072158.0, ans=0.0 2023-06-21 22:02:52,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1072218.0, ans=0.0 2023-06-21 22:02:52,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-21 22:03:17,242 INFO [train.py:996] (1/4) Epoch 6, batch 26250, loss[loss=0.2587, simple_loss=0.3337, pruned_loss=0.09189, over 21754.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3353, pruned_loss=0.09224, over 4284993.12 frames. ], batch size: 441, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:03:31,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-21 22:04:16,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1072518.0, ans=0.125 2023-06-21 22:04:30,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1072578.0, ans=0.2 2023-06-21 22:04:50,862 INFO [train.py:996] (1/4) Epoch 6, batch 26300, loss[loss=0.2764, simple_loss=0.3385, pruned_loss=0.1071, over 21756.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3341, pruned_loss=0.09265, over 4290019.40 frames. ], batch size: 389, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:05:04,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1072638.0, ans=0.0 2023-06-21 22:05:10,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.912e+02 3.361e+02 4.041e+02 6.857e+02, threshold=6.722e+02, percent-clipped=1.0 2023-06-21 22:05:24,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1072698.0, ans=0.125 2023-06-21 22:05:51,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1072818.0, ans=0.125 2023-06-21 22:06:06,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.86 vs. 
limit=15.0 2023-06-21 22:06:20,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1072878.0, ans=0.125 2023-06-21 22:06:29,487 INFO [train.py:996] (1/4) Epoch 6, batch 26350, loss[loss=0.2568, simple_loss=0.3279, pruned_loss=0.09287, over 21310.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3329, pruned_loss=0.09389, over 4285587.59 frames. ], batch size: 143, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:06:37,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1072938.0, ans=0.125 2023-06-21 22:06:38,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=15.0 2023-06-21 22:06:38,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1072938.0, ans=0.5 2023-06-21 22:07:39,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1073118.0, ans=0.1 2023-06-21 22:07:41,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=12.0 2023-06-21 22:08:02,658 INFO [train.py:996] (1/4) Epoch 6, batch 26400, loss[loss=0.2446, simple_loss=0.2997, pruned_loss=0.0947, over 21651.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3259, pruned_loss=0.09362, over 4277576.37 frames. ], batch size: 298, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:08:22,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.009e+02 3.283e+02 3.744e+02 6.986e+02, threshold=6.566e+02, percent-clipped=1.0 2023-06-21 22:08:41,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1073358.0, ans=0.05 2023-06-21 22:09:37,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1073538.0, ans=0.0 2023-06-21 22:09:43,570 INFO [train.py:996] (1/4) Epoch 6, batch 26450, loss[loss=0.2824, simple_loss=0.404, pruned_loss=0.08039, over 21204.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3241, pruned_loss=0.0924, over 4268579.26 frames. ], batch size: 549, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:10:02,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1073598.0, ans=0.2 2023-06-21 22:10:52,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1073718.0, ans=0.0 2023-06-21 22:11:23,582 INFO [train.py:996] (1/4) Epoch 6, batch 26500, loss[loss=0.1652, simple_loss=0.2066, pruned_loss=0.06192, over 16348.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3259, pruned_loss=0.09151, over 4263798.74 frames. 
], batch size: 61, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:11:38,657 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.232e+02 3.914e+02 4.900e+02 8.574e+02, threshold=7.829e+02, percent-clipped=7.0 2023-06-21 22:12:03,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1073958.0, ans=0.1 2023-06-21 22:12:10,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1073958.0, ans=0.125 2023-06-21 22:12:39,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1074018.0, ans=0.95 2023-06-21 22:12:39,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-21 22:12:59,951 INFO [train.py:996] (1/4) Epoch 6, batch 26550, loss[loss=0.2326, simple_loss=0.3239, pruned_loss=0.07065, over 21702.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3257, pruned_loss=0.08863, over 4265178.33 frames. ], batch size: 415, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:13:34,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1074198.0, ans=0.0 2023-06-21 22:14:06,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1074318.0, ans=0.125 2023-06-21 22:14:34,293 INFO [train.py:996] (1/4) Epoch 6, batch 26600, loss[loss=0.2292, simple_loss=0.3008, pruned_loss=0.07881, over 21573.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3243, pruned_loss=0.08508, over 4263116.89 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:14:37,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074438.0, ans=0.1 2023-06-21 22:14:56,517 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:15:00,582 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.868e+02 3.429e+02 4.174e+02 7.700e+02, threshold=6.858e+02, percent-clipped=0.0 2023-06-21 22:15:02,920 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-21 22:15:20,756 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:15:31,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1074558.0, ans=10.0 2023-06-21 22:15:39,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=15.0 2023-06-21 22:15:44,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074618.0, ans=0.1 2023-06-21 22:16:00,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1074678.0, ans=0.09899494936611666 2023-06-21 22:16:13,048 INFO [train.py:996] (1/4) Epoch 6, batch 26650, loss[loss=0.1738, simple_loss=0.2621, pruned_loss=0.04275, over 21787.00 frames. 
], tot_loss[loss=0.2413, simple_loss=0.3166, pruned_loss=0.08301, over 4256833.01 frames. ], batch size: 333, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:16:55,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-21 22:17:50,850 INFO [train.py:996] (1/4) Epoch 6, batch 26700, loss[loss=0.1895, simple_loss=0.2865, pruned_loss=0.04624, over 20808.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3098, pruned_loss=0.08028, over 4258359.41 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:18:07,347 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.839e+02 3.537e+02 4.249e+02 6.809e+02, threshold=7.074e+02, percent-clipped=0.0 2023-06-21 22:18:14,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1075098.0, ans=0.125 2023-06-21 22:18:32,382 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:19:25,050 INFO [train.py:996] (1/4) Epoch 6, batch 26750, loss[loss=0.2684, simple_loss=0.3405, pruned_loss=0.09815, over 21926.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3092, pruned_loss=0.07901, over 4266173.87 frames. ], batch size: 372, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:19:48,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1075398.0, ans=0.2 2023-06-21 22:20:00,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1075398.0, ans=0.0 2023-06-21 22:20:01,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1075398.0, ans=0.2 2023-06-21 22:20:11,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1075458.0, ans=0.125 2023-06-21 22:21:00,185 INFO [train.py:996] (1/4) Epoch 6, batch 26800, loss[loss=0.225, simple_loss=0.2916, pruned_loss=0.0792, over 21970.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3169, pruned_loss=0.08387, over 4273567.15 frames. ], batch size: 98, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:21:25,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.803e+02 3.255e+02 3.983e+02 6.627e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 22:22:20,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1075878.0, ans=0.2 2023-06-21 22:22:25,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1075878.0, ans=0.0 2023-06-21 22:22:38,495 INFO [train.py:996] (1/4) Epoch 6, batch 26850, loss[loss=0.2332, simple_loss=0.2845, pruned_loss=0.09096, over 15256.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3199, pruned_loss=0.08811, over 4265146.96 frames. 
], batch size: 60, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:22:45,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1075938.0, ans=0.125 2023-06-21 22:22:58,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1075998.0, ans=0.05 2023-06-21 22:23:01,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1075998.0, ans=0.2 2023-06-21 22:23:05,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1075998.0, ans=0.0 2023-06-21 22:23:22,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-21 22:23:24,914 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:23:46,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1076118.0, ans=0.1 2023-06-21 22:23:46,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1076118.0, ans=0.2 2023-06-21 22:24:06,564 INFO [train.py:996] (1/4) Epoch 6, batch 26900, loss[loss=0.2131, simple_loss=0.2672, pruned_loss=0.07952, over 21270.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3121, pruned_loss=0.08705, over 4255269.75 frames. ], batch size: 177, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:24:17,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1076238.0, ans=0.125 2023-06-21 22:24:32,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 2.927e+02 3.403e+02 4.314e+02 6.686e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-21 22:24:38,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1076298.0, ans=0.0 2023-06-21 22:24:45,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-21 22:25:20,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1076418.0, ans=0.0 2023-06-21 22:25:20,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.64 vs. limit=22.5 2023-06-21 22:25:40,911 INFO [train.py:996] (1/4) Epoch 6, batch 26950, loss[loss=0.2146, simple_loss=0.2926, pruned_loss=0.06834, over 21389.00 frames. ], tot_loss[loss=0.243, simple_loss=0.311, pruned_loss=0.08748, over 4262381.65 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:27:05,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1076778.0, ans=0.125 2023-06-21 22:27:20,696 INFO [train.py:996] (1/4) Epoch 6, batch 27000, loss[loss=0.2207, simple_loss=0.2959, pruned_loss=0.07279, over 21109.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3119, pruned_loss=0.08535, over 4267574.93 frames. 
], batch size: 176, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:27:20,696 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-21 22:27:39,470 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2469, simple_loss=0.3452, pruned_loss=0.07428, over 1796401.00 frames. 2023-06-21 22:27:39,471 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-21 22:27:41,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-21 22:27:57,430 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.871e+02 3.391e+02 3.871e+02 6.119e+02, threshold=6.783e+02, percent-clipped=0.0 2023-06-21 22:27:59,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1076898.0, ans=0.125 2023-06-21 22:28:20,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1076958.0, ans=0.05 2023-06-21 22:28:45,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-21 22:29:00,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1077078.0, ans=0.125 2023-06-21 22:29:03,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1077078.0, ans=0.125 2023-06-21 22:29:09,015 INFO [train.py:996] (1/4) Epoch 6, batch 27050, loss[loss=0.1943, simple_loss=0.3241, pruned_loss=0.03226, over 20863.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3137, pruned_loss=0.08192, over 4270448.43 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:29:40,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-21 22:29:41,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-21 22:29:56,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1077258.0, ans=0.035 2023-06-21 22:30:09,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077318.0, ans=0.1 2023-06-21 22:30:28,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1077378.0, ans=0.125 2023-06-21 22:30:35,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-21 22:30:38,684 INFO [train.py:996] (1/4) Epoch 6, batch 27100, loss[loss=0.2619, simple_loss=0.3316, pruned_loss=0.09607, over 21787.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3167, pruned_loss=0.08264, over 4268782.00 frames. 
], batch size: 441, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:31:08,042 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.363e+02 4.112e+02 5.749e+02, threshold=6.726e+02, percent-clipped=0.0 2023-06-21 22:32:03,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1077678.0, ans=0.0 2023-06-21 22:32:13,368 INFO [train.py:996] (1/4) Epoch 6, batch 27150, loss[loss=0.2859, simple_loss=0.3549, pruned_loss=0.1085, over 21406.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3284, pruned_loss=0.08654, over 4276454.24 frames. ], batch size: 194, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:32:31,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1077798.0, ans=0.0 2023-06-21 22:32:57,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1077858.0, ans=0.2 2023-06-21 22:33:47,059 INFO [train.py:996] (1/4) Epoch 6, batch 27200, loss[loss=0.3025, simple_loss=0.3674, pruned_loss=0.1188, over 21375.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3355, pruned_loss=0.08847, over 4278530.83 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:34:07,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1078038.0, ans=0.125 2023-06-21 22:34:15,785 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 3.235e+02 3.777e+02 4.284e+02 9.441e+02, threshold=7.555e+02, percent-clipped=8.0 2023-06-21 22:34:22,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1078098.0, ans=0.0 2023-06-21 22:34:37,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1078158.0, ans=0.2 2023-06-21 22:35:28,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-21 22:35:30,909 INFO [train.py:996] (1/4) Epoch 6, batch 27250, loss[loss=0.3216, simple_loss=0.3754, pruned_loss=0.1339, over 21223.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3375, pruned_loss=0.09243, over 4281112.92 frames. ], batch size: 143, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:35:37,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1078338.0, ans=0.125 2023-06-21 22:36:41,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1078518.0, ans=0.125 2023-06-21 22:36:58,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1078578.0, ans=10.0 2023-06-21 22:37:06,679 INFO [train.py:996] (1/4) Epoch 6, batch 27300, loss[loss=0.2327, simple_loss=0.3192, pruned_loss=0.07313, over 21761.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3392, pruned_loss=0.09389, over 4275607.53 frames. 
], batch size: 247, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:37:30,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1078698.0, ans=0.125 2023-06-21 22:37:30,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1078698.0, ans=0.125 2023-06-21 22:37:36,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.091e+02 3.407e+02 3.961e+02 5.625e+02, threshold=6.815e+02, percent-clipped=0.0 2023-06-21 22:37:54,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1078758.0, ans=0.125 2023-06-21 22:38:16,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-21 22:38:33,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1078878.0, ans=0.5 2023-06-21 22:38:45,638 INFO [train.py:996] (1/4) Epoch 6, batch 27350, loss[loss=0.2655, simple_loss=0.3471, pruned_loss=0.09198, over 21337.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3425, pruned_loss=0.09467, over 4270902.61 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:38:46,227 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:38:52,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1078938.0, ans=0.0 2023-06-21 22:40:04,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1079178.0, ans=0.125 2023-06-21 22:40:17,994 INFO [train.py:996] (1/4) Epoch 6, batch 27400, loss[loss=0.2148, simple_loss=0.2783, pruned_loss=0.07567, over 21546.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3369, pruned_loss=0.09382, over 4272032.22 frames. ], batch size: 230, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:40:23,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1079238.0, ans=0.125 2023-06-21 22:40:43,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 2.934e+02 3.230e+02 3.710e+02 5.363e+02, threshold=6.461e+02, percent-clipped=0.0 2023-06-21 22:41:05,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1079358.0, ans=0.2 2023-06-21 22:41:22,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1079418.0, ans=0.07 2023-06-21 22:41:51,777 INFO [train.py:996] (1/4) Epoch 6, batch 27450, loss[loss=0.2314, simple_loss=0.3109, pruned_loss=0.07594, over 20063.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3311, pruned_loss=0.09201, over 4265717.70 frames. 
], batch size: 702, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:42:09,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1079598.0, ans=0.125 2023-06-21 22:42:10,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1079598.0, ans=0.125 2023-06-21 22:42:38,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1079658.0, ans=0.05 2023-06-21 22:42:42,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1079658.0, ans=0.0 2023-06-21 22:43:03,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 22:43:24,728 INFO [train.py:996] (1/4) Epoch 6, batch 27500, loss[loss=0.2295, simple_loss=0.3006, pruned_loss=0.07915, over 21898.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3294, pruned_loss=0.09299, over 4272260.07 frames. ], batch size: 371, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:43:46,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1079898.0, ans=0.125 2023-06-21 22:43:49,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 22:43:50,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 2.999e+02 3.729e+02 4.399e+02 9.645e+02, threshold=7.458e+02, percent-clipped=3.0 2023-06-21 22:44:03,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1079958.0, ans=0.0 2023-06-21 22:44:42,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1080018.0, ans=0.125 2023-06-21 22:44:59,244 INFO [train.py:996] (1/4) Epoch 6, batch 27550, loss[loss=0.2583, simple_loss=0.3042, pruned_loss=0.1062, over 21426.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3253, pruned_loss=0.09111, over 4277558.45 frames. ], batch size: 508, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:45:07,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1080138.0, ans=0.125 2023-06-21 22:45:24,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1080198.0, ans=0.1 2023-06-21 22:45:36,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1080198.0, ans=0.125 2023-06-21 22:46:14,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-21 22:46:17,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1080378.0, ans=0.125 2023-06-21 22:46:37,808 INFO [train.py:996] (1/4) Epoch 6, batch 27600, loss[loss=0.2367, simple_loss=0.2939, pruned_loss=0.08978, over 21358.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3181, pruned_loss=0.08933, over 4279918.75 frames. 
], batch size: 194, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:46:44,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 22:46:55,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1080498.0, ans=0.1 2023-06-21 22:46:56,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1080498.0, ans=0.0 2023-06-21 22:46:58,586 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.844e+02 3.346e+02 3.964e+02 7.072e+02, threshold=6.692e+02, percent-clipped=0.0 2023-06-21 22:47:12,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1080558.0, ans=0.1 2023-06-21 22:47:15,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1080558.0, ans=0.1 2023-06-21 22:47:32,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1080618.0, ans=0.125 2023-06-21 22:47:36,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1080618.0, ans=0.125 2023-06-21 22:47:53,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-21 22:48:00,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1080678.0, ans=0.0 2023-06-21 22:48:01,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-21 22:48:06,209 INFO [train.py:996] (1/4) Epoch 6, batch 27650, loss[loss=0.2195, simple_loss=0.3084, pruned_loss=0.06532, over 21318.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3116, pruned_loss=0.088, over 4266391.06 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:48:30,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. limit=15.0 2023-06-21 22:48:52,936 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:48:58,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.75 vs. limit=22.5 2023-06-21 22:49:08,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1080918.0, ans=15.0 2023-06-21 22:49:31,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-21 22:49:37,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1080978.0, ans=0.125 2023-06-21 22:49:44,268 INFO [train.py:996] (1/4) Epoch 6, batch 27700, loss[loss=0.2875, simple_loss=0.3647, pruned_loss=0.1051, over 21720.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3111, pruned_loss=0.08615, over 4272649.70 frames. 
], batch size: 351, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:49:57,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-21 22:50:05,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.845e+02 3.268e+02 3.924e+02 7.341e+02, threshold=6.535e+02, percent-clipped=1.0 2023-06-21 22:51:13,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1081338.0, ans=0.125 2023-06-21 22:51:18,672 INFO [train.py:996] (1/4) Epoch 6, batch 27750, loss[loss=0.2662, simple_loss=0.3449, pruned_loss=0.0937, over 21726.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3142, pruned_loss=0.08593, over 4275346.93 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:51:23,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1081338.0, ans=0.0 2023-06-21 22:51:35,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1081398.0, ans=0.125 2023-06-21 22:52:18,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=15.0 2023-06-21 22:52:46,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1081638.0, ans=0.125 2023-06-21 22:52:51,593 INFO [train.py:996] (1/4) Epoch 6, batch 27800, loss[loss=0.2497, simple_loss=0.3146, pruned_loss=0.09241, over 21904.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3133, pruned_loss=0.08634, over 4275593.80 frames. ], batch size: 351, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:53:11,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1081698.0, ans=0.125 2023-06-21 22:53:12,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.907e+02 3.249e+02 3.877e+02 6.679e+02, threshold=6.497e+02, percent-clipped=1.0 2023-06-21 22:54:25,870 INFO [train.py:996] (1/4) Epoch 6, batch 27850, loss[loss=0.2338, simple_loss=0.3496, pruned_loss=0.05897, over 19732.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3135, pruned_loss=0.08675, over 4281186.79 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:54:49,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1081998.0, ans=0.0 2023-06-21 22:55:04,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-21 22:55:57,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1082178.0, ans=0.125 2023-06-21 22:55:58,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082178.0, ans=0.125 2023-06-21 22:56:01,480 INFO [train.py:996] (1/4) Epoch 6, batch 27900, loss[loss=0.2238, simple_loss=0.3124, pruned_loss=0.06759, over 21447.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3225, pruned_loss=0.08799, over 4279104.67 frames. 
], batch size: 211, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:56:03,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1082238.0, ans=0.125 2023-06-21 22:56:27,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.965e+02 3.401e+02 4.272e+02 8.717e+02, threshold=6.802e+02, percent-clipped=4.0 2023-06-21 22:56:40,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1082358.0, ans=0.125 2023-06-21 22:57:00,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-21 22:57:09,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1082418.0, ans=0.125 2023-06-21 22:57:24,961 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-06-21 22:57:42,211 INFO [train.py:996] (1/4) Epoch 6, batch 27950, loss[loss=0.1761, simple_loss=0.2607, pruned_loss=0.04577, over 21477.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3205, pruned_loss=0.08415, over 4278384.41 frames. ], batch size: 212, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:58:26,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1082658.0, ans=0.04949747468305833 2023-06-21 22:58:32,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082658.0, ans=0.1 2023-06-21 22:58:34,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1082658.0, ans=0.2 2023-06-21 22:58:46,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1082718.0, ans=0.2 2023-06-21 22:59:15,427 INFO [train.py:996] (1/4) Epoch 6, batch 28000, loss[loss=0.2572, simple_loss=0.3115, pruned_loss=0.1014, over 21284.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3184, pruned_loss=0.08201, over 4283625.55 frames. ], batch size: 176, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:59:15,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1082838.0, ans=0.0 2023-06-21 22:59:43,149 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.927e+02 3.364e+02 4.265e+02 7.771e+02, threshold=6.727e+02, percent-clipped=2.0 2023-06-21 22:59:55,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1082958.0, ans=0.2 2023-06-21 23:00:28,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. limit=6.0 2023-06-21 23:00:34,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1083078.0, ans=0.125 2023-06-21 23:00:37,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083078.0, ans=0.125 2023-06-21 23:00:50,700 INFO [train.py:996] (1/4) Epoch 6, batch 28050, loss[loss=0.2528, simple_loss=0.3214, pruned_loss=0.0921, over 21549.00 frames. 
], tot_loss[loss=0.2428, simple_loss=0.3179, pruned_loss=0.08379, over 4286585.97 frames. ], batch size: 548, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:01:35,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-21 23:01:37,575 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.84 vs. limit=22.5 2023-06-21 23:01:41,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1083258.0, ans=0.125 2023-06-21 23:01:50,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1083318.0, ans=0.125 2023-06-21 23:01:56,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1083318.0, ans=0.0 2023-06-21 23:02:10,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083378.0, ans=0.125 2023-06-21 23:02:29,344 INFO [train.py:996] (1/4) Epoch 6, batch 28100, loss[loss=0.2237, simple_loss=0.2792, pruned_loss=0.08413, over 21175.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3148, pruned_loss=0.08359, over 4274057.83 frames. ], batch size: 176, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:03:00,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.119e+02 3.918e+02 4.692e+02 8.833e+02, threshold=7.836e+02, percent-clipped=5.0 2023-06-21 23:03:29,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1083618.0, ans=0.125 2023-06-21 23:04:02,318 INFO [train.py:996] (1/4) Epoch 6, batch 28150, loss[loss=0.2191, simple_loss=0.277, pruned_loss=0.08061, over 21244.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3094, pruned_loss=0.08332, over 4272313.30 frames. ], batch size: 144, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:04:33,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-21 23:04:35,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1083798.0, ans=0.0 2023-06-21 23:04:38,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1083798.0, ans=0.0 2023-06-21 23:05:12,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1083918.0, ans=0.125 2023-06-21 23:05:40,574 INFO [train.py:996] (1/4) Epoch 6, batch 28200, loss[loss=0.1962, simple_loss=0.2448, pruned_loss=0.07378, over 20771.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3066, pruned_loss=0.08522, over 4276953.89 frames. ], batch size: 608, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:05:41,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. 
limit=22.5 2023-06-21 23:06:07,618 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.112e+02 3.798e+02 4.464e+02 8.953e+02, threshold=7.596e+02, percent-clipped=1.0 2023-06-21 23:06:21,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1084158.0, ans=0.2 2023-06-21 23:07:14,366 INFO [train.py:996] (1/4) Epoch 6, batch 28250, loss[loss=0.2209, simple_loss=0.3, pruned_loss=0.07095, over 16123.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3102, pruned_loss=0.08865, over 4271558.11 frames. ], batch size: 60, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:07:30,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1084338.0, ans=0.125 2023-06-21 23:08:11,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1084518.0, ans=0.09899494936611666 2023-06-21 23:08:13,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1084518.0, ans=0.2 2023-06-21 23:08:53,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084638.0, ans=0.1 2023-06-21 23:08:54,449 INFO [train.py:996] (1/4) Epoch 6, batch 28300, loss[loss=0.1958, simple_loss=0.2872, pruned_loss=0.05217, over 21642.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3081, pruned_loss=0.08616, over 4265450.64 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:08:55,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1084638.0, ans=15.0 2023-06-21 23:09:01,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1084638.0, ans=0.125 2023-06-21 23:09:17,327 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.236e+02 3.708e+02 8.201e+02, threshold=6.472e+02, percent-clipped=2.0 2023-06-21 23:09:17,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1084698.0, ans=0.07 2023-06-21 23:09:40,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1084758.0, ans=0.0 2023-06-21 23:09:53,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-21 23:10:09,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1084878.0, ans=0.125 2023-06-21 23:10:22,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1084878.0, ans=0.0 2023-06-21 23:10:25,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1084878.0, ans=0.125 2023-06-21 23:10:27,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1084938.0, ans=0.0 2023-06-21 23:10:28,257 INFO [train.py:996] (1/4) Epoch 6, batch 28350, loss[loss=0.1612, simple_loss=0.2354, pruned_loss=0.04354, over 21130.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3041, pruned_loss=0.08138, over 4256402.50 frames. 
], batch size: 143, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:10:37,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1084938.0, ans=0.125 2023-06-21 23:10:43,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1084998.0, ans=0.1 2023-06-21 23:10:53,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-21 23:10:54,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084998.0, ans=0.1 2023-06-21 23:11:53,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085178.0, ans=0.1 2023-06-21 23:12:03,501 INFO [train.py:996] (1/4) Epoch 6, batch 28400, loss[loss=0.2168, simple_loss=0.2884, pruned_loss=0.07265, over 21636.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.301, pruned_loss=0.08146, over 4264776.89 frames. ], batch size: 298, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:12:16,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1085238.0, ans=0.2 2023-06-21 23:12:25,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1085298.0, ans=0.0 2023-06-21 23:12:26,109 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.685e+02 3.251e+02 3.858e+02 5.974e+02, threshold=6.502e+02, percent-clipped=0.0 2023-06-21 23:12:52,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1085358.0, ans=0.125 2023-06-21 23:13:00,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1085358.0, ans=0.125 2023-06-21 23:13:37,438 INFO [train.py:996] (1/4) Epoch 6, batch 28450, loss[loss=0.2567, simple_loss=0.3189, pruned_loss=0.0972, over 21949.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3062, pruned_loss=0.08563, over 4271674.92 frames. ], batch size: 316, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:13:51,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085598.0, ans=0.1 2023-06-21 23:14:33,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085658.0, ans=0.1 2023-06-21 23:14:59,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1085778.0, ans=0.125 2023-06-21 23:15:06,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1085778.0, ans=0.0 2023-06-21 23:15:10,659 INFO [train.py:996] (1/4) Epoch 6, batch 28500, loss[loss=0.2616, simple_loss=0.3237, pruned_loss=0.09974, over 21320.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3097, pruned_loss=0.08804, over 4280470.61 frames. 
], batch size: 176, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:15:38,219 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.148e+02 3.464e+02 4.022e+02 7.400e+02, threshold=6.927e+02, percent-clipped=1.0 2023-06-21 23:16:23,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1086018.0, ans=0.2 2023-06-21 23:16:26,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1086018.0, ans=0.2 2023-06-21 23:16:41,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086078.0, ans=0.1 2023-06-21 23:16:44,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1086138.0, ans=0.125 2023-06-21 23:16:45,730 INFO [train.py:996] (1/4) Epoch 6, batch 28550, loss[loss=0.2723, simple_loss=0.3651, pruned_loss=0.08976, over 21764.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3195, pruned_loss=0.09141, over 4285913.59 frames. ], batch size: 351, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:17:32,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1086258.0, ans=0.125 2023-06-21 23:17:55,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1086318.0, ans=0.1 2023-06-21 23:18:00,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-21 23:18:10,730 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:18:19,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1086438.0, ans=0.2 2023-06-21 23:18:20,980 INFO [train.py:996] (1/4) Epoch 6, batch 28600, loss[loss=0.2537, simple_loss=0.3236, pruned_loss=0.09187, over 21862.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3258, pruned_loss=0.09321, over 4287582.38 frames. ], batch size: 372, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:18:58,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.626e+02 3.164e+02 3.571e+02 4.573e+02 8.343e+02, threshold=7.141e+02, percent-clipped=3.0 2023-06-21 23:19:05,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-21 23:19:16,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-21 23:19:20,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-21 23:19:23,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1086618.0, ans=0.125 2023-06-21 23:19:30,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. 
limit=15.0 2023-06-21 23:19:41,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1086678.0, ans=0.09899494936611666 2023-06-21 23:19:47,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1086678.0, ans=0.2 2023-06-21 23:19:51,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1086678.0, ans=0.0 2023-06-21 23:19:59,198 INFO [train.py:996] (1/4) Epoch 6, batch 28650, loss[loss=0.2355, simple_loss=0.2976, pruned_loss=0.0867, over 21674.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3226, pruned_loss=0.09273, over 4288059.78 frames. ], batch size: 333, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:20:28,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1086798.0, ans=0.125 2023-06-21 23:20:45,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1086858.0, ans=0.125 2023-06-21 23:21:11,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086918.0, ans=0.125 2023-06-21 23:21:38,629 INFO [train.py:996] (1/4) Epoch 6, batch 28700, loss[loss=0.2594, simple_loss=0.3277, pruned_loss=0.09558, over 21864.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3218, pruned_loss=0.09387, over 4292222.62 frames. ], batch size: 371, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:21:40,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1087038.0, ans=0.125 2023-06-21 23:22:07,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.119e+02 3.496e+02 4.060e+02 9.079e+02, threshold=6.992e+02, percent-clipped=1.0 2023-06-21 23:22:26,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1087158.0, ans=0.0 2023-06-21 23:22:57,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1087278.0, ans=0.125 2023-06-21 23:22:58,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1087278.0, ans=0.125 2023-06-21 23:23:09,077 INFO [train.py:996] (1/4) Epoch 6, batch 28750, loss[loss=0.2425, simple_loss=0.3331, pruned_loss=0.07594, over 21683.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3223, pruned_loss=0.09367, over 4284515.71 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:24:00,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1087458.0, ans=0.04949747468305833 2023-06-21 23:24:39,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1087578.0, ans=0.2 2023-06-21 23:24:42,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1087638.0, ans=0.09899494936611666 2023-06-21 23:24:43,701 INFO [train.py:996] (1/4) Epoch 6, batch 28800, loss[loss=0.306, simple_loss=0.3732, pruned_loss=0.1194, over 21826.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.325, pruned_loss=0.09397, over 4290038.63 frames. 
], batch size: 124, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:24:54,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1087638.0, ans=0.125 2023-06-21 23:25:16,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.957e+02 3.291e+02 3.824e+02 6.486e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 23:25:23,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-21 23:25:53,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 23:26:21,869 INFO [train.py:996] (1/4) Epoch 6, batch 28850, loss[loss=0.3854, simple_loss=0.4879, pruned_loss=0.1415, over 19656.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3274, pruned_loss=0.09561, over 4289772.16 frames. ], batch size: 702, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:26:39,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-21 23:26:43,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1087998.0, ans=0.125 2023-06-21 23:27:38,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088178.0, ans=0.1 2023-06-21 23:28:01,562 INFO [train.py:996] (1/4) Epoch 6, batch 28900, loss[loss=0.2409, simple_loss=0.3215, pruned_loss=0.08015, over 21366.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3311, pruned_loss=0.09754, over 4291575.59 frames. ], batch size: 548, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:28:22,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1088298.0, ans=0.1 2023-06-21 23:28:26,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.260e+02 3.561e+02 4.329e+02 7.781e+02, threshold=7.122e+02, percent-clipped=1.0 2023-06-21 23:28:26,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1088298.0, ans=0.125 2023-06-21 23:29:11,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-21 23:29:38,521 INFO [train.py:996] (1/4) Epoch 6, batch 28950, loss[loss=0.2499, simple_loss=0.3208, pruned_loss=0.08946, over 21734.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3313, pruned_loss=0.09721, over 4288449.64 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:31:09,135 INFO [train.py:996] (1/4) Epoch 6, batch 29000, loss[loss=0.3328, simple_loss=0.39, pruned_loss=0.1378, over 21348.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3327, pruned_loss=0.09565, over 4280186.98 frames. 
], batch size: 507, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:31:39,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1088898.0, ans=0.125 2023-06-21 23:31:43,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.241e+02 3.719e+02 4.877e+02 7.775e+02, threshold=7.438e+02, percent-clipped=3.0 2023-06-21 23:32:16,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-21 23:32:19,990 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:32:42,124 INFO [train.py:996] (1/4) Epoch 6, batch 29050, loss[loss=0.247, simple_loss=0.3067, pruned_loss=0.09358, over 21361.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3311, pruned_loss=0.09555, over 4286604.26 frames. ], batch size: 176, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:33:37,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1089258.0, ans=0.125 2023-06-21 23:33:44,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-21 23:34:15,076 INFO [train.py:996] (1/4) Epoch 6, batch 29100, loss[loss=0.2475, simple_loss=0.3019, pruned_loss=0.09659, over 21521.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3224, pruned_loss=0.09354, over 4283886.36 frames. ], batch size: 391, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:34:15,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1089438.0, ans=0.1 2023-06-21 23:34:49,732 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.936e+02 3.266e+02 3.955e+02 6.605e+02, threshold=6.533e+02, percent-clipped=0.0 2023-06-21 23:35:01,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-21 23:35:43,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-21 23:35:48,085 INFO [train.py:996] (1/4) Epoch 6, batch 29150, loss[loss=0.2651, simple_loss=0.3275, pruned_loss=0.1014, over 21550.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3202, pruned_loss=0.09136, over 4278065.81 frames. ], batch size: 230, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:35:50,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1089738.0, ans=0.95 2023-06-21 23:36:39,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-21 23:37:12,323 INFO [train.py:996] (1/4) Epoch 6, batch 29200, loss[loss=0.1967, simple_loss=0.2632, pruned_loss=0.06513, over 21567.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3154, pruned_loss=0.09052, over 4274674.69 frames. ], batch size: 231, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:37:40,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. 
limit=22.5 2023-06-21 23:37:43,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1090098.0, ans=0.2 2023-06-21 23:37:47,286 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.885e+02 3.375e+02 4.210e+02 7.193e+02, threshold=6.750e+02, percent-clipped=2.0 2023-06-21 23:38:30,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-21 23:38:55,718 INFO [train.py:996] (1/4) Epoch 6, batch 29250, loss[loss=0.2405, simple_loss=0.3323, pruned_loss=0.07433, over 20805.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3127, pruned_loss=0.08721, over 4274974.27 frames. ], batch size: 608, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:39:58,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1090518.0, ans=0.0 2023-06-21 23:40:29,841 INFO [train.py:996] (1/4) Epoch 6, batch 29300, loss[loss=0.2105, simple_loss=0.2885, pruned_loss=0.06626, over 21265.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.315, pruned_loss=0.08621, over 4273694.20 frames. ], batch size: 549, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:40:59,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1090698.0, ans=0.125 2023-06-21 23:41:00,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.989e+02 3.718e+02 4.652e+02 8.892e+02, threshold=7.436e+02, percent-clipped=6.0 2023-06-21 23:41:25,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1090818.0, ans=0.125 2023-06-21 23:41:33,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1090818.0, ans=0.0 2023-06-21 23:41:53,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1090878.0, ans=0.125 2023-06-21 23:42:00,600 INFO [train.py:996] (1/4) Epoch 6, batch 29350, loss[loss=0.2505, simple_loss=0.3364, pruned_loss=0.08235, over 21532.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3107, pruned_loss=0.08506, over 4281211.44 frames. ], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:42:10,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1090938.0, ans=0.0 2023-06-21 23:42:13,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1090938.0, ans=0.2 2023-06-21 23:43:21,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-21 23:43:32,542 INFO [train.py:996] (1/4) Epoch 6, batch 29400, loss[loss=0.1781, simple_loss=0.2314, pruned_loss=0.06238, over 21169.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3113, pruned_loss=0.08349, over 4283518.69 frames. 
], batch size: 159, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:43:40,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1091238.0, ans=0.125 2023-06-21 23:43:56,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1091298.0, ans=0.0 2023-06-21 23:44:03,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.776e+02 3.211e+02 3.938e+02 7.454e+02, threshold=6.422e+02, percent-clipped=1.0 2023-06-21 23:44:44,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1091478.0, ans=0.0 2023-06-21 23:44:45,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091478.0, ans=0.1 2023-06-21 23:44:58,839 INFO [train.py:996] (1/4) Epoch 6, batch 29450, loss[loss=0.3442, simple_loss=0.413, pruned_loss=0.1377, over 21838.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3103, pruned_loss=0.08287, over 4279082.93 frames. ], batch size: 124, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:44:59,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1091538.0, ans=0.125 2023-06-21 23:45:11,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1091538.0, ans=0.0 2023-06-21 23:45:32,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1091658.0, ans=0.0 2023-06-21 23:45:35,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1091658.0, ans=0.0 2023-06-21 23:46:16,947 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:46:26,836 INFO [train.py:996] (1/4) Epoch 6, batch 29500, loss[loss=0.2881, simple_loss=0.3439, pruned_loss=0.1162, over 21596.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3158, pruned_loss=0.08723, over 4283140.47 frames. ], batch size: 471, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:46:35,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1091838.0, ans=0.125 2023-06-21 23:47:01,554 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 2.999e+02 3.395e+02 3.971e+02 6.244e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-21 23:47:03,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1091898.0, ans=0.07 2023-06-21 23:47:21,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1092018.0, ans=0.125 2023-06-21 23:47:46,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1092078.0, ans=0.125 2023-06-21 23:48:05,514 INFO [train.py:996] (1/4) Epoch 6, batch 29550, loss[loss=0.2363, simple_loss=0.2998, pruned_loss=0.08641, over 21870.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3156, pruned_loss=0.08909, over 4289017.43 frames. 
], batch size: 298, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:48:34,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-21 23:49:02,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1092318.0, ans=0.0 2023-06-21 23:49:44,862 INFO [train.py:996] (1/4) Epoch 6, batch 29600, loss[loss=0.2528, simple_loss=0.3288, pruned_loss=0.08844, over 21397.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3225, pruned_loss=0.09203, over 4293236.31 frames. ], batch size: 211, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:50:04,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1092498.0, ans=10.0 2023-06-21 23:50:08,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-21 23:50:11,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.056e+02 3.521e+02 4.458e+02 7.696e+02, threshold=7.042e+02, percent-clipped=3.0 2023-06-21 23:50:41,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1092618.0, ans=0.2 2023-06-21 23:50:51,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1092678.0, ans=0.0 2023-06-21 23:51:13,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1092678.0, ans=0.125 2023-06-21 23:51:17,497 INFO [train.py:996] (1/4) Epoch 6, batch 29650, loss[loss=0.2397, simple_loss=0.3068, pruned_loss=0.08631, over 21786.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3193, pruned_loss=0.08815, over 4293859.46 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:51:23,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1092738.0, ans=0.125 2023-06-21 23:52:50,736 INFO [train.py:996] (1/4) Epoch 6, batch 29700, loss[loss=0.2425, simple_loss=0.3609, pruned_loss=0.06209, over 19779.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3193, pruned_loss=0.0873, over 4287224.99 frames. ], batch size: 702, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:52:51,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-21 23:53:17,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=22.5 2023-06-21 23:53:19,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.737e+02 3.024e+02 3.717e+02 5.941e+02, threshold=6.048e+02, percent-clipped=0.0 2023-06-21 23:53:46,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1093218.0, ans=0.0 2023-06-21 23:54:24,076 INFO [train.py:996] (1/4) Epoch 6, batch 29750, loss[loss=0.2116, simple_loss=0.2843, pruned_loss=0.0695, over 16509.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3243, pruned_loss=0.08669, over 4282262.92 frames. 
], batch size: 60, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:54:36,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1093338.0, ans=0.2 2023-06-21 23:55:27,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1093518.0, ans=0.125 2023-06-21 23:55:33,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093578.0, ans=0.1 2023-06-21 23:55:33,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-21 23:55:56,888 INFO [train.py:996] (1/4) Epoch 6, batch 29800, loss[loss=0.2255, simple_loss=0.2972, pruned_loss=0.07688, over 21482.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3259, pruned_loss=0.08783, over 4285411.41 frames. ], batch size: 211, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:56:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1093638.0, ans=0.0 2023-06-21 23:56:25,355 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.711e+02 3.022e+02 3.723e+02 5.120e+02, threshold=6.044e+02, percent-clipped=0.0 2023-06-21 23:56:34,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1093758.0, ans=0.2 2023-06-21 23:56:37,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1093758.0, ans=0.125 2023-06-21 23:57:03,567 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:57:29,964 INFO [train.py:996] (1/4) Epoch 6, batch 29850, loss[loss=0.2548, simple_loss=0.3227, pruned_loss=0.09344, over 21782.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3205, pruned_loss=0.0848, over 4286877.77 frames. ], batch size: 414, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:57:36,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1093938.0, ans=0.125 2023-06-21 23:58:04,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-21 23:58:29,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-21 23:58:39,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-21 23:58:51,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1094178.0, ans=0.125 2023-06-21 23:58:51,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1094178.0, ans=0.0 2023-06-21 23:58:57,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1094178.0, ans=0.125 2023-06-21 23:59:02,900 INFO [train.py:996] (1/4) Epoch 6, batch 29900, loss[loss=0.2443, simple_loss=0.3185, pruned_loss=0.0851, over 21319.00 frames. 
], tot_loss[loss=0.2461, simple_loss=0.3196, pruned_loss=0.08624, over 4290454.64 frames. ], batch size: 176, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:59:10,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1094238.0, ans=0.1 2023-06-21 23:59:20,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-21 23:59:36,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 3.118e+02 4.055e+02 5.716e+02 1.068e+03, threshold=8.110e+02, percent-clipped=21.0 2023-06-21 23:59:53,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1094358.0, ans=0.05 2023-06-22 00:00:30,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1094478.0, ans=0.125 2023-06-22 00:00:37,267 INFO [train.py:996] (1/4) Epoch 6, batch 29950, loss[loss=0.3006, simple_loss=0.3918, pruned_loss=0.1048, over 17886.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3237, pruned_loss=0.09058, over 4281463.40 frames. ], batch size: 62, lr: 4.89e-03, grad_scale: 8.0 2023-06-22 00:01:06,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1094598.0, ans=0.125 2023-06-22 00:01:07,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-22 00:01:14,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1094598.0, ans=0.035 2023-06-22 00:02:07,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1094778.0, ans=0.125 2023-06-22 00:02:11,809 INFO [train.py:996] (1/4) Epoch 6, batch 30000, loss[loss=0.247, simple_loss=0.3312, pruned_loss=0.08141, over 21791.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3263, pruned_loss=0.09111, over 4277402.28 frames. ], batch size: 282, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:02:11,810 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 00:02:29,659 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8814, 3.3282, 3.3025, 1.8053], device='cuda:1') 2023-06-22 00:02:30,093 INFO [train.py:1028] (1/4) Epoch 6, validation: loss=0.2467, simple_loss=0.3478, pruned_loss=0.07276, over 1796401.00 frames. 2023-06-22 00:02:30,094 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 00:03:14,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.660e+02 3.036e+02 3.460e+02 6.733e+02, threshold=6.073e+02, percent-clipped=0.0 2023-06-22 00:03:25,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1094958.0, ans=0.09899494936611666 2023-06-22 00:03:39,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-22 00:03:55,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. 
limit=6.0 2023-06-22 00:04:09,917 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:04:17,402 INFO [train.py:996] (1/4) Epoch 6, batch 30050, loss[loss=0.2779, simple_loss=0.3801, pruned_loss=0.08787, over 21842.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3307, pruned_loss=0.08852, over 4275083.33 frames. ], batch size: 371, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:04:51,183 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:05:11,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-22 00:05:14,372 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0 2023-06-22 00:05:33,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1095378.0, ans=0.125 2023-06-22 00:05:46,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1095378.0, ans=0.125 2023-06-22 00:05:46,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1095378.0, ans=0.2 2023-06-22 00:05:50,563 INFO [train.py:996] (1/4) Epoch 6, batch 30100, loss[loss=0.2298, simple_loss=0.2889, pruned_loss=0.08536, over 21582.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3287, pruned_loss=0.088, over 4272728.06 frames. ], batch size: 247, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:05:51,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095438.0, ans=0.1 2023-06-22 00:06:24,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.037e+02 3.799e+02 4.739e+02 8.498e+02, threshold=7.598e+02, percent-clipped=11.0 2023-06-22 00:06:27,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1095558.0, ans=0.0 2023-06-22 00:07:15,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1095678.0, ans=0.2 2023-06-22 00:07:25,728 INFO [train.py:996] (1/4) Epoch 6, batch 30150, loss[loss=0.3065, simple_loss=0.3704, pruned_loss=0.1213, over 21833.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3245, pruned_loss=0.08957, over 4265622.59 frames. ], batch size: 124, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:08:11,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1095858.0, ans=0.125 2023-06-22 00:08:21,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1095858.0, ans=0.125 2023-06-22 00:08:23,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1095858.0, ans=0.025 2023-06-22 00:09:07,450 INFO [train.py:996] (1/4) Epoch 6, batch 30200, loss[loss=0.2531, simple_loss=0.3228, pruned_loss=0.09174, over 21235.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3267, pruned_loss=0.08796, over 4271625.20 frames. 
], batch size: 176, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:09:46,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 3.001e+02 3.567e+02 4.107e+02 7.558e+02, threshold=7.134e+02, percent-clipped=0.0 2023-06-22 00:10:02,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1096158.0, ans=0.125 2023-06-22 00:10:06,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1096218.0, ans=0.125 2023-06-22 00:10:29,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-22 00:10:34,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1096278.0, ans=0.2 2023-06-22 00:10:43,117 INFO [train.py:996] (1/4) Epoch 6, batch 30250, loss[loss=0.3042, simple_loss=0.3818, pruned_loss=0.1133, over 19964.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3352, pruned_loss=0.0905, over 4271044.31 frames. ], batch size: 702, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:10:43,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1096338.0, ans=0.0 2023-06-22 00:11:32,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1096458.0, ans=0.0 2023-06-22 00:11:57,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-22 00:11:58,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1096518.0, ans=0.0 2023-06-22 00:12:17,394 INFO [train.py:996] (1/4) Epoch 6, batch 30300, loss[loss=0.2394, simple_loss=0.3029, pruned_loss=0.08798, over 21552.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3325, pruned_loss=0.09078, over 4257274.92 frames. ], batch size: 414, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:12:31,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1096638.0, ans=0.125 2023-06-22 00:12:43,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-22 00:12:45,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096698.0, ans=0.1 2023-06-22 00:12:59,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1096698.0, ans=0.0 2023-06-22 00:13:00,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.155e+02 3.767e+02 4.351e+02 8.059e+02, threshold=7.534e+02, percent-clipped=2.0 2023-06-22 00:13:02,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1096758.0, ans=0.0 2023-06-22 00:13:32,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-22 00:14:03,142 INFO [train.py:996] (1/4) Epoch 6, batch 30350, loss[loss=0.2397, simple_loss=0.3059, pruned_loss=0.08668, over 21489.00 frames. 
], tot_loss[loss=0.2599, simple_loss=0.3345, pruned_loss=0.09268, over 4264101.76 frames. ], batch size: 211, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:14:08,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-22 00:14:11,898 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-22 00:15:00,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1097118.0, ans=0.125 2023-06-22 00:15:21,231 INFO [train.py:996] (1/4) Epoch 6, batch 30400, loss[loss=0.2239, simple_loss=0.2744, pruned_loss=0.0867, over 20205.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3263, pruned_loss=0.09078, over 4255875.43 frames. ], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-22 00:15:28,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1097238.0, ans=0.0 2023-06-22 00:15:31,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-22 00:15:50,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.413e+02 4.097e+02 5.278e+02 1.616e+03, threshold=8.194e+02, percent-clipped=3.0 2023-06-22 00:16:21,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1097478.0, ans=0.125 2023-06-22 00:16:30,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=12.0 2023-06-22 00:16:34,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1097478.0, ans=0.07 2023-06-22 00:16:39,256 INFO [train.py:996] (1/4) Epoch 6, batch 30450, loss[loss=0.2942, simple_loss=0.4168, pruned_loss=0.08585, over 19909.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3283, pruned_loss=0.09041, over 4197336.21 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:16:53,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1097598.0, ans=0.0 2023-06-22 00:17:04,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-22 00:17:06,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1097598.0, ans=0.125 2023-06-22 00:17:27,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-06-22 00:19:20,717 INFO [train.py:996] (1/4) Epoch 7, batch 0, loss[loss=0.2371, simple_loss=0.297, pruned_loss=0.08861, over 21472.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.297, pruned_loss=0.08861, over 21472.00 frames. ], batch size: 195, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:19:20,717 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 00:19:38,894 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2422, simple_loss=0.3486, pruned_loss=0.06787, over 1796401.00 frames. 
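The batch summaries (train.py:996), validation summaries (train.py:1028) and grad-norm quartile lines (optim.py:471) in the records above all follow a fixed format, so the loss curves, learning-rate schedule and clipping statistics can be recovered from the raw log with a short script. What follows is only a minimal sketch and is not part of icefall: the regular expressions are written against the record formats visible in this log, the helper name extract_curves is made up for illustration, and other icefall versions may format these records differently.

#!/usr/bin/env python3
"""Minimal sketch for mining curves out of a zipformer training log like the one above.

Not part of icefall: the regexes target only the three record formats visible
in this log (train.py batch summaries, train.py validation summaries,
optim.py grad-norm quartile lines); extract_curves is an illustrative name."""
import re
import sys

# "Epoch 6, batch 27750, loss[...], tot_loss[loss=0.243, simple_loss=0.3142,
#  pruned_loss=0.08593, over 4275346.93 frames. ], batch size: 441, lr: 4.92e-03, ..."
TRAIN_RE = re.compile(
    r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?"
    r"tot_loss\[loss=(?P<loss>[\d.]+), simple_loss=[\d.]+, "
    r"pruned_loss=[\d.]+, over [\d.]+ frames\. \], "
    r"batch size: \d+, lr: (?P<lr>[\d.e+-]+)"
)

# "Epoch 6, validation: loss=0.2467, simple_loss=0.3478, pruned_loss=0.07276, over ... frames."
VALID_RE = re.compile(r"Epoch (?P<epoch>\d+), validation: loss=(?P<loss>[\d.]+)")

# "grad-norm quartiles 2.051e+02 2.845e+02 3.268e+02 3.924e+02 7.341e+02,
#  threshold=6.535e+02, percent-clipped=1.0"
CLIP_RE = re.compile(
    r"grad-norm quartiles (?P<q>(?:[\d.]+e[+-]\d+ ?){5}), "
    r"threshold=(?P<thr>[\d.]+e[+-]\d+), percent-clipped=(?P<pct>[\d.]+)"
)

def extract_curves(text):
    """Return (train, valid, clip) lists parsed from the raw log text."""
    # The log shown here is hard-wrapped mid-record, so collapse all whitespace
    # first; on an unwrapped log (one record per line) this is harmless.
    text = re.sub(r"\s+", " ", text)
    train = [(int(m["epoch"]), int(m["batch"]), float(m["loss"]), float(m["lr"]))
             for m in TRAIN_RE.finditer(text)]
    valid = [(int(m["epoch"]), float(m["loss"])) for m in VALID_RE.finditer(text)]
    clip = [([float(x) for x in m["q"].split()], float(m["thr"]), float(m["pct"]))
            for m in CLIP_RE.finditer(text)]
    return train, valid, clip

if __name__ == "__main__":
    raw = open(sys.argv[1], errors="replace").read() if len(sys.argv) > 1 else sys.stdin.read()
    train, valid, clip = extract_curves(raw)
    for epoch, batch, loss, lr in train[-5:]:
        print(f"epoch {epoch} batch {batch}: tot_loss={loss:.4f} lr={lr:.2e}")
    for epoch, loss in valid:
        print(f"epoch {epoch} validation: loss={loss:.4f}")
    if clip:
        quartiles, thr, pct = clip[-1]
        print(f"last grad-norm quartiles {quartiles}, threshold={thr:.3e}, clipped={pct}%")

Run as, for example, python parse_log.py exp/train.log (both names hypothetical) to print the last few training-batch summaries, every validation loss, and the most recent grad-norm quartile record.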
2023-06-22 00:19:38,894 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 00:19:45,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1097802.0, ans=0.2 2023-06-22 00:19:47,656 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.73 vs. limit=15.0 2023-06-22 00:19:58,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097862.0, ans=0.1 2023-06-22 00:20:20,154 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.17 vs. limit=12.0 2023-06-22 00:20:23,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-22 00:20:26,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 4.648e+02 5.934e+02 9.527e+02 2.892e+03, threshold=1.187e+03, percent-clipped=31.0 2023-06-22 00:20:36,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-22 00:20:40,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1097982.0, ans=0.125 2023-06-22 00:20:41,184 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-22 00:21:07,549 INFO [train.py:996] (1/4) Epoch 7, batch 50, loss[loss=0.2874, simple_loss=0.3872, pruned_loss=0.09373, over 21730.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3319, pruned_loss=0.09179, over 956469.96 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:21:56,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-22 00:22:43,754 INFO [train.py:996] (1/4) Epoch 7, batch 100, loss[loss=0.2649, simple_loss=0.3324, pruned_loss=0.09873, over 21468.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3447, pruned_loss=0.09377, over 1688190.78 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:23:16,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-22 00:23:37,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.827e+02 3.336e+02 3.937e+02 6.913e+02, threshold=6.673e+02, percent-clipped=0.0 2023-06-22 00:24:19,893 INFO [train.py:996] (1/4) Epoch 7, batch 150, loss[loss=0.1931, simple_loss=0.2533, pruned_loss=0.06644, over 15264.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.3434, pruned_loss=0.09071, over 2259468.71 frames. 
], batch size: 60, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:24:34,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1098702.0, ans=0.2 2023-06-22 00:25:10,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1098822.0, ans=0.0 2023-06-22 00:25:20,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1098822.0, ans=0.0 2023-06-22 00:25:32,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1098882.0, ans=0.125 2023-06-22 00:25:49,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1098942.0, ans=0.1 2023-06-22 00:25:52,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-22 00:25:58,212 INFO [train.py:996] (1/4) Epoch 7, batch 200, loss[loss=0.1998, simple_loss=0.2834, pruned_loss=0.05805, over 21451.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3408, pruned_loss=0.09122, over 2708899.70 frames. ], batch size: 195, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:26:14,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-22 00:26:56,358 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.932e+02 3.409e+02 3.929e+02 8.481e+02, threshold=6.818e+02, percent-clipped=3.0 2023-06-22 00:26:58,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1099122.0, ans=0.125 2023-06-22 00:27:32,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-22 00:27:36,475 INFO [train.py:996] (1/4) Epoch 7, batch 250, loss[loss=0.2335, simple_loss=0.292, pruned_loss=0.08749, over 21476.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3375, pruned_loss=0.09092, over 3057594.54 frames. ], batch size: 194, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:27:36,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1099302.0, ans=0.0 2023-06-22 00:27:41,333 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:27:49,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-22 00:28:00,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1099362.0, ans=0.125 2023-06-22 00:28:17,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1099422.0, ans=0.05 2023-06-22 00:28:30,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1099422.0, ans=0.0 2023-06-22 00:29:14,502 INFO [train.py:996] (1/4) Epoch 7, batch 300, loss[loss=0.2353, simple_loss=0.2877, pruned_loss=0.09143, over 21292.00 frames. 
], tot_loss[loss=0.2563, simple_loss=0.3319, pruned_loss=0.09031, over 3327769.72 frames. ], batch size: 159, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:29:16,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1099602.0, ans=0.0 2023-06-22 00:29:29,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1099602.0, ans=0.125 2023-06-22 00:29:32,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1099602.0, ans=0.2 2023-06-22 00:29:36,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1099662.0, ans=0.125 2023-06-22 00:30:11,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.968e+02 3.407e+02 3.987e+02 5.179e+02, threshold=6.813e+02, percent-clipped=0.0 2023-06-22 00:30:33,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-22 00:30:52,930 INFO [train.py:996] (1/4) Epoch 7, batch 350, loss[loss=0.2269, simple_loss=0.3009, pruned_loss=0.07647, over 21639.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3256, pruned_loss=0.08864, over 3539646.74 frames. ], batch size: 415, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:31:59,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1100082.0, ans=0.0 2023-06-22 00:32:17,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100142.0, ans=0.1 2023-06-22 00:32:25,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1100142.0, ans=0.0 2023-06-22 00:32:36,691 INFO [train.py:996] (1/4) Epoch 7, batch 400, loss[loss=0.246, simple_loss=0.3116, pruned_loss=0.09019, over 21817.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3221, pruned_loss=0.08751, over 3701823.52 frames. ], batch size: 118, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:32:37,162 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:33:05,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1100262.0, ans=0.0 2023-06-22 00:33:15,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1100262.0, ans=0.125 2023-06-22 00:33:17,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1100322.0, ans=0.125 2023-06-22 00:33:20,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100322.0, ans=0.1 2023-06-22 00:33:29,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.211e+02 3.756e+02 4.853e+02 8.203e+02, threshold=7.513e+02, percent-clipped=4.0 2023-06-22 00:33:43,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. 
limit=12.0 2023-06-22 00:34:14,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1100502.0, ans=0.0 2023-06-22 00:34:15,921 INFO [train.py:996] (1/4) Epoch 7, batch 450, loss[loss=0.2303, simple_loss=0.3346, pruned_loss=0.06302, over 21659.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3174, pruned_loss=0.08596, over 3835517.57 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:34:39,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1100562.0, ans=0.125 2023-06-22 00:34:52,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1100562.0, ans=0.09899494936611666 2023-06-22 00:35:28,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-22 00:36:00,505 INFO [train.py:996] (1/4) Epoch 7, batch 500, loss[loss=0.1797, simple_loss=0.2327, pruned_loss=0.06329, over 20739.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3173, pruned_loss=0.0853, over 3932696.14 frames. ], batch size: 609, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:36:05,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-22 00:36:50,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.984e+02 3.762e+02 4.525e+02 7.787e+02, threshold=7.525e+02, percent-clipped=1.0 2023-06-22 00:37:15,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1100982.0, ans=0.125 2023-06-22 00:37:18,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1101042.0, ans=0.2 2023-06-22 00:37:42,373 INFO [train.py:996] (1/4) Epoch 7, batch 550, loss[loss=0.2698, simple_loss=0.3704, pruned_loss=0.08456, over 21772.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3169, pruned_loss=0.08454, over 4013731.64 frames. ], batch size: 282, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:38:26,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1101222.0, ans=0.0 2023-06-22 00:38:29,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-22 00:39:20,766 INFO [train.py:996] (1/4) Epoch 7, batch 600, loss[loss=0.269, simple_loss=0.3932, pruned_loss=0.07237, over 19756.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3227, pruned_loss=0.08441, over 4070684.84 frames. ], batch size: 702, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:39:21,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1101402.0, ans=0.125 2023-06-22 00:39:31,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1101402.0, ans=0.125 2023-06-22 00:40:10,123 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. 
limit=6.0 2023-06-22 00:40:10,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.944e+02 3.479e+02 4.173e+02 5.834e+02, threshold=6.959e+02, percent-clipped=0.0 2023-06-22 00:40:14,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1101582.0, ans=0.125 2023-06-22 00:41:00,136 INFO [train.py:996] (1/4) Epoch 7, batch 650, loss[loss=0.2704, simple_loss=0.3178, pruned_loss=0.1115, over 21757.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3238, pruned_loss=0.08448, over 4121382.07 frames. ], batch size: 508, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:41:22,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-22 00:41:30,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1101762.0, ans=0.1 2023-06-22 00:41:51,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1101882.0, ans=0.125 2023-06-22 00:42:07,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1101882.0, ans=0.0 2023-06-22 00:42:28,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1101942.0, ans=0.125 2023-06-22 00:42:34,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1101942.0, ans=0.0 2023-06-22 00:42:38,142 INFO [train.py:996] (1/4) Epoch 7, batch 700, loss[loss=0.234, simple_loss=0.3025, pruned_loss=0.08272, over 21716.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3257, pruned_loss=0.08471, over 4161567.88 frames. ], batch size: 230, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:43:05,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-22 00:43:27,979 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.306e+02 4.325e+02 5.540e+02 9.236e+02, threshold=8.651e+02, percent-clipped=10.0 2023-06-22 00:43:29,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.67 vs. limit=5.0 2023-06-22 00:43:50,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1102182.0, ans=0.0 2023-06-22 00:44:03,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-22 00:44:11,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1102242.0, ans=0.125 2023-06-22 00:44:16,059 INFO [train.py:996] (1/4) Epoch 7, batch 750, loss[loss=0.2724, simple_loss=0.3313, pruned_loss=0.1067, over 21828.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3263, pruned_loss=0.08562, over 4182014.69 frames. ], batch size: 118, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:44:20,135 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.34 vs. 
limit=22.5 2023-06-22 00:45:53,763 INFO [train.py:996] (1/4) Epoch 7, batch 800, loss[loss=0.2916, simple_loss=0.368, pruned_loss=0.1076, over 21769.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3227, pruned_loss=0.08614, over 4195484.95 frames. ], batch size: 298, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:45:57,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1102602.0, ans=0.1 2023-06-22 00:46:13,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1102662.0, ans=0.1 2023-06-22 00:46:24,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1102662.0, ans=0.125 2023-06-22 00:46:41,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1102722.0, ans=0.0 2023-06-22 00:46:42,862 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.353e+02 3.931e+02 5.238e+02 1.056e+03, threshold=7.862e+02, percent-clipped=1.0 2023-06-22 00:46:46,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1102782.0, ans=0.125 2023-06-22 00:47:31,051 INFO [train.py:996] (1/4) Epoch 7, batch 850, loss[loss=0.2238, simple_loss=0.2925, pruned_loss=0.07755, over 21818.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3208, pruned_loss=0.08634, over 4218109.08 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:47:47,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1102902.0, ans=0.0 2023-06-22 00:47:49,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1102902.0, ans=0.125 2023-06-22 00:48:03,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1102962.0, ans=0.125 2023-06-22 00:49:04,188 INFO [train.py:996] (1/4) Epoch 7, batch 900, loss[loss=0.2424, simple_loss=0.3082, pruned_loss=0.08827, over 21327.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3169, pruned_loss=0.08589, over 4234747.58 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:49:51,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1103322.0, ans=0.125 2023-06-22 00:49:53,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.788e+02 3.273e+02 3.960e+02 6.263e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-22 00:50:07,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1103382.0, ans=0.95 2023-06-22 00:50:48,016 INFO [train.py:996] (1/4) Epoch 7, batch 950, loss[loss=0.3041, simple_loss=0.3579, pruned_loss=0.1251, over 21567.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3155, pruned_loss=0.08557, over 4250102.82 frames. 
], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:50:58,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1103502.0, ans=0.0 2023-06-22 00:51:02,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1103562.0, ans=0.5 2023-06-22 00:51:07,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1103562.0, ans=0.125 2023-06-22 00:51:07,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1103562.0, ans=0.2 2023-06-22 00:51:11,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-22 00:51:15,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1103562.0, ans=0.2 2023-06-22 00:51:17,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1103562.0, ans=0.125 2023-06-22 00:51:58,739 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:52:26,751 INFO [train.py:996] (1/4) Epoch 7, batch 1000, loss[loss=0.2637, simple_loss=0.3244, pruned_loss=0.1015, over 21563.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3156, pruned_loss=0.08553, over 4254660.04 frames. ], batch size: 548, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:52:42,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-22 00:53:25,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 2.971e+02 3.488e+02 4.258e+02 7.403e+02, threshold=6.977e+02, percent-clipped=1.0 2023-06-22 00:53:31,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-22 00:53:41,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1103982.0, ans=0.07 2023-06-22 00:53:54,517 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:53:54,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1104042.0, ans=0.0 2023-06-22 00:54:03,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1104042.0, ans=0.0 2023-06-22 00:54:07,788 INFO [train.py:996] (1/4) Epoch 7, batch 1050, loss[loss=0.2384, simple_loss=0.3014, pruned_loss=0.08766, over 21866.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3131, pruned_loss=0.08487, over 4269626.90 frames. ], batch size: 298, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:55:41,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1104342.0, ans=0.0 2023-06-22 00:55:47,387 INFO [train.py:996] (1/4) Epoch 7, batch 1100, loss[loss=0.2557, simple_loss=0.3217, pruned_loss=0.09479, over 21478.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3122, pruned_loss=0.08463, over 4272594.89 frames. 
], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:55:54,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-22 00:56:13,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1104462.0, ans=0.0 2023-06-22 00:56:19,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1104462.0, ans=0.035 2023-06-22 00:56:40,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104522.0, ans=0.1 2023-06-22 00:56:48,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.865e+02 3.379e+02 3.962e+02 8.205e+02, threshold=6.758e+02, percent-clipped=2.0 2023-06-22 00:56:50,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1104582.0, ans=0.125 2023-06-22 00:56:52,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-22 00:56:55,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-22 00:57:27,382 INFO [train.py:996] (1/4) Epoch 7, batch 1150, loss[loss=0.1844, simple_loss=0.2716, pruned_loss=0.04866, over 21616.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3138, pruned_loss=0.08455, over 4281051.33 frames. ], batch size: 230, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:57:30,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-22 00:58:02,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1104762.0, ans=0.125 2023-06-22 00:58:23,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1104822.0, ans=0.2 2023-06-22 00:58:25,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1104822.0, ans=0.125 2023-06-22 00:58:30,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1104882.0, ans=0.0 2023-06-22 00:58:47,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104882.0, ans=0.1 2023-06-22 00:59:04,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1104942.0, ans=0.0 2023-06-22 00:59:05,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1104942.0, ans=0.125 2023-06-22 00:59:08,527 INFO [train.py:996] (1/4) Epoch 7, batch 1200, loss[loss=0.2358, simple_loss=0.3121, pruned_loss=0.07973, over 21411.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3165, pruned_loss=0.08456, over 4277767.73 frames. 
], batch size: 194, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:59:33,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1105062.0, ans=0.0 2023-06-22 01:00:09,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1105182.0, ans=0.0 2023-06-22 01:00:10,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.150e+02 3.657e+02 4.212e+02 7.667e+02, threshold=7.313e+02, percent-clipped=2.0 2023-06-22 01:00:39,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1105242.0, ans=0.125 2023-06-22 01:00:43,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.95 vs. limit=12.0 2023-06-22 01:00:48,917 INFO [train.py:996] (1/4) Epoch 7, batch 1250, loss[loss=0.3291, simple_loss=0.4061, pruned_loss=0.1261, over 21591.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3156, pruned_loss=0.08448, over 4279494.01 frames. ], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:01:40,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1105422.0, ans=0.0 2023-06-22 01:02:28,470 INFO [train.py:996] (1/4) Epoch 7, batch 1300, loss[loss=0.2928, simple_loss=0.3509, pruned_loss=0.1173, over 21790.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3179, pruned_loss=0.08482, over 4285958.27 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:02:57,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1105662.0, ans=0.2 2023-06-22 01:03:19,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1105722.0, ans=0.04949747468305833 2023-06-22 01:03:20,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1105722.0, ans=0.125 2023-06-22 01:03:25,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1105722.0, ans=0.0 2023-06-22 01:03:36,718 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.032e+02 3.671e+02 4.552e+02 8.321e+02, threshold=7.341e+02, percent-clipped=3.0 2023-06-22 01:03:39,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=15.0 2023-06-22 01:04:01,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-22 01:04:13,383 INFO [train.py:996] (1/4) Epoch 7, batch 1350, loss[loss=0.2111, simple_loss=0.2949, pruned_loss=0.06367, over 21798.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3195, pruned_loss=0.08595, over 4290574.92 frames. 
], batch size: 351, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:04:18,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1105902.0, ans=0.125 2023-06-22 01:04:27,980 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:04:34,203 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:05:20,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106082.0, ans=0.1 2023-06-22 01:05:51,571 INFO [train.py:996] (1/4) Epoch 7, batch 1400, loss[loss=0.2241, simple_loss=0.2914, pruned_loss=0.07841, over 21700.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3167, pruned_loss=0.08564, over 4281126.05 frames. ], batch size: 316, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:06:43,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1106322.0, ans=0.0 2023-06-22 01:06:54,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.431e+02 3.168e+02 3.476e+02 4.011e+02 7.450e+02, threshold=6.951e+02, percent-clipped=1.0 2023-06-22 01:07:12,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-22 01:07:26,228 INFO [train.py:996] (1/4) Epoch 7, batch 1450, loss[loss=0.2651, simple_loss=0.332, pruned_loss=0.0991, over 21798.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3182, pruned_loss=0.08663, over 4288002.07 frames. ], batch size: 282, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:07:44,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-22 01:07:46,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-22 01:07:57,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106562.0, ans=0.1 2023-06-22 01:08:08,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106562.0, ans=0.1 2023-06-22 01:08:21,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1106622.0, ans=0.125 2023-06-22 01:08:22,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1106622.0, ans=0.125 2023-06-22 01:09:11,613 INFO [train.py:996] (1/4) Epoch 7, batch 1500, loss[loss=0.2135, simple_loss=0.2752, pruned_loss=0.07591, over 21676.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3208, pruned_loss=0.08785, over 4291406.80 frames. ], batch size: 333, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:09:15,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1106802.0, ans=0.125 2023-06-22 01:10:12,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. 
limit=22.5 2023-06-22 01:10:14,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1106982.0, ans=0.1 2023-06-22 01:10:15,567 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.952e+02 3.357e+02 3.782e+02 8.287e+02, threshold=6.713e+02, percent-clipped=2.0 2023-06-22 01:10:16,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1106982.0, ans=0.0 2023-06-22 01:11:03,405 INFO [train.py:996] (1/4) Epoch 7, batch 1550, loss[loss=0.2666, simple_loss=0.3594, pruned_loss=0.0869, over 21713.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3203, pruned_loss=0.08701, over 4292761.91 frames. ], batch size: 298, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:11:28,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-22 01:11:54,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107222.0, ans=0.1 2023-06-22 01:12:17,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1107342.0, ans=0.125 2023-06-22 01:12:23,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107342.0, ans=0.1 2023-06-22 01:12:43,993 INFO [train.py:996] (1/4) Epoch 7, batch 1600, loss[loss=0.1907, simple_loss=0.2485, pruned_loss=0.0664, over 21760.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3182, pruned_loss=0.08617, over 4286068.70 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 01:13:30,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1107522.0, ans=0.0 2023-06-22 01:13:35,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-22 01:13:39,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.079e+02 3.590e+02 4.621e+02 8.115e+02, threshold=7.180e+02, percent-clipped=4.0 2023-06-22 01:14:03,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107642.0, ans=0.1 2023-06-22 01:14:25,772 INFO [train.py:996] (1/4) Epoch 7, batch 1650, loss[loss=0.2392, simple_loss=0.3058, pruned_loss=0.08624, over 20135.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3174, pruned_loss=0.08571, over 4286298.32 frames. ], batch size: 703, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:14:30,968 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:15:12,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1107822.0, ans=0.5 2023-06-22 01:15:21,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-22 01:15:34,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=15.0 2023-06-22 01:15:50,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-22 01:15:54,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2023-06-22 01:15:58,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-22 01:16:07,486 INFO [train.py:996] (1/4) Epoch 7, batch 1700, loss[loss=0.2594, simple_loss=0.3232, pruned_loss=0.0978, over 20088.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3196, pruned_loss=0.0863, over 4281677.24 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:16:31,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1108062.0, ans=0.05 2023-06-22 01:16:31,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1108062.0, ans=0.125 2023-06-22 01:17:00,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1108122.0, ans=0.0 2023-06-22 01:17:13,169 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.011e+02 3.603e+02 4.344e+02 6.909e+02, threshold=7.205e+02, percent-clipped=0.0 2023-06-22 01:17:29,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-22 01:17:43,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1108242.0, ans=0.95 2023-06-22 01:17:54,158 INFO [train.py:996] (1/4) Epoch 7, batch 1750, loss[loss=0.2308, simple_loss=0.3072, pruned_loss=0.07724, over 20660.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3188, pruned_loss=0.08504, over 4279029.56 frames. ], batch size: 607, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:17:54,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1108302.0, ans=0.2 2023-06-22 01:18:41,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1108422.0, ans=0.0 2023-06-22 01:18:41,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1108422.0, ans=0.2 2023-06-22 01:19:00,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1108482.0, ans=0.125 2023-06-22 01:19:13,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1108542.0, ans=0.125 2023-06-22 01:19:36,784 INFO [train.py:996] (1/4) Epoch 7, batch 1800, loss[loss=0.2851, simple_loss=0.3604, pruned_loss=0.1049, over 21761.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3162, pruned_loss=0.08259, over 4271352.25 frames. 
], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:19:48,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1108602.0, ans=0.0 2023-06-22 01:20:05,283 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:20:30,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=15.0 2023-06-22 01:20:36,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.999e+02 3.807e+02 4.640e+02 8.092e+02, threshold=7.614e+02, percent-clipped=1.0 2023-06-22 01:20:51,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1108842.0, ans=0.125 2023-06-22 01:21:12,537 INFO [train.py:996] (1/4) Epoch 7, batch 1850, loss[loss=0.1927, simple_loss=0.2437, pruned_loss=0.07083, over 20042.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3176, pruned_loss=0.08175, over 4271907.69 frames. ], batch size: 702, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:21:16,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1108902.0, ans=0.04949747468305833 2023-06-22 01:21:17,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1108902.0, ans=0.125 2023-06-22 01:21:29,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1108962.0, ans=0.0 2023-06-22 01:21:30,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108962.0, ans=0.1 2023-06-22 01:21:40,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1108962.0, ans=0.125 2023-06-22 01:22:19,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109082.0, ans=0.1 2023-06-22 01:22:20,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1109082.0, ans=0.0 2023-06-22 01:22:36,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1109142.0, ans=0.125 2023-06-22 01:22:42,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1109142.0, ans=0.125 2023-06-22 01:22:52,003 INFO [train.py:996] (1/4) Epoch 7, batch 1900, loss[loss=0.188, simple_loss=0.263, pruned_loss=0.05648, over 21738.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3173, pruned_loss=0.08196, over 4267791.38 frames. ], batch size: 282, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:23:55,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.011e+02 3.315e+02 4.225e+02 7.544e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-22 01:24:31,575 INFO [train.py:996] (1/4) Epoch 7, batch 1950, loss[loss=0.2026, simple_loss=0.3002, pruned_loss=0.05253, over 21721.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3132, pruned_loss=0.08087, over 4266385.40 frames. 
], batch size: 352, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:25:00,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1109562.0, ans=15.0 2023-06-22 01:25:27,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109622.0, ans=0.1 2023-06-22 01:25:30,701 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:26:05,443 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:26:08,040 INFO [train.py:996] (1/4) Epoch 7, batch 2000, loss[loss=0.251, simple_loss=0.3213, pruned_loss=0.09034, over 20677.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3104, pruned_loss=0.07927, over 4267632.19 frames. ], batch size: 607, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:26:29,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1109862.0, ans=0.0 2023-06-22 01:27:02,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1109922.0, ans=0.2 2023-06-22 01:27:05,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1109982.0, ans=0.2 2023-06-22 01:27:08,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.011e+02 3.534e+02 4.204e+02 7.079e+02, threshold=7.069e+02, percent-clipped=1.0 2023-06-22 01:27:10,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1109982.0, ans=0.1 2023-06-22 01:27:43,235 INFO [train.py:996] (1/4) Epoch 7, batch 2050, loss[loss=0.2427, simple_loss=0.3114, pruned_loss=0.08697, over 21776.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3124, pruned_loss=0.08082, over 4278688.72 frames. ], batch size: 316, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:28:19,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1110162.0, ans=0.125 2023-06-22 01:28:45,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-22 01:29:06,903 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:29:28,354 INFO [train.py:996] (1/4) Epoch 7, batch 2100, loss[loss=0.2393, simple_loss=0.3075, pruned_loss=0.08558, over 21537.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3168, pruned_loss=0.0829, over 4272396.95 frames. ], batch size: 230, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:30:12,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1110522.0, ans=0.0 2023-06-22 01:30:34,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.462e+02 4.025e+02 4.907e+02 9.309e+02, threshold=8.051e+02, percent-clipped=5.0 2023-06-22 01:31:07,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1110702.0, ans=0.025 2023-06-22 01:31:08,540 INFO [train.py:996] (1/4) Epoch 7, batch 2150, loss[loss=0.2024, simple_loss=0.2753, pruned_loss=0.06469, over 21602.00 frames. 
], tot_loss[loss=0.2427, simple_loss=0.3171, pruned_loss=0.08414, over 4259430.49 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:31:12,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1110702.0, ans=10.0 2023-06-22 01:32:02,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1110822.0, ans=0.125 2023-06-22 01:32:18,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1110882.0, ans=0.125 2023-06-22 01:32:19,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-22 01:32:48,381 INFO [train.py:996] (1/4) Epoch 7, batch 2200, loss[loss=0.1809, simple_loss=0.2657, pruned_loss=0.04809, over 21400.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3176, pruned_loss=0.08381, over 4258365.10 frames. ], batch size: 194, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:33:48,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.551e+02 3.115e+02 3.820e+02 5.117e+02 8.192e+02, threshold=7.640e+02, percent-clipped=1.0 2023-06-22 01:33:59,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1111182.0, ans=0.2 2023-06-22 01:34:27,197 INFO [train.py:996] (1/4) Epoch 7, batch 2250, loss[loss=0.2051, simple_loss=0.2725, pruned_loss=0.06884, over 21617.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3138, pruned_loss=0.08236, over 4269715.58 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:35:09,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1111422.0, ans=0.125 2023-06-22 01:35:10,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-22 01:35:13,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-06-22 01:35:38,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1111482.0, ans=0.125 2023-06-22 01:36:02,574 INFO [train.py:996] (1/4) Epoch 7, batch 2300, loss[loss=0.2254, simple_loss=0.3183, pruned_loss=0.06623, over 21793.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3105, pruned_loss=0.08273, over 4267221.50 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:37:09,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.004e+02 3.529e+02 4.212e+02 9.324e+02, threshold=7.058e+02, percent-clipped=1.0 2023-06-22 01:37:42,539 INFO [train.py:996] (1/4) Epoch 7, batch 2350, loss[loss=0.2484, simple_loss=0.3038, pruned_loss=0.09649, over 21538.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3087, pruned_loss=0.08319, over 4269219.48 frames. 
], batch size: 391, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:37:49,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1111902.0, ans=0.125 2023-06-22 01:37:55,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1111902.0, ans=0.125 2023-06-22 01:38:23,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-22 01:38:27,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1112022.0, ans=0.0 2023-06-22 01:38:53,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1112082.0, ans=0.035 2023-06-22 01:38:58,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112142.0, ans=0.1 2023-06-22 01:39:10,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1112142.0, ans=0.0 2023-06-22 01:39:17,644 INFO [train.py:996] (1/4) Epoch 7, batch 2400, loss[loss=0.2685, simple_loss=0.3328, pruned_loss=0.1021, over 21711.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3108, pruned_loss=0.08457, over 4274511.90 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:39:45,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-22 01:40:25,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.130e+02 3.615e+02 4.219e+02 6.751e+02, threshold=7.231e+02, percent-clipped=0.0 2023-06-22 01:40:45,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1112442.0, ans=0.125 2023-06-22 01:40:48,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1112442.0, ans=0.2 2023-06-22 01:40:59,128 INFO [train.py:996] (1/4) Epoch 7, batch 2450, loss[loss=0.2058, simple_loss=0.2749, pruned_loss=0.0683, over 21491.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3151, pruned_loss=0.08571, over 4279904.75 frames. ], batch size: 230, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:41:30,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1112562.0, ans=0.125 2023-06-22 01:42:07,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.87 vs. limit=15.0 2023-06-22 01:42:10,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=12.0 2023-06-22 01:42:14,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1112682.0, ans=0.035 2023-06-22 01:42:31,505 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-22 01:42:33,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-22 01:42:40,172 INFO [train.py:996] (1/4) Epoch 7, batch 2500, loss[loss=0.2401, simple_loss=0.3229, pruned_loss=0.07862, over 21495.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3117, pruned_loss=0.08506, over 4274146.48 frames. ], batch size: 389, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:42:45,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1112802.0, ans=0.1 2023-06-22 01:43:22,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1112922.0, ans=0.125 2023-06-22 01:43:48,546 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.102e+02 3.610e+02 4.513e+02 8.483e+02, threshold=7.220e+02, percent-clipped=3.0 2023-06-22 01:44:21,338 INFO [train.py:996] (1/4) Epoch 7, batch 2550, loss[loss=0.2257, simple_loss=0.2926, pruned_loss=0.07936, over 21440.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3119, pruned_loss=0.08404, over 4268435.11 frames. ], batch size: 131, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:44:45,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.78 vs. limit=22.5 2023-06-22 01:45:21,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1113282.0, ans=0.1 2023-06-22 01:45:57,698 INFO [train.py:996] (1/4) Epoch 7, batch 2600, loss[loss=0.2776, simple_loss=0.3567, pruned_loss=0.09922, over 21323.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3167, pruned_loss=0.08705, over 4274348.58 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:46:06,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1113402.0, ans=0.0 2023-06-22 01:46:07,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1113402.0, ans=0.0 2023-06-22 01:46:19,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1113462.0, ans=0.125 2023-06-22 01:47:06,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.288e+02 3.616e+02 4.316e+02 7.089e+02, threshold=7.232e+02, percent-clipped=0.0 2023-06-22 01:47:39,085 INFO [train.py:996] (1/4) Epoch 7, batch 2650, loss[loss=0.2468, simple_loss=0.3065, pruned_loss=0.09359, over 21879.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3168, pruned_loss=0.08783, over 4278070.47 frames. ], batch size: 118, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:48:01,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1113702.0, ans=0.0 2023-06-22 01:48:02,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. 
limit=15.0 2023-06-22 01:48:10,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1113762.0, ans=10.0 2023-06-22 01:48:15,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1113762.0, ans=0.125 2023-06-22 01:48:23,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1113822.0, ans=0.2 2023-06-22 01:49:19,738 INFO [train.py:996] (1/4) Epoch 7, batch 2700, loss[loss=0.2431, simple_loss=0.3112, pruned_loss=0.08747, over 21818.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3139, pruned_loss=0.08692, over 4271414.33 frames. ], batch size: 316, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:49:54,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1114062.0, ans=0.125 2023-06-22 01:50:28,243 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 2.952e+02 3.435e+02 4.194e+02 7.834e+02, threshold=6.870e+02, percent-clipped=2.0 2023-06-22 01:50:50,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1114242.0, ans=0.0 2023-06-22 01:50:59,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1114302.0, ans=0.5 2023-06-22 01:51:00,684 INFO [train.py:996] (1/4) Epoch 7, batch 2750, loss[loss=0.2827, simple_loss=0.3397, pruned_loss=0.1129, over 21704.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3126, pruned_loss=0.08693, over 4272719.25 frames. ], batch size: 473, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:51:11,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1114302.0, ans=0.0 2023-06-22 01:51:22,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1114302.0, ans=0.2 2023-06-22 01:52:50,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1114542.0, ans=0.125 2023-06-22 01:52:53,305 INFO [train.py:996] (1/4) Epoch 7, batch 2800, loss[loss=0.265, simple_loss=0.3449, pruned_loss=0.09254, over 21643.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3174, pruned_loss=0.08814, over 4268119.87 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 01:53:57,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.206e+02 3.798e+02 4.545e+02 8.220e+02, threshold=7.596e+02, percent-clipped=2.0 2023-06-22 01:54:03,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1114782.0, ans=0.1 2023-06-22 01:54:16,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1114842.0, ans=0.125 2023-06-22 01:54:16,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1114842.0, ans=0.2 2023-06-22 01:54:35,974 INFO [train.py:996] (1/4) Epoch 7, batch 2850, loss[loss=0.2083, simple_loss=0.2867, pruned_loss=0.06497, over 21794.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3196, pruned_loss=0.08941, over 4272763.25 frames. 
], batch size: 333, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:54:46,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1114902.0, ans=0.0 2023-06-22 01:54:49,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1114902.0, ans=0.05 2023-06-22 01:55:47,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1115082.0, ans=0.2 2023-06-22 01:55:52,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1115142.0, ans=0.125 2023-06-22 01:56:16,777 INFO [train.py:996] (1/4) Epoch 7, batch 2900, loss[loss=0.3293, simple_loss=0.4102, pruned_loss=0.1242, over 21654.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3172, pruned_loss=0.08862, over 4276381.52 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:56:23,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-22 01:56:25,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1115202.0, ans=0.0 2023-06-22 01:56:41,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1115262.0, ans=0.125 2023-06-22 01:56:46,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1115262.0, ans=0.0 2023-06-22 01:57:22,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.184e+02 3.787e+02 4.850e+02 9.590e+02, threshold=7.574e+02, percent-clipped=4.0 2023-06-22 01:57:41,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1115442.0, ans=0.0 2023-06-22 01:57:58,398 INFO [train.py:996] (1/4) Epoch 7, batch 2950, loss[loss=0.2894, simple_loss=0.3513, pruned_loss=0.1137, over 21720.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3187, pruned_loss=0.0891, over 4282541.84 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:58:00,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1115502.0, ans=0.0 2023-06-22 01:58:44,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-22 01:58:55,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-22 01:59:39,996 INFO [train.py:996] (1/4) Epoch 7, batch 3000, loss[loss=0.2834, simple_loss=0.3505, pruned_loss=0.1082, over 21776.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3229, pruned_loss=0.08954, over 4278900.94 frames. ], batch size: 332, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:59:39,996 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 01:59:56,483 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3435, pruned_loss=0.07556, over 1796401.00 frames. 
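The per-batch tot_loss and periodic validation entries above follow a fixed layout, so loss curves can be recovered straight from this log. The sketch below is a minimal, hypothetical parser (it is not part of icefall); it assumes the entry formats stay exactly as printed here, and the "train-log.txt" path is only a placeholder for wherever this log is saved.

import re
from typing import List, Tuple

# Hypothetical helpers for extracting loss curves from a log shaped like the
# entries above. The regexes assume the exact "tot_loss[...]" and
# "validation: loss=..." layouts seen in this file.
TRAIN_RE = re.compile(
    r"Epoch (\d+), batch (\d+), .*?tot_loss\[loss=([\d.]+), "
    r"simple_loss=([\d.]+), pruned_loss=([\d.]+)",
    re.DOTALL,
)
VALID_RE = re.compile(
    r"Epoch (\d+), validation: loss=([\d.]+), "
    r"simple_loss=([\d.]+), pruned_loss=([\d.]+)"
)


def parse_losses(text: str) -> Tuple[List[tuple], List[tuple]]:
    """Return (train, valid) records parsed from raw log text.

    train: (epoch, batch, loss, simple_loss, pruned_loss)
    valid: (epoch, loss, simple_loss, pruned_loss)
    """
    train = [
        (int(e), int(b), float(l), float(s), float(p))
        for e, b, l, s, p in TRAIN_RE.findall(text)
    ]
    valid = [
        (int(e), float(l), float(s), float(p))
        for e, l, s, p in VALID_RE.findall(text)
    ]
    return train, valid


if __name__ == "__main__":
    # "train-log.txt" is a placeholder path for this log file.
    with open("train-log.txt") as f:
        train, valid = parse_losses(f.read())
    # Print the last few training points and all validation points.
    for epoch, batch, loss, *_ in train[-5:]:
        print(f"epoch {epoch} batch {batch}: tot_loss={loss:.4f}")
    for epoch, loss, *_ in valid:
        print(f"epoch {epoch} validation loss={loss:.4f}")

The resulting (epoch, batch, loss) tuples can be fed to any plotting tool to compare the smoothed tot_loss against the validation entries logged every valid_interval batches.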
2023-06-22 01:59:56,484 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 01:59:57,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1115802.0, ans=0.125 2023-06-22 02:01:11,448 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.189e+02 3.683e+02 4.814e+02 8.214e+02, threshold=7.366e+02, percent-clipped=1.0 2023-06-22 02:01:36,626 INFO [train.py:996] (1/4) Epoch 7, batch 3050, loss[loss=0.1737, simple_loss=0.2571, pruned_loss=0.04515, over 21390.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3249, pruned_loss=0.08862, over 4282102.10 frames. ], batch size: 194, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:02:37,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1116222.0, ans=0.125 2023-06-22 02:03:24,163 INFO [train.py:996] (1/4) Epoch 7, batch 3100, loss[loss=0.2168, simple_loss=0.3046, pruned_loss=0.06452, over 21784.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3238, pruned_loss=0.08729, over 4277876.74 frames. ], batch size: 282, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:03:33,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-22 02:03:34,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1116402.0, ans=0.125 2023-06-22 02:03:41,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1116402.0, ans=0.1 2023-06-22 02:03:58,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116462.0, ans=0.1 2023-06-22 02:04:34,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.895e+02 3.297e+02 4.092e+02 7.123e+02, threshold=6.595e+02, percent-clipped=0.0 2023-06-22 02:04:57,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1116642.0, ans=0.125 2023-06-22 02:05:11,853 INFO [train.py:996] (1/4) Epoch 7, batch 3150, loss[loss=0.3485, simple_loss=0.4134, pruned_loss=0.1418, over 21504.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3248, pruned_loss=0.0888, over 4274503.56 frames. ], batch size: 131, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:05:22,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1116702.0, ans=0.125 2023-06-22 02:06:02,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1116822.0, ans=0.125 2023-06-22 02:06:12,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 02:06:15,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.18 vs. limit=15.0 2023-06-22 02:06:48,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1116942.0, ans=0.125 2023-06-22 02:06:52,854 INFO [train.py:996] (1/4) Epoch 7, batch 3200, loss[loss=0.2283, simple_loss=0.3118, pruned_loss=0.07244, over 21723.00 frames. 
], tot_loss[loss=0.2488, simple_loss=0.3239, pruned_loss=0.08689, over 4280448.11 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 02:08:05,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.937e+02 3.458e+02 4.160e+02 8.829e+02, threshold=6.916e+02, percent-clipped=6.0 2023-06-22 02:08:23,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1117242.0, ans=0.125 2023-06-22 02:08:28,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1117242.0, ans=0.125 2023-06-22 02:08:34,751 INFO [train.py:996] (1/4) Epoch 7, batch 3250, loss[loss=0.2495, simple_loss=0.332, pruned_loss=0.08354, over 21602.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.327, pruned_loss=0.08922, over 4282405.59 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:09:22,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1117422.0, ans=0.2 2023-06-22 02:09:51,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-22 02:10:13,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-22 02:10:20,749 INFO [train.py:996] (1/4) Epoch 7, batch 3300, loss[loss=0.2487, simple_loss=0.3393, pruned_loss=0.0791, over 21214.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3211, pruned_loss=0.08841, over 4267110.32 frames. ], batch size: 549, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:10:59,478 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:11:16,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1117722.0, ans=0.125 2023-06-22 02:11:26,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.992e+02 3.641e+02 4.480e+02 7.487e+02, threshold=7.281e+02, percent-clipped=2.0 2023-06-22 02:11:43,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1117842.0, ans=0.0 2023-06-22 02:11:45,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1117842.0, ans=0.05 2023-06-22 02:11:45,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1117842.0, ans=0.0 2023-06-22 02:11:45,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1117842.0, ans=0.125 2023-06-22 02:11:48,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117842.0, ans=0.1 2023-06-22 02:11:50,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1117842.0, ans=0.125 2023-06-22 02:12:00,479 INFO [train.py:996] (1/4) Epoch 7, batch 3350, loss[loss=0.2972, simple_loss=0.3556, pruned_loss=0.1194, over 21580.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.324, pruned_loss=0.08916, over 4272118.88 frames. 
], batch size: 471, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:12:28,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-22 02:12:53,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1118022.0, ans=0.125 2023-06-22 02:13:16,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1118082.0, ans=0.0 2023-06-22 02:13:18,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118082.0, ans=0.1 2023-06-22 02:13:46,410 INFO [train.py:996] (1/4) Epoch 7, batch 3400, loss[loss=0.2607, simple_loss=0.3411, pruned_loss=0.09012, over 21595.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3239, pruned_loss=0.08944, over 4278597.02 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:14:27,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1118322.0, ans=15.0 2023-06-22 02:14:52,640 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.058e+02 3.533e+02 4.088e+02 6.686e+02, threshold=7.066e+02, percent-clipped=0.0 2023-06-22 02:14:54,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1118382.0, ans=0.125 2023-06-22 02:14:55,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1118382.0, ans=0.125 2023-06-22 02:15:20,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1118442.0, ans=0.125 2023-06-22 02:15:22,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118442.0, ans=0.1 2023-06-22 02:15:26,630 INFO [train.py:996] (1/4) Epoch 7, batch 3450, loss[loss=0.21, simple_loss=0.2923, pruned_loss=0.06387, over 21624.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3175, pruned_loss=0.08839, over 4274485.60 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:17:01,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-22 02:17:03,883 INFO [train.py:996] (1/4) Epoch 7, batch 3500, loss[loss=0.2465, simple_loss=0.3348, pruned_loss=0.07908, over 21728.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3261, pruned_loss=0.09133, over 4279734.62 frames. ], batch size: 247, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:17:53,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1118922.0, ans=0.125 2023-06-22 02:18:20,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.453e+02 3.920e+02 4.708e+02 8.175e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 02:18:44,637 INFO [train.py:996] (1/4) Epoch 7, batch 3550, loss[loss=0.2686, simple_loss=0.3322, pruned_loss=0.1025, over 21766.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3301, pruned_loss=0.09358, over 4273834.31 frames. 
], batch size: 118, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:18:53,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1119102.0, ans=0.07 2023-06-22 02:18:56,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.07 vs. limit=15.0 2023-06-22 02:19:08,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-22 02:19:47,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1119282.0, ans=0.0 2023-06-22 02:20:20,937 INFO [train.py:996] (1/4) Epoch 7, batch 3600, loss[loss=0.266, simple_loss=0.3121, pruned_loss=0.11, over 21161.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3233, pruned_loss=0.09188, over 4271214.40 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:20:28,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-22 02:21:15,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1119522.0, ans=0.125 2023-06-22 02:21:38,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.375e+02 4.106e+02 5.045e+02 9.366e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-22 02:21:42,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1119582.0, ans=0.125 2023-06-22 02:21:53,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1119642.0, ans=0.1 2023-06-22 02:22:03,421 INFO [train.py:996] (1/4) Epoch 7, batch 3650, loss[loss=0.1692, simple_loss=0.2021, pruned_loss=0.06813, over 16965.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.325, pruned_loss=0.09278, over 4261113.49 frames. ], batch size: 60, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:22:18,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1119702.0, ans=0.2 2023-06-22 02:23:23,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-22 02:23:31,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1119942.0, ans=0.125 2023-06-22 02:23:43,418 INFO [train.py:996] (1/4) Epoch 7, batch 3700, loss[loss=0.2383, simple_loss=0.3166, pruned_loss=0.07996, over 21221.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3239, pruned_loss=0.0913, over 4267307.79 frames. 
], batch size: 176, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:23:58,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1120002.0, ans=0.125 2023-06-22 02:24:31,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1120122.0, ans=0.0 2023-06-22 02:25:01,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1120182.0, ans=12.0 2023-06-22 02:25:01,663 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.250e+02 3.889e+02 4.859e+02 8.141e+02, threshold=7.777e+02, percent-clipped=0.0 2023-06-22 02:25:07,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1120242.0, ans=0.05 2023-06-22 02:25:24,512 INFO [train.py:996] (1/4) Epoch 7, batch 3750, loss[loss=0.1815, simple_loss=0.2468, pruned_loss=0.05807, over 21149.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.323, pruned_loss=0.09128, over 4271909.16 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:26:40,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-22 02:26:41,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1120482.0, ans=0.125 2023-06-22 02:26:56,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1120542.0, ans=0.1 2023-06-22 02:27:04,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1120542.0, ans=0.125 2023-06-22 02:27:10,043 INFO [train.py:996] (1/4) Epoch 7, batch 3800, loss[loss=0.2802, simple_loss=0.3485, pruned_loss=0.106, over 21533.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3212, pruned_loss=0.09019, over 4269531.35 frames. ], batch size: 131, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:27:33,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1120662.0, ans=0.125 2023-06-22 02:27:53,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1120722.0, ans=0.125 2023-06-22 02:28:07,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1120782.0, ans=0.125 2023-06-22 02:28:18,661 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.223e+02 3.848e+02 4.886e+02 9.152e+02, threshold=7.696e+02, percent-clipped=1.0 2023-06-22 02:28:46,257 INFO [train.py:996] (1/4) Epoch 7, batch 3850, loss[loss=0.2241, simple_loss=0.2844, pruned_loss=0.08184, over 21739.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3179, pruned_loss=0.08999, over 4267650.33 frames. ], batch size: 112, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:30:25,866 INFO [train.py:996] (1/4) Epoch 7, batch 3900, loss[loss=0.2645, simple_loss=0.3169, pruned_loss=0.1061, over 15239.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.313, pruned_loss=0.08974, over 4271201.98 frames. 
], batch size: 61, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:30:28,475 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-22 02:30:30,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-22 02:30:42,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1121202.0, ans=0.0 2023-06-22 02:30:51,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1121262.0, ans=0.0 2023-06-22 02:31:07,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1121322.0, ans=0.1 2023-06-22 02:31:17,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1121322.0, ans=0.5 2023-06-22 02:31:36,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-22 02:31:38,632 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.017e+02 3.574e+02 4.086e+02 6.704e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-22 02:31:57,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121442.0, ans=0.1 2023-06-22 02:32:06,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1121442.0, ans=0.125 2023-06-22 02:32:07,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1121442.0, ans=0.125 2023-06-22 02:32:11,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1121502.0, ans=0.07 2023-06-22 02:32:12,134 INFO [train.py:996] (1/4) Epoch 7, batch 3950, loss[loss=0.2085, simple_loss=0.2735, pruned_loss=0.07173, over 21268.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3145, pruned_loss=0.08817, over 4265857.26 frames. ], batch size: 159, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:32:19,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1121502.0, ans=0.125 2023-06-22 02:33:12,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1121682.0, ans=0.0 2023-06-22 02:33:21,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1121682.0, ans=0.0 2023-06-22 02:33:33,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1121742.0, ans=0.05 2023-06-22 02:33:53,248 INFO [train.py:996] (1/4) Epoch 7, batch 4000, loss[loss=0.2599, simple_loss=0.3413, pruned_loss=0.08927, over 19833.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3072, pruned_loss=0.08441, over 4268408.93 frames. 
], batch size: 702, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:34:19,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1121862.0, ans=0.125 2023-06-22 02:34:21,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1121862.0, ans=0.1 2023-06-22 02:35:00,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.938e+02 3.344e+02 4.048e+02 7.852e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-22 02:35:34,241 INFO [train.py:996] (1/4) Epoch 7, batch 4050, loss[loss=0.225, simple_loss=0.3174, pruned_loss=0.06633, over 21610.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3077, pruned_loss=0.08312, over 4269874.97 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:35:45,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1122102.0, ans=0.0 2023-06-22 02:35:50,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1122102.0, ans=0.0 2023-06-22 02:36:23,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1122222.0, ans=0.1 2023-06-22 02:36:40,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-22 02:36:56,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-22 02:37:13,271 INFO [train.py:996] (1/4) Epoch 7, batch 4100, loss[loss=0.2509, simple_loss=0.3272, pruned_loss=0.08733, over 21842.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.31, pruned_loss=0.08351, over 4281431.57 frames. ], batch size: 414, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:37:49,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1122462.0, ans=0.125 2023-06-22 02:38:26,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.663e+02 2.992e+02 3.569e+02 4.943e+02, threshold=5.983e+02, percent-clipped=0.0 2023-06-22 02:38:29,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=15.0 2023-06-22 02:38:53,889 INFO [train.py:996] (1/4) Epoch 7, batch 4150, loss[loss=0.254, simple_loss=0.3342, pruned_loss=0.08686, over 21732.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3108, pruned_loss=0.08106, over 4284777.66 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:39:16,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1122702.0, ans=0.1 2023-06-22 02:40:13,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1122882.0, ans=0.125 2023-06-22 02:40:45,130 INFO [train.py:996] (1/4) Epoch 7, batch 4200, loss[loss=0.1999, simple_loss=0.2888, pruned_loss=0.0555, over 21426.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3123, pruned_loss=0.08177, over 4273491.86 frames. 
], batch size: 212, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:41:15,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1123062.0, ans=0.0 2023-06-22 02:41:56,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.116e+02 3.775e+02 4.930e+02 8.993e+02, threshold=7.550e+02, percent-clipped=12.0 2023-06-22 02:42:16,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-22 02:42:27,364 INFO [train.py:996] (1/4) Epoch 7, batch 4250, loss[loss=0.2652, simple_loss=0.3372, pruned_loss=0.09657, over 21775.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3212, pruned_loss=0.08417, over 4280910.42 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:42:45,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1123362.0, ans=0.2 2023-06-22 02:43:29,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1123482.0, ans=0.125 2023-06-22 02:43:29,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1123482.0, ans=0.2 2023-06-22 02:44:10,835 INFO [train.py:996] (1/4) Epoch 7, batch 4300, loss[loss=0.236, simple_loss=0.3282, pruned_loss=0.07187, over 21752.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3259, pruned_loss=0.0864, over 4273231.66 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:44:11,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1123602.0, ans=0.0 2023-06-22 02:44:12,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-22 02:45:28,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1123782.0, ans=0.125 2023-06-22 02:45:31,295 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.519e+02 4.292e+02 5.383e+02 8.752e+02, threshold=8.584e+02, percent-clipped=3.0 2023-06-22 02:45:49,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1123842.0, ans=0.125 2023-06-22 02:45:52,089 INFO [train.py:996] (1/4) Epoch 7, batch 4350, loss[loss=0.2246, simple_loss=0.2832, pruned_loss=0.08299, over 21362.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3235, pruned_loss=0.0851, over 4259759.24 frames. ], batch size: 160, lr: 4.43e-03, grad_scale: 8.0 2023-06-22 02:46:06,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-22 02:46:08,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1123902.0, ans=0.125 2023-06-22 02:47:08,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124082.0, ans=0.1 2023-06-22 02:47:39,364 INFO [train.py:996] (1/4) Epoch 7, batch 4400, loss[loss=0.2124, simple_loss=0.3006, pruned_loss=0.06209, over 21618.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.321, pruned_loss=0.08429, over 4256121.87 frames. 
], batch size: 263, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:47:44,889 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:47:47,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-22 02:48:17,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1124262.0, ans=0.95 2023-06-22 02:48:48,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1124382.0, ans=0.0 2023-06-22 02:48:57,462 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.105e+02 3.566e+02 4.210e+02 6.733e+02, threshold=7.132e+02, percent-clipped=0.0 2023-06-22 02:49:21,684 INFO [train.py:996] (1/4) Epoch 7, batch 4450, loss[loss=0.292, simple_loss=0.3705, pruned_loss=0.1068, over 21717.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3294, pruned_loss=0.08649, over 4264437.70 frames. ], batch size: 389, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:50:34,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-22 02:51:06,312 INFO [train.py:996] (1/4) Epoch 7, batch 4500, loss[loss=0.2259, simple_loss=0.3206, pruned_loss=0.0656, over 20889.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3301, pruned_loss=0.08847, over 4269494.12 frames. ], batch size: 608, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:51:48,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1124922.0, ans=0.0 2023-06-22 02:52:00,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1124922.0, ans=0.025 2023-06-22 02:52:22,687 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.121e+02 3.546e+02 4.306e+02 8.092e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-22 02:52:32,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1125042.0, ans=0.125 2023-06-22 02:52:47,556 INFO [train.py:996] (1/4) Epoch 7, batch 4550, loss[loss=0.3017, simple_loss=0.3701, pruned_loss=0.1167, over 21759.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3326, pruned_loss=0.08908, over 4272379.87 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:52:48,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.09 vs. limit=6.0 2023-06-22 02:53:00,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1125102.0, ans=0.0 2023-06-22 02:53:40,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125222.0, ans=0.1 2023-06-22 02:53:52,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. 
limit=15.0 2023-06-22 02:54:00,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125282.0, ans=0.1 2023-06-22 02:54:25,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-22 02:54:33,351 INFO [train.py:996] (1/4) Epoch 7, batch 4600, loss[loss=0.2285, simple_loss=0.3009, pruned_loss=0.0781, over 21748.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3328, pruned_loss=0.09001, over 4276919.85 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:55:25,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-22 02:55:43,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.156e+02 3.559e+02 4.324e+02 6.713e+02, threshold=7.117e+02, percent-clipped=0.0 2023-06-22 02:56:13,825 INFO [train.py:996] (1/4) Epoch 7, batch 4650, loss[loss=0.1798, simple_loss=0.259, pruned_loss=0.05033, over 21786.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3255, pruned_loss=0.08777, over 4288646.31 frames. ], batch size: 282, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:56:38,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1125762.0, ans=0.125 2023-06-22 02:56:45,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-22 02:56:54,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1125822.0, ans=0.125 2023-06-22 02:57:08,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-22 02:57:12,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1125882.0, ans=0.05 2023-06-22 02:57:14,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1125882.0, ans=0.2 2023-06-22 02:57:52,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1126002.0, ans=0.125 2023-06-22 02:57:58,282 INFO [train.py:996] (1/4) Epoch 7, batch 4700, loss[loss=0.2739, simple_loss=0.3801, pruned_loss=0.08387, over 21198.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3155, pruned_loss=0.08509, over 4286625.20 frames. ], batch size: 548, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:57:58,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126002.0, ans=0.1 2023-06-22 02:57:59,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. 
limit=15.0 2023-06-22 02:58:33,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1126122.0, ans=0.125 2023-06-22 02:58:58,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1126182.0, ans=0.0 2023-06-22 02:59:03,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.887e+02 3.239e+02 4.275e+02 6.571e+02, threshold=6.478e+02, percent-clipped=0.0 2023-06-22 02:59:24,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1126242.0, ans=0.0 2023-06-22 02:59:27,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=15.0 2023-06-22 02:59:31,472 INFO [train.py:996] (1/4) Epoch 7, batch 4750, loss[loss=0.2122, simple_loss=0.2801, pruned_loss=0.07211, over 21654.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3108, pruned_loss=0.08492, over 4290638.50 frames. ], batch size: 230, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:59:35,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1126302.0, ans=0.125 2023-06-22 02:59:50,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-22 03:00:12,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1126422.0, ans=0.0 2023-06-22 03:00:18,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126422.0, ans=0.1 2023-06-22 03:01:13,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1126542.0, ans=0.125 2023-06-22 03:01:16,466 INFO [train.py:996] (1/4) Epoch 7, batch 4800, loss[loss=0.2307, simple_loss=0.3169, pruned_loss=0.07225, over 21719.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3123, pruned_loss=0.08541, over 4297737.14 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 32.0 2023-06-22 03:01:18,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1126602.0, ans=0.125 2023-06-22 03:01:23,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1126602.0, ans=0.125 2023-06-22 03:01:34,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1126662.0, ans=0.0 2023-06-22 03:01:55,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1126722.0, ans=0.0 2023-06-22 03:02:17,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-22 03:02:23,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.114e+02 3.560e+02 4.146e+02 5.866e+02, threshold=7.121e+02, percent-clipped=0.0 2023-06-22 03:02:39,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126842.0, ans=0.1 2023-06-22 03:02:50,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1126842.0, ans=0.0 2023-06-22 03:02:56,172 INFO [train.py:996] (1/4) Epoch 7, batch 4850, loss[loss=0.2357, simple_loss=0.3059, pruned_loss=0.08273, over 21695.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3111, pruned_loss=0.08513, over 4301895.34 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:03:19,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-22 03:03:32,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1127022.0, ans=0.125 2023-06-22 03:04:36,859 INFO [train.py:996] (1/4) Epoch 7, batch 4900, loss[loss=0.2306, simple_loss=0.3, pruned_loss=0.08066, over 21858.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3133, pruned_loss=0.08607, over 4307560.86 frames. ], batch size: 118, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:04:38,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-22 03:05:25,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1127322.0, ans=0.0 2023-06-22 03:05:55,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.055e+02 3.401e+02 3.973e+02 6.495e+02, threshold=6.802e+02, percent-clipped=0.0 2023-06-22 03:06:13,346 INFO [train.py:996] (1/4) Epoch 7, batch 4950, loss[loss=0.2194, simple_loss=0.3107, pruned_loss=0.06407, over 21724.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3149, pruned_loss=0.08363, over 4296042.79 frames. ], batch size: 351, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:07:21,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1127682.0, ans=0.1 2023-06-22 03:07:43,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1127742.0, ans=0.0 2023-06-22 03:07:50,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.52 vs. limit=10.0 2023-06-22 03:07:52,454 INFO [train.py:996] (1/4) Epoch 7, batch 5000, loss[loss=0.1667, simple_loss=0.2353, pruned_loss=0.04907, over 17666.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3133, pruned_loss=0.08037, over 4287134.43 frames. 
], batch size: 65, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:08:11,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1127862.0, ans=0.125 2023-06-22 03:09:08,697 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.797e+02 3.102e+02 3.690e+02 6.361e+02, threshold=6.203e+02, percent-clipped=0.0 2023-06-22 03:09:30,800 INFO [train.py:996] (1/4) Epoch 7, batch 5050, loss[loss=0.2379, simple_loss=0.361, pruned_loss=0.05742, over 20688.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3132, pruned_loss=0.0818, over 4296026.41 frames. ], batch size: 607, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:09:39,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1128102.0, ans=0.0 2023-06-22 03:09:58,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1128162.0, ans=0.125 2023-06-22 03:10:13,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1128222.0, ans=0.125 2023-06-22 03:11:05,903 INFO [train.py:996] (1/4) Epoch 7, batch 5100, loss[loss=0.2075, simple_loss=0.2807, pruned_loss=0.06719, over 21621.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3135, pruned_loss=0.08258, over 4296421.44 frames. ], batch size: 230, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:11:42,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1128522.0, ans=0.02 2023-06-22 03:11:55,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1128582.0, ans=0.2 2023-06-22 03:12:17,465 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.054e+02 3.634e+02 4.109e+02 8.042e+02, threshold=7.267e+02, percent-clipped=4.0 2023-06-22 03:12:40,011 INFO [train.py:996] (1/4) Epoch 7, batch 5150, loss[loss=0.238, simple_loss=0.3177, pruned_loss=0.07919, over 21869.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3125, pruned_loss=0.08326, over 4296284.28 frames. ], batch size: 371, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:12:42,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-22 03:13:35,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 03:14:05,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-06-22 03:14:12,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1128942.0, ans=0.125 2023-06-22 03:14:15,509 INFO [train.py:996] (1/4) Epoch 7, batch 5200, loss[loss=0.2664, simple_loss=0.354, pruned_loss=0.08935, over 21624.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3174, pruned_loss=0.08419, over 4293389.71 frames. ], batch size: 263, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:14:19,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. 
limit=15.0 2023-06-22 03:14:23,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1129002.0, ans=0.035 2023-06-22 03:14:30,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1129062.0, ans=0.0 2023-06-22 03:14:44,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129062.0, ans=0.1 2023-06-22 03:15:38,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.246e+02 3.920e+02 4.845e+02 8.696e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 03:15:54,276 INFO [train.py:996] (1/4) Epoch 7, batch 5250, loss[loss=0.1922, simple_loss=0.2679, pruned_loss=0.05827, over 16462.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3212, pruned_loss=0.08226, over 4284939.15 frames. ], batch size: 62, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:16:00,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129302.0, ans=0.1 2023-06-22 03:16:23,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1129362.0, ans=0.0 2023-06-22 03:16:46,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129422.0, ans=0.1 2023-06-22 03:17:17,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-22 03:17:20,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-22 03:17:24,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1129542.0, ans=0.125 2023-06-22 03:17:32,623 INFO [train.py:996] (1/4) Epoch 7, batch 5300, loss[loss=0.2541, simple_loss=0.3159, pruned_loss=0.0962, over 21663.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.321, pruned_loss=0.08438, over 4289264.09 frames. ], batch size: 263, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:17:33,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1129602.0, ans=0.125 2023-06-22 03:17:55,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1129662.0, ans=0.125 2023-06-22 03:17:58,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1129662.0, ans=0.0 2023-06-22 03:18:23,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1129722.0, ans=0.5 2023-06-22 03:18:54,583 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.014e+02 3.652e+02 4.152e+02 6.819e+02, threshold=7.305e+02, percent-clipped=0.0 2023-06-22 03:18:59,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129842.0, ans=0.1 2023-06-22 03:19:09,872 INFO [train.py:996] (1/4) Epoch 7, batch 5350, loss[loss=0.2408, simple_loss=0.3062, pruned_loss=0.08763, over 21907.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3196, pruned_loss=0.08617, over 4294655.30 frames. 
], batch size: 414, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:19:22,409 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-22 03:19:42,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-22 03:20:03,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1130022.0, ans=0.125 2023-06-22 03:20:29,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130082.0, ans=0.1 2023-06-22 03:20:33,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-22 03:20:50,439 INFO [train.py:996] (1/4) Epoch 7, batch 5400, loss[loss=0.2753, simple_loss=0.3381, pruned_loss=0.1062, over 21561.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3168, pruned_loss=0.08684, over 4293972.54 frames. ], batch size: 471, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:21:05,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1130262.0, ans=0.125 2023-06-22 03:21:24,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1130262.0, ans=0.125 2023-06-22 03:21:51,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1130322.0, ans=0.125 2023-06-22 03:22:03,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1130382.0, ans=0.2 2023-06-22 03:22:07,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1130382.0, ans=0.0 2023-06-22 03:22:14,205 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.826e+02 3.234e+02 3.815e+02 6.268e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-22 03:22:17,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1130442.0, ans=0.0 2023-06-22 03:22:26,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1130442.0, ans=0.1 2023-06-22 03:22:30,513 INFO [train.py:996] (1/4) Epoch 7, batch 5450, loss[loss=0.2979, simple_loss=0.3826, pruned_loss=0.1066, over 21563.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.318, pruned_loss=0.08558, over 4287376.41 frames. ], batch size: 471, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:22:37,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1130502.0, ans=0.125 2023-06-22 03:24:04,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1130742.0, ans=0.125 2023-06-22 03:24:12,397 INFO [train.py:996] (1/4) Epoch 7, batch 5500, loss[loss=0.2102, simple_loss=0.2964, pruned_loss=0.06202, over 21335.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3236, pruned_loss=0.08329, over 4284514.72 frames. 
], batch size: 176, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:25:31,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.055e+02 3.640e+02 4.334e+02 7.311e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-22 03:25:35,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1131042.0, ans=0.125 2023-06-22 03:25:57,930 INFO [train.py:996] (1/4) Epoch 7, batch 5550, loss[loss=0.1845, simple_loss=0.2777, pruned_loss=0.04564, over 21369.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3224, pruned_loss=0.08045, over 4284004.54 frames. ], batch size: 211, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:26:34,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131162.0, ans=0.1 2023-06-22 03:26:50,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1131222.0, ans=0.0 2023-06-22 03:26:56,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-22 03:27:41,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1131342.0, ans=0.025 2023-06-22 03:27:43,816 INFO [train.py:996] (1/4) Epoch 7, batch 5600, loss[loss=0.2292, simple_loss=0.3135, pruned_loss=0.07247, over 21239.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3182, pruned_loss=0.07752, over 4281354.96 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:27:59,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.08 vs. limit=10.0 2023-06-22 03:28:26,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1131522.0, ans=0.125 2023-06-22 03:28:26,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1131522.0, ans=0.05 2023-06-22 03:28:31,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1131522.0, ans=0.0 2023-06-22 03:28:53,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1131582.0, ans=0.02 2023-06-22 03:29:03,135 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.861e+02 3.641e+02 4.597e+02 1.091e+03, threshold=7.283e+02, percent-clipped=6.0 2023-06-22 03:29:22,534 INFO [train.py:996] (1/4) Epoch 7, batch 5650, loss[loss=0.2604, simple_loss=0.33, pruned_loss=0.09539, over 21784.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3205, pruned_loss=0.07971, over 4286779.01 frames. 
], batch size: 112, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:29:35,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1131702.0, ans=0.125 2023-06-22 03:29:51,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1131762.0, ans=0.125 2023-06-22 03:29:59,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1131822.0, ans=0.125 2023-06-22 03:30:16,255 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:30:48,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1131942.0, ans=0.125 2023-06-22 03:31:07,588 INFO [train.py:996] (1/4) Epoch 7, batch 5700, loss[loss=0.2897, simple_loss=0.3578, pruned_loss=0.1108, over 21618.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3207, pruned_loss=0.08145, over 4291580.26 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:31:21,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1132002.0, ans=0.0 2023-06-22 03:31:32,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132062.0, ans=0.1 2023-06-22 03:31:35,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1132062.0, ans=0.0 2023-06-22 03:31:52,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1132122.0, ans=0.125 2023-06-22 03:32:13,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1132182.0, ans=0.125 2023-06-22 03:32:33,708 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.008e+02 3.484e+02 4.188e+02 7.295e+02, threshold=6.968e+02, percent-clipped=1.0 2023-06-22 03:32:46,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1132242.0, ans=0.0 2023-06-22 03:32:48,520 INFO [train.py:996] (1/4) Epoch 7, batch 5750, loss[loss=0.2264, simple_loss=0.3152, pruned_loss=0.06879, over 21601.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3171, pruned_loss=0.07856, over 4287698.81 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:33:01,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1132302.0, ans=0.04949747468305833 2023-06-22 03:33:49,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-22 03:34:28,165 INFO [train.py:996] (1/4) Epoch 7, batch 5800, loss[loss=0.2711, simple_loss=0.366, pruned_loss=0.08809, over 21633.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3178, pruned_loss=0.07761, over 4286657.73 frames. 
], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:34:28,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1132602.0, ans=0.2 2023-06-22 03:35:55,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.844e+02 3.714e+02 4.784e+02 7.655e+02, threshold=7.428e+02, percent-clipped=1.0 2023-06-22 03:36:00,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1132842.0, ans=0.125 2023-06-22 03:36:10,057 INFO [train.py:996] (1/4) Epoch 7, batch 5850, loss[loss=0.1945, simple_loss=0.2848, pruned_loss=0.05207, over 21414.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3155, pruned_loss=0.07507, over 4278691.07 frames. ], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:36:46,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1132962.0, ans=0.125 2023-06-22 03:37:35,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1133142.0, ans=0.125 2023-06-22 03:37:39,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1133142.0, ans=0.125 2023-06-22 03:37:47,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-22 03:37:49,873 INFO [train.py:996] (1/4) Epoch 7, batch 5900, loss[loss=0.1688, simple_loss=0.2458, pruned_loss=0.04592, over 21864.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3076, pruned_loss=0.06928, over 4276929.69 frames. ], batch size: 102, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:39:08,934 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.538e+02 2.951e+02 3.905e+02 7.879e+02, threshold=5.902e+02, percent-clipped=2.0 2023-06-22 03:39:16,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-22 03:39:23,039 INFO [train.py:996] (1/4) Epoch 7, batch 5950, loss[loss=0.2262, simple_loss=0.2946, pruned_loss=0.07895, over 21688.00 frames. ], tot_loss[loss=0.226, simple_loss=0.307, pruned_loss=0.07256, over 4284311.77 frames. ], batch size: 389, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:40:00,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1133562.0, ans=0.2 2023-06-22 03:40:16,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-22 03:40:16,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.77 vs. limit=10.0 2023-06-22 03:40:30,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1133682.0, ans=0.0 2023-06-22 03:40:34,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1133682.0, ans=0.125 2023-06-22 03:41:00,579 INFO [train.py:996] (1/4) Epoch 7, batch 6000, loss[loss=0.2089, simple_loss=0.3287, pruned_loss=0.04457, over 21247.00 frames. 
], tot_loss[loss=0.2288, simple_loss=0.3051, pruned_loss=0.07626, over 4291451.15 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:41:00,580 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 03:41:21,111 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2587, simple_loss=0.3532, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-22 03:41:21,112 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 03:42:23,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1133982.0, ans=0.0 2023-06-22 03:42:43,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.441e+02 4.226e+02 5.483e+02 1.064e+03, threshold=8.451e+02, percent-clipped=15.0 2023-06-22 03:42:57,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-22 03:43:01,832 INFO [train.py:996] (1/4) Epoch 7, batch 6050, loss[loss=0.1861, simple_loss=0.2697, pruned_loss=0.05124, over 21597.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3012, pruned_loss=0.07612, over 4272104.39 frames. ], batch size: 414, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:43:02,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1134102.0, ans=0.0 2023-06-22 03:43:12,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1134102.0, ans=0.2 2023-06-22 03:44:07,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-22 03:44:08,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1134282.0, ans=0.125 2023-06-22 03:44:24,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1134342.0, ans=0.0 2023-06-22 03:44:25,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1134342.0, ans=0.125 2023-06-22 03:44:32,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134342.0, ans=0.1 2023-06-22 03:44:33,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1134342.0, ans=0.2 2023-06-22 03:44:39,802 INFO [train.py:996] (1/4) Epoch 7, batch 6100, loss[loss=0.2253, simple_loss=0.3056, pruned_loss=0.07245, over 20138.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2995, pruned_loss=0.0745, over 4270524.71 frames. ], batch size: 702, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:45:16,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1134462.0, ans=0.125 2023-06-22 03:45:25,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.61 vs. 
limit=12.0 2023-06-22 03:45:38,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1134582.0, ans=0.0 2023-06-22 03:46:00,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1134642.0, ans=0.0 2023-06-22 03:46:01,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.844e+02 3.267e+02 3.769e+02 7.598e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-22 03:46:12,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1134642.0, ans=0.1 2023-06-22 03:46:24,543 INFO [train.py:996] (1/4) Epoch 7, batch 6150, loss[loss=0.2609, simple_loss=0.334, pruned_loss=0.09393, over 21645.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3024, pruned_loss=0.07765, over 4266877.22 frames. ], batch size: 415, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:46:43,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1134762.0, ans=0.125 2023-06-22 03:47:02,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1134822.0, ans=0.125 2023-06-22 03:47:41,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-06-22 03:47:54,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1134942.0, ans=0.1 2023-06-22 03:48:02,665 INFO [train.py:996] (1/4) Epoch 7, batch 6200, loss[loss=0.262, simple_loss=0.3293, pruned_loss=0.09733, over 21489.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3058, pruned_loss=0.07833, over 4276733.49 frames. ], batch size: 509, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:49:28,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.933e+02 3.461e+02 4.493e+02 7.617e+02, threshold=6.923e+02, percent-clipped=2.0 2023-06-22 03:49:41,014 INFO [train.py:996] (1/4) Epoch 7, batch 6250, loss[loss=0.2446, simple_loss=0.3523, pruned_loss=0.0685, over 21672.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3128, pruned_loss=0.07824, over 4277936.43 frames. ], batch size: 414, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:50:13,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1135362.0, ans=0.125 2023-06-22 03:50:23,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-22 03:50:31,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1135422.0, ans=0.0 2023-06-22 03:51:23,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1135542.0, ans=0.125 2023-06-22 03:51:25,845 INFO [train.py:996] (1/4) Epoch 7, batch 6300, loss[loss=0.2393, simple_loss=0.354, pruned_loss=0.06228, over 21218.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3162, pruned_loss=0.07721, over 4275291.55 frames. 
], batch size: 548, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:52:22,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1135782.0, ans=0.2 2023-06-22 03:52:51,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.09 vs. limit=15.0 2023-06-22 03:52:52,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.056e+02 3.560e+02 4.261e+02 7.497e+02, threshold=7.120e+02, percent-clipped=1.0 2023-06-22 03:53:05,194 INFO [train.py:996] (1/4) Epoch 7, batch 6350, loss[loss=0.2964, simple_loss=0.3622, pruned_loss=0.1153, over 21464.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3192, pruned_loss=0.08188, over 4283975.05 frames. ], batch size: 194, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:53:09,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-22 03:53:25,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1135962.0, ans=0.1 2023-06-22 03:53:31,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-22 03:53:54,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.16 vs. limit=6.0 2023-06-22 03:54:12,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136082.0, ans=0.1 2023-06-22 03:54:45,905 INFO [train.py:996] (1/4) Epoch 7, batch 6400, loss[loss=0.28, simple_loss=0.3472, pruned_loss=0.1064, over 21743.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3246, pruned_loss=0.08696, over 4286620.76 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:55:08,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1136262.0, ans=0.2 2023-06-22 03:55:57,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1136382.0, ans=0.04949747468305833 2023-06-22 03:56:04,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136442.0, ans=0.1 2023-06-22 03:56:10,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.095e+02 3.612e+02 4.103e+02 7.644e+02, threshold=7.224e+02, percent-clipped=1.0 2023-06-22 03:56:21,505 INFO [train.py:996] (1/4) Epoch 7, batch 6450, loss[loss=0.2647, simple_loss=0.3392, pruned_loss=0.09503, over 21455.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3271, pruned_loss=0.08627, over 4289690.82 frames. 
], batch size: 131, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:56:40,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136562.0, ans=0.1 2023-06-22 03:56:42,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1136562.0, ans=0.04949747468305833 2023-06-22 03:57:06,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1136622.0, ans=0.125 2023-06-22 03:57:10,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1136622.0, ans=0.0 2023-06-22 03:57:27,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1136682.0, ans=0.125 2023-06-22 03:57:32,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1136682.0, ans=0.125 2023-06-22 03:57:48,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1136742.0, ans=0.125 2023-06-22 03:58:00,834 INFO [train.py:996] (1/4) Epoch 7, batch 6500, loss[loss=0.2329, simple_loss=0.3138, pruned_loss=0.07603, over 21628.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3215, pruned_loss=0.08462, over 4277721.78 frames. ], batch size: 263, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:59:29,997 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.034e+02 3.497e+02 4.367e+02 8.159e+02, threshold=6.993e+02, percent-clipped=3.0 2023-06-22 03:59:40,158 INFO [train.py:996] (1/4) Epoch 7, batch 6550, loss[loss=0.2252, simple_loss=0.3436, pruned_loss=0.05335, over 21208.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3191, pruned_loss=0.08182, over 4276368.86 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 03:59:41,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-22 04:00:43,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1137222.0, ans=0.0 2023-06-22 04:00:51,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1137282.0, ans=0.0 2023-06-22 04:01:06,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.10 vs. limit=15.0 2023-06-22 04:01:19,401 INFO [train.py:996] (1/4) Epoch 7, batch 6600, loss[loss=0.1928, simple_loss=0.254, pruned_loss=0.06584, over 21260.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3132, pruned_loss=0.08182, over 4283093.92 frames. ], batch size: 548, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:02:25,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. 
limit=15.0 2023-06-22 04:02:42,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1137642.0, ans=0.125 2023-06-22 04:02:49,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.768e+02 3.180e+02 3.748e+02 5.312e+02, threshold=6.360e+02, percent-clipped=0.0 2023-06-22 04:02:59,294 INFO [train.py:996] (1/4) Epoch 7, batch 6650, loss[loss=0.2187, simple_loss=0.2913, pruned_loss=0.07304, over 21649.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3041, pruned_loss=0.0788, over 4274277.82 frames. ], batch size: 391, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:03:10,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-22 04:03:14,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-22 04:04:09,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1137882.0, ans=0.125 2023-06-22 04:04:15,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1137882.0, ans=0.09899494936611666 2023-06-22 04:04:39,344 INFO [train.py:996] (1/4) Epoch 7, batch 6700, loss[loss=0.2349, simple_loss=0.3052, pruned_loss=0.08233, over 21652.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2989, pruned_loss=0.07859, over 4271844.12 frames. ], batch size: 415, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:05:50,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1138182.0, ans=0.0 2023-06-22 04:06:08,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.940e+02 3.406e+02 4.084e+02 6.605e+02, threshold=6.813e+02, percent-clipped=1.0 2023-06-22 04:06:17,957 INFO [train.py:996] (1/4) Epoch 7, batch 6750, loss[loss=0.2131, simple_loss=0.2796, pruned_loss=0.07332, over 21763.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.297, pruned_loss=0.07942, over 4273241.22 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 8.0 2023-06-22 04:06:34,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1138302.0, ans=0.125 2023-06-22 04:07:19,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1138422.0, ans=0.0 2023-06-22 04:07:43,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1138542.0, ans=0.125 2023-06-22 04:07:55,323 INFO [train.py:996] (1/4) Epoch 7, batch 6800, loss[loss=0.221, simple_loss=0.2803, pruned_loss=0.0808, over 21724.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2992, pruned_loss=0.08145, over 4273068.38 frames. 
], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:08:12,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1138602.0, ans=0.125 2023-06-22 04:08:55,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1138722.0, ans=0.125 2023-06-22 04:09:05,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1138782.0, ans=0.2 2023-06-22 04:09:24,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.025e+02 3.541e+02 4.379e+02 6.653e+02, threshold=7.081e+02, percent-clipped=0.0 2023-06-22 04:09:33,730 INFO [train.py:996] (1/4) Epoch 7, batch 6850, loss[loss=0.2214, simple_loss=0.2884, pruned_loss=0.07723, over 21759.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3, pruned_loss=0.08362, over 4274998.82 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:10:31,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139022.0, ans=0.1 2023-06-22 04:10:35,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-22 04:11:04,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139142.0, ans=0.125 2023-06-22 04:11:14,111 INFO [train.py:996] (1/4) Epoch 7, batch 6900, loss[loss=0.2668, simple_loss=0.3915, pruned_loss=0.07105, over 19814.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3028, pruned_loss=0.08374, over 4279381.09 frames. ], batch size: 702, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:12:06,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1139322.0, ans=0.0 2023-06-22 04:12:08,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139322.0, ans=0.125 2023-06-22 04:12:41,065 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.939e+02 3.532e+02 4.220e+02 8.926e+02, threshold=7.064e+02, percent-clipped=5.0 2023-06-22 04:12:43,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139442.0, ans=0.1 2023-06-22 04:12:55,508 INFO [train.py:996] (1/4) Epoch 7, batch 6950, loss[loss=0.2491, simple_loss=0.323, pruned_loss=0.08759, over 21718.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3044, pruned_loss=0.08145, over 4279367.09 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:13:20,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-22 04:13:27,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-22 04:14:00,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1139682.0, ans=0.125 2023-06-22 04:14:35,507 INFO [train.py:996] (1/4) Epoch 7, batch 7000, loss[loss=0.2977, simple_loss=0.3341, pruned_loss=0.1307, over 21310.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3069, pruned_loss=0.08368, over 4283640.23 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:14:55,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1139802.0, ans=0.125 2023-06-22 04:15:21,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1139922.0, ans=0.2 2023-06-22 04:15:29,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1139922.0, ans=0.125 2023-06-22 04:15:47,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1139982.0, ans=0.125 2023-06-22 04:16:01,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 3.073e+02 3.537e+02 4.508e+02 8.250e+02, threshold=7.073e+02, percent-clipped=4.0 2023-06-22 04:16:10,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-22 04:16:10,902 INFO [train.py:996] (1/4) Epoch 7, batch 7050, loss[loss=0.1934, simple_loss=0.276, pruned_loss=0.05544, over 21694.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3063, pruned_loss=0.08312, over 4277347.40 frames. ], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:16:37,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1140162.0, ans=0.1 2023-06-22 04:16:38,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-22 04:16:57,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1140222.0, ans=0.0 2023-06-22 04:17:52,912 INFO [train.py:996] (1/4) Epoch 7, batch 7100, loss[loss=0.2736, simple_loss=0.3422, pruned_loss=0.1025, over 21691.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3099, pruned_loss=0.08398, over 4280525.24 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:18:30,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1140522.0, ans=0.0 2023-06-22 04:18:30,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1140522.0, ans=0.125 2023-06-22 04:18:36,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1140522.0, ans=0.0 2023-06-22 04:18:36,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1140522.0, ans=0.125 2023-06-22 04:18:56,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1140582.0, ans=0.0 2023-06-22 04:19:25,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.884e+02 3.183e+02 3.903e+02 6.649e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-22 04:19:34,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.30 vs. 
limit=22.5 2023-06-22 04:19:34,827 INFO [train.py:996] (1/4) Epoch 7, batch 7150, loss[loss=0.223, simple_loss=0.3086, pruned_loss=0.06873, over 21341.00 frames. ], tot_loss[loss=0.234, simple_loss=0.307, pruned_loss=0.08052, over 4278602.06 frames. ], batch size: 549, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:19:43,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1140702.0, ans=0.05 2023-06-22 04:19:44,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1140702.0, ans=0.125 2023-06-22 04:20:14,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1140822.0, ans=15.0 2023-06-22 04:21:14,911 INFO [train.py:996] (1/4) Epoch 7, batch 7200, loss[loss=0.2152, simple_loss=0.3181, pruned_loss=0.05615, over 20925.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3096, pruned_loss=0.08324, over 4277991.85 frames. ], batch size: 607, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:21:20,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 04:21:37,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141062.0, ans=0.125 2023-06-22 04:22:15,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1141182.0, ans=0.0 2023-06-22 04:22:16,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-22 04:22:30,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1141182.0, ans=0.2 2023-06-22 04:22:44,338 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.920e+02 3.463e+02 4.119e+02 7.524e+02, threshold=6.925e+02, percent-clipped=3.0 2023-06-22 04:22:45,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-06-22 04:22:53,911 INFO [train.py:996] (1/4) Epoch 7, batch 7250, loss[loss=0.205, simple_loss=0.2689, pruned_loss=0.07055, over 21882.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3065, pruned_loss=0.08332, over 4274589.88 frames. ], batch size: 373, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:23:30,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1141422.0, ans=0.0 2023-06-22 04:23:56,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1141482.0, ans=0.125 2023-06-22 04:24:17,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1141542.0, ans=0.125 2023-06-22 04:24:25,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1141542.0, ans=0.0 2023-06-22 04:24:33,105 INFO [train.py:996] (1/4) Epoch 7, batch 7300, loss[loss=0.2115, simple_loss=0.273, pruned_loss=0.07503, over 21655.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3009, pruned_loss=0.08205, over 4267821.51 frames. 
], batch size: 333, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:26:04,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.850e+02 3.371e+02 4.063e+02 7.936e+02, threshold=6.743e+02, percent-clipped=2.0 2023-06-22 04:26:10,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1141842.0, ans=0.125 2023-06-22 04:26:13,071 INFO [train.py:996] (1/4) Epoch 7, batch 7350, loss[loss=0.2823, simple_loss=0.3526, pruned_loss=0.106, over 21456.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.298, pruned_loss=0.08231, over 4260904.50 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:26:26,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1141902.0, ans=0.2 2023-06-22 04:27:07,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1142022.0, ans=0.125 2023-06-22 04:27:24,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1142082.0, ans=0.0 2023-06-22 04:27:27,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1142082.0, ans=0.05 2023-06-22 04:27:40,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1142142.0, ans=0.125 2023-06-22 04:27:49,481 INFO [train.py:996] (1/4) Epoch 7, batch 7400, loss[loss=0.2996, simple_loss=0.3822, pruned_loss=0.1085, over 21477.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3022, pruned_loss=0.0843, over 4257881.36 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:28:09,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1142262.0, ans=0.125 2023-06-22 04:29:00,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-22 04:29:21,612 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.129e+02 3.591e+02 4.476e+02 8.193e+02, threshold=7.182e+02, percent-clipped=3.0 2023-06-22 04:29:26,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1142442.0, ans=0.0 2023-06-22 04:29:29,616 INFO [train.py:996] (1/4) Epoch 7, batch 7450, loss[loss=0.2428, simple_loss=0.2993, pruned_loss=0.09314, over 21368.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3018, pruned_loss=0.08368, over 4251498.97 frames. ], batch size: 473, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:29:36,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1142502.0, ans=0.2 2023-06-22 04:30:01,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1142562.0, ans=0.2 2023-06-22 04:31:10,685 INFO [train.py:996] (1/4) Epoch 7, batch 7500, loss[loss=0.2376, simple_loss=0.3294, pruned_loss=0.07292, over 21432.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3075, pruned_loss=0.08623, over 4256092.46 frames. 
], batch size: 194, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:31:34,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1142862.0, ans=0.125 2023-06-22 04:31:35,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1142862.0, ans=0.125 2023-06-22 04:31:43,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-22 04:32:43,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 3.393e+02 4.362e+02 5.679e+02 1.317e+03, threshold=8.723e+02, percent-clipped=9.0 2023-06-22 04:32:51,289 INFO [train.py:996] (1/4) Epoch 7, batch 7550, loss[loss=0.2629, simple_loss=0.3596, pruned_loss=0.08312, over 21661.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3158, pruned_loss=0.0856, over 4257413.49 frames. ], batch size: 414, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:32:58,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-22 04:33:10,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=12.0 2023-06-22 04:33:38,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1143162.0, ans=0.125 2023-06-22 04:33:46,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.35 vs. limit=10.0 2023-06-22 04:34:24,273 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:34:29,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1143402.0, ans=0.0 2023-06-22 04:34:30,116 INFO [train.py:996] (1/4) Epoch 7, batch 7600, loss[loss=0.2459, simple_loss=0.3092, pruned_loss=0.09131, over 21350.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.314, pruned_loss=0.08348, over 4258335.49 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:34:34,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1143402.0, ans=0.2 2023-06-22 04:34:42,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1143402.0, ans=0.125 2023-06-22 04:35:34,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1143582.0, ans=0.1 2023-06-22 04:35:56,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.783e+02 3.344e+02 4.108e+02 6.348e+02, threshold=6.687e+02, percent-clipped=0.0 2023-06-22 04:35:56,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1143642.0, ans=0.0 2023-06-22 04:36:04,735 INFO [train.py:996] (1/4) Epoch 7, batch 7650, loss[loss=0.3148, simple_loss=0.3499, pruned_loss=0.1399, over 21779.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3144, pruned_loss=0.08622, over 4276269.92 frames. 
], batch size: 508, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:37:16,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1143882.0, ans=0.125 2023-06-22 04:37:24,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1143882.0, ans=0.0 2023-06-22 04:37:37,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1143942.0, ans=0.0 2023-06-22 04:37:40,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1143942.0, ans=0.125 2023-06-22 04:37:44,858 INFO [train.py:996] (1/4) Epoch 7, batch 7700, loss[loss=0.2877, simple_loss=0.4144, pruned_loss=0.08048, over 19779.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3183, pruned_loss=0.08904, over 4281815.05 frames. ], batch size: 702, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:38:23,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1144062.0, ans=0.125 2023-06-22 04:38:25,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1144062.0, ans=0.125 2023-06-22 04:39:20,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 3.101e+02 3.626e+02 4.244e+02 7.117e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-22 04:39:33,832 INFO [train.py:996] (1/4) Epoch 7, batch 7750, loss[loss=0.2124, simple_loss=0.2829, pruned_loss=0.07093, over 21910.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3219, pruned_loss=0.08841, over 4282590.86 frames. ], batch size: 98, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:40:00,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1144362.0, ans=0.125 2023-06-22 04:40:05,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1144362.0, ans=0.125 2023-06-22 04:40:32,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1144482.0, ans=0.125 2023-06-22 04:40:47,831 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.588e-03 2023-06-22 04:40:58,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1144542.0, ans=0.125 2023-06-22 04:41:10,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1144542.0, ans=0.0 2023-06-22 04:41:20,044 INFO [train.py:996] (1/4) Epoch 7, batch 7800, loss[loss=0.2514, simple_loss=0.3328, pruned_loss=0.08498, over 21685.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3234, pruned_loss=0.0886, over 4266431.32 frames. ], batch size: 414, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:42:37,948 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.647e+02 3.499e+02 4.174e+02 5.699e+02 9.171e+02, threshold=8.349e+02, percent-clipped=6.0 2023-06-22 04:42:49,108 INFO [train.py:996] (1/4) Epoch 7, batch 7850, loss[loss=0.2383, simple_loss=0.2842, pruned_loss=0.09616, over 21378.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.317, pruned_loss=0.08805, over 4273958.31 frames. 
], batch size: 509, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:43:30,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1145022.0, ans=0.125 2023-06-22 04:44:41,146 INFO [train.py:996] (1/4) Epoch 7, batch 7900, loss[loss=0.1592, simple_loss=0.2025, pruned_loss=0.05792, over 16119.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3099, pruned_loss=0.08671, over 4262558.02 frames. ], batch size: 60, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:45:17,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145322.0, ans=0.125 2023-06-22 04:46:17,646 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.136e+02 3.570e+02 4.500e+02 9.857e+02, threshold=7.139e+02, percent-clipped=1.0 2023-06-22 04:46:23,932 INFO [train.py:996] (1/4) Epoch 7, batch 7950, loss[loss=0.2427, simple_loss=0.3143, pruned_loss=0.0856, over 20792.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3145, pruned_loss=0.08568, over 4257029.28 frames. ], batch size: 611, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:46:58,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1145622.0, ans=0.0 2023-06-22 04:47:07,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-22 04:47:51,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1145742.0, ans=0.0 2023-06-22 04:47:53,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1145742.0, ans=0.125 2023-06-22 04:48:05,928 INFO [train.py:996] (1/4) Epoch 7, batch 8000, loss[loss=0.262, simple_loss=0.3511, pruned_loss=0.08652, over 21739.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3197, pruned_loss=0.08787, over 4262283.61 frames. ], batch size: 351, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:48:06,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1145802.0, ans=0.0 2023-06-22 04:48:13,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1145802.0, ans=0.0 2023-06-22 04:48:30,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1145862.0, ans=0.0 2023-06-22 04:48:36,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.77 vs. limit=15.0 2023-06-22 04:49:43,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.352e+02 3.873e+02 5.175e+02 9.395e+02, threshold=7.746e+02, percent-clipped=4.0 2023-06-22 04:49:47,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146042.0, ans=0.125 2023-06-22 04:49:50,531 INFO [train.py:996] (1/4) Epoch 7, batch 8050, loss[loss=0.2495, simple_loss=0.3404, pruned_loss=0.07929, over 21742.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3214, pruned_loss=0.08766, over 4258038.99 frames. 
], batch size: 351, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:49:52,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1146102.0, ans=0.125 2023-06-22 04:50:02,134 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:50:44,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1146222.0, ans=0.125 2023-06-22 04:51:05,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146282.0, ans=0.125 2023-06-22 04:51:31,152 INFO [train.py:996] (1/4) Epoch 7, batch 8100, loss[loss=0.2684, simple_loss=0.3268, pruned_loss=0.105, over 21310.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3232, pruned_loss=0.08877, over 4260066.61 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:51:37,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1146402.0, ans=0.125 2023-06-22 04:51:45,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-22 04:51:52,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1146402.0, ans=0.04949747468305833 2023-06-22 04:51:52,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1146402.0, ans=0.125 2023-06-22 04:51:55,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-22 04:52:05,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1146462.0, ans=0.0 2023-06-22 04:52:20,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1146462.0, ans=0.125 2023-06-22 04:52:23,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146522.0, ans=0.1 2023-06-22 04:52:32,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-22 04:52:33,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1146522.0, ans=0.0 2023-06-22 04:52:36,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1146522.0, ans=0.125 2023-06-22 04:53:20,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.325e+02 3.912e+02 5.287e+02 8.623e+02, threshold=7.823e+02, percent-clipped=4.0 2023-06-22 04:53:29,626 INFO [train.py:996] (1/4) Epoch 7, batch 8150, loss[loss=0.2176, simple_loss=0.3192, pruned_loss=0.05799, over 20071.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3311, pruned_loss=0.09125, over 4261689.45 frames. 
], batch size: 703, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:53:42,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1146702.0, ans=0.2 2023-06-22 04:53:57,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1146762.0, ans=0.0 2023-06-22 04:54:11,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1146822.0, ans=0.1 2023-06-22 04:54:22,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1146882.0, ans=0.125 2023-06-22 04:55:08,823 INFO [train.py:996] (1/4) Epoch 7, batch 8200, loss[loss=0.1861, simple_loss=0.2424, pruned_loss=0.06494, over 21221.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3231, pruned_loss=0.08813, over 4269974.80 frames. ], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:55:15,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1147002.0, ans=0.125 2023-06-22 04:56:38,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.942e+02 3.673e+02 4.818e+02 8.671e+02, threshold=7.346e+02, percent-clipped=2.0 2023-06-22 04:56:48,468 INFO [train.py:996] (1/4) Epoch 7, batch 8250, loss[loss=0.2161, simple_loss=0.2983, pruned_loss=0.06689, over 21287.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3227, pruned_loss=0.08864, over 4266965.32 frames. ], batch size: 159, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:57:23,186 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:57:26,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1147422.0, ans=0.125 2023-06-22 04:57:37,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1147422.0, ans=0.125 2023-06-22 04:58:13,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1147542.0, ans=0.0 2023-06-22 04:58:23,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1147542.0, ans=0.04949747468305833 2023-06-22 04:58:24,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1147542.0, ans=0.0 2023-06-22 04:58:28,882 INFO [train.py:996] (1/4) Epoch 7, batch 8300, loss[loss=0.2172, simple_loss=0.2965, pruned_loss=0.06896, over 21350.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3219, pruned_loss=0.08616, over 4263445.82 frames. ], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:58:43,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.18 vs. 
limit=15.0 2023-06-22 04:59:07,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1147722.0, ans=0.125 2023-06-22 04:59:11,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1147722.0, ans=0.0 2023-06-22 05:00:04,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.891e+02 3.462e+02 4.321e+02 7.253e+02, threshold=6.923e+02, percent-clipped=0.0 2023-06-22 05:00:14,286 INFO [train.py:996] (1/4) Epoch 7, batch 8350, loss[loss=0.2071, simple_loss=0.2802, pruned_loss=0.067, over 21774.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3198, pruned_loss=0.08383, over 4258511.66 frames. ], batch size: 112, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 05:00:40,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1147962.0, ans=0.125 2023-06-22 05:01:03,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-22 05:01:17,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1148082.0, ans=0.1 2023-06-22 05:01:38,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1148142.0, ans=0.2 2023-06-22 05:01:49,568 INFO [train.py:996] (1/4) Epoch 7, batch 8400, loss[loss=0.1687, simple_loss=0.2442, pruned_loss=0.04661, over 21204.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3165, pruned_loss=0.0807, over 4263585.98 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:01:54,863 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:02:01,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-22 05:02:07,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148262.0, ans=0.1 2023-06-22 05:02:29,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1148322.0, ans=0.125 2023-06-22 05:03:23,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.841e+02 3.519e+02 4.158e+02 9.923e+02, threshold=7.039e+02, percent-clipped=4.0 2023-06-22 05:03:28,548 INFO [train.py:996] (1/4) Epoch 7, batch 8450, loss[loss=0.2786, simple_loss=0.3356, pruned_loss=0.1108, over 21236.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3138, pruned_loss=0.07999, over 4268775.74 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:04:14,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1148682.0, ans=0.0 2023-06-22 05:04:38,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1148682.0, ans=0.0 2023-06-22 05:05:07,108 INFO [train.py:996] (1/4) Epoch 7, batch 8500, loss[loss=0.227, simple_loss=0.2688, pruned_loss=0.09255, over 20073.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3107, pruned_loss=0.08184, over 4268612.45 frames. 
], batch size: 704, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:05:26,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1148862.0, ans=0.125 2023-06-22 05:05:36,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1148862.0, ans=0.125 2023-06-22 05:05:38,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1148922.0, ans=0.125 2023-06-22 05:05:57,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1148982.0, ans=0.2 2023-06-22 05:06:43,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.101e+02 3.750e+02 4.685e+02 7.391e+02, threshold=7.500e+02, percent-clipped=2.0 2023-06-22 05:06:46,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1149102.0, ans=0.2 2023-06-22 05:06:47,881 INFO [train.py:996] (1/4) Epoch 7, batch 8550, loss[loss=0.2447, simple_loss=0.3354, pruned_loss=0.07701, over 21708.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3146, pruned_loss=0.08438, over 4274448.97 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:07:09,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149162.0, ans=0.125 2023-06-22 05:08:29,796 INFO [train.py:996] (1/4) Epoch 7, batch 8600, loss[loss=0.2492, simple_loss=0.3239, pruned_loss=0.08725, over 21410.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3203, pruned_loss=0.0858, over 4272347.09 frames. ], batch size: 211, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:08:40,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1149402.0, ans=0.0 2023-06-22 05:08:51,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1149462.0, ans=0.125 2023-06-22 05:09:18,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-22 05:09:18,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-22 05:09:26,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-22 05:09:38,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1149582.0, ans=0.05 2023-06-22 05:09:59,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1149642.0, ans=0.125 2023-06-22 05:10:06,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.158e+02 3.954e+02 4.821e+02 7.985e+02, threshold=7.909e+02, percent-clipped=1.0 2023-06-22 05:10:11,808 INFO [train.py:996] (1/4) Epoch 7, batch 8650, loss[loss=0.2256, simple_loss=0.2939, pruned_loss=0.0787, over 21087.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3273, pruned_loss=0.08691, over 4279185.93 frames. 
], batch size: 607, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:10:25,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1149702.0, ans=0.125 2023-06-22 05:10:44,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149762.0, ans=0.1 2023-06-22 05:11:21,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1149882.0, ans=0.0 2023-06-22 05:11:43,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1149942.0, ans=0.125 2023-06-22 05:11:46,162 INFO [train.py:996] (1/4) Epoch 7, batch 8700, loss[loss=0.196, simple_loss=0.2611, pruned_loss=0.06548, over 21453.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3221, pruned_loss=0.08406, over 4271247.82 frames. ], batch size: 131, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:11:50,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-22 05:12:10,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-22 05:12:53,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-22 05:13:04,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1150182.0, ans=0.125 2023-06-22 05:13:21,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.826e+02 3.648e+02 4.622e+02 7.671e+02, threshold=7.296e+02, percent-clipped=0.0 2023-06-22 05:13:24,961 INFO [train.py:996] (1/4) Epoch 7, batch 8750, loss[loss=0.2436, simple_loss=0.2986, pruned_loss=0.09428, over 21362.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3192, pruned_loss=0.08466, over 4273834.51 frames. ], batch size: 159, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:13:27,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1150302.0, ans=0.5 2023-06-22 05:13:28,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1150302.0, ans=0.125 2023-06-22 05:13:35,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1150302.0, ans=0.2 2023-06-22 05:13:46,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1150362.0, ans=0.125 2023-06-22 05:14:27,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1150482.0, ans=0.125 2023-06-22 05:15:05,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1150602.0, ans=0.0 2023-06-22 05:15:06,523 INFO [train.py:996] (1/4) Epoch 7, batch 8800, loss[loss=0.2514, simple_loss=0.3347, pruned_loss=0.08407, over 21567.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3256, pruned_loss=0.0868, over 4269868.46 frames. 
], batch size: 230, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:15:10,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1150602.0, ans=0.125 2023-06-22 05:15:29,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1150662.0, ans=0.2 2023-06-22 05:16:33,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1150842.0, ans=0.0 2023-06-22 05:16:41,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1150842.0, ans=0.09899494936611666 2023-06-22 05:16:45,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.569e+02 4.667e+02 6.070e+02 1.023e+03, threshold=9.335e+02, percent-clipped=11.0 2023-06-22 05:16:47,138 INFO [train.py:996] (1/4) Epoch 7, batch 8850, loss[loss=0.2606, simple_loss=0.3481, pruned_loss=0.0865, over 16045.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3301, pruned_loss=0.08775, over 4263524.38 frames. ], batch size: 61, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:17:02,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1150902.0, ans=0.125 2023-06-22 05:17:17,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-22 05:17:18,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1150962.0, ans=0.5 2023-06-22 05:17:46,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1151022.0, ans=0.035 2023-06-22 05:18:07,424 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-22 05:18:33,272 INFO [train.py:996] (1/4) Epoch 7, batch 8900, loss[loss=0.2416, simple_loss=0.323, pruned_loss=0.08013, over 21852.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3269, pruned_loss=0.08755, over 4263418.86 frames. ], batch size: 372, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:18:47,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1151202.0, ans=0.125 2023-06-22 05:19:37,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151382.0, ans=0.0 2023-06-22 05:19:42,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1151382.0, ans=0.2 2023-06-22 05:20:15,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1151442.0, ans=0.125 2023-06-22 05:20:18,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1151502.0, ans=0.125 2023-06-22 05:20:19,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.272e+02 4.158e+02 4.869e+02 9.581e+02, threshold=8.315e+02, percent-clipped=1.0 2023-06-22 05:20:19,308 INFO [train.py:996] (1/4) Epoch 7, batch 8950, loss[loss=0.3003, simple_loss=0.4249, pruned_loss=0.08783, over 19769.00 frames. 
], tot_loss[loss=0.2516, simple_loss=0.3293, pruned_loss=0.08696, over 4268561.36 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:20:48,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-22 05:20:49,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1151562.0, ans=0.1 2023-06-22 05:21:00,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1151622.0, ans=0.1 2023-06-22 05:21:09,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1151622.0, ans=0.07 2023-06-22 05:21:16,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1151682.0, ans=0.125 2023-06-22 05:21:41,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1151742.0, ans=0.07 2023-06-22 05:21:58,752 INFO [train.py:996] (1/4) Epoch 7, batch 9000, loss[loss=0.2219, simple_loss=0.2801, pruned_loss=0.08185, over 21728.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3239, pruned_loss=0.08664, over 4270638.55 frames. ], batch size: 300, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:21:58,752 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 05:22:20,470 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2667, simple_loss=0.3612, pruned_loss=0.08614, over 1796401.00 frames. 2023-06-22 05:22:20,471 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 05:22:24,179 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:23:13,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-22 05:23:14,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1151982.0, ans=0.035 2023-06-22 05:23:57,245 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.838e+02 3.460e+02 4.383e+02 1.064e+03, threshold=6.920e+02, percent-clipped=1.0 2023-06-22 05:23:57,275 INFO [train.py:996] (1/4) Epoch 7, batch 9050, loss[loss=0.2211, simple_loss=0.3027, pruned_loss=0.06972, over 21747.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3203, pruned_loss=0.08443, over 4269507.86 frames. 
], batch size: 298, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:24:01,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1152102.0, ans=0.05 2023-06-22 05:24:24,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1152162.0, ans=0.125 2023-06-22 05:24:25,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1152162.0, ans=0.125 2023-06-22 05:24:34,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1152222.0, ans=0.125 2023-06-22 05:24:58,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152282.0, ans=0.1 2023-06-22 05:25:04,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1152282.0, ans=0.125 2023-06-22 05:25:38,181 INFO [train.py:996] (1/4) Epoch 7, batch 9100, loss[loss=0.2516, simple_loss=0.3454, pruned_loss=0.07893, over 21673.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3248, pruned_loss=0.08676, over 4269082.15 frames. ], batch size: 414, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:26:08,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1152462.0, ans=0.5 2023-06-22 05:26:41,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1152582.0, ans=0.0 2023-06-22 05:26:44,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152582.0, ans=0.1 2023-06-22 05:27:18,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.226e+02 3.929e+02 4.790e+02 9.193e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-22 05:27:18,345 INFO [train.py:996] (1/4) Epoch 7, batch 9150, loss[loss=0.2281, simple_loss=0.3392, pruned_loss=0.05843, over 21208.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3269, pruned_loss=0.08425, over 4271140.11 frames. ], batch size: 548, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:27:36,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152762.0, ans=0.1 2023-06-22 05:28:12,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152822.0, ans=0.1 2023-06-22 05:28:57,922 INFO [train.py:996] (1/4) Epoch 7, batch 9200, loss[loss=0.2632, simple_loss=0.3498, pruned_loss=0.08832, over 21641.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3284, pruned_loss=0.08265, over 4267481.35 frames. ], batch size: 414, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:29:14,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. 
limit=15.0 2023-06-22 05:29:23,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1153062.0, ans=0.125 2023-06-22 05:29:28,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1153062.0, ans=0.125 2023-06-22 05:29:37,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1153062.0, ans=0.125 2023-06-22 05:30:02,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1153182.0, ans=15.0 2023-06-22 05:30:14,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153182.0, ans=0.1 2023-06-22 05:30:15,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1153182.0, ans=0.2 2023-06-22 05:30:17,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1153182.0, ans=0.0 2023-06-22 05:30:26,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1153242.0, ans=0.0 2023-06-22 05:30:34,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1153242.0, ans=0.125 2023-06-22 05:30:38,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.168e+02 3.950e+02 4.666e+02 8.453e+02, threshold=7.900e+02, percent-clipped=2.0 2023-06-22 05:30:38,565 INFO [train.py:996] (1/4) Epoch 7, batch 9250, loss[loss=0.2701, simple_loss=0.3386, pruned_loss=0.1008, over 21679.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3299, pruned_loss=0.08542, over 4266201.77 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:31:09,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-22 05:31:14,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1153362.0, ans=0.0 2023-06-22 05:32:02,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=12.0 2023-06-22 05:32:24,623 INFO [train.py:996] (1/4) Epoch 7, batch 9300, loss[loss=0.2058, simple_loss=0.269, pruned_loss=0.07128, over 21564.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3225, pruned_loss=0.08504, over 4269815.74 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:33:31,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153782.0, ans=0.1 2023-06-22 05:33:42,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1153782.0, ans=0.0 2023-06-22 05:34:11,596 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.246e+02 3.753e+02 4.890e+02 7.964e+02, threshold=7.506e+02, percent-clipped=1.0 2023-06-22 05:34:11,626 INFO [train.py:996] (1/4) Epoch 7, batch 9350, loss[loss=0.2583, simple_loss=0.3386, pruned_loss=0.08896, over 21735.00 frames. 
], tot_loss[loss=0.2513, simple_loss=0.329, pruned_loss=0.08676, over 4271133.38 frames. ], batch size: 298, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:34:28,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1153962.0, ans=0.1 2023-06-22 05:35:52,130 INFO [train.py:996] (1/4) Epoch 7, batch 9400, loss[loss=0.2245, simple_loss=0.2854, pruned_loss=0.08181, over 21304.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3308, pruned_loss=0.08732, over 4276989.29 frames. ], batch size: 549, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:35:57,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1154202.0, ans=0.0 2023-06-22 05:36:01,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1154202.0, ans=0.0 2023-06-22 05:36:18,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1154262.0, ans=0.125 2023-06-22 05:36:46,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1154322.0, ans=0.125 2023-06-22 05:37:25,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1154442.0, ans=0.035 2023-06-22 05:37:29,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1154442.0, ans=0.125 2023-06-22 05:37:32,770 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.206e+02 3.638e+02 4.538e+02 8.694e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 05:37:32,803 INFO [train.py:996] (1/4) Epoch 7, batch 9450, loss[loss=0.2138, simple_loss=0.2812, pruned_loss=0.07321, over 21770.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3232, pruned_loss=0.08706, over 4261395.70 frames. ], batch size: 124, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:37:38,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1154502.0, ans=0.125 2023-06-22 05:37:57,147 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:38:21,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1154622.0, ans=0.125 2023-06-22 05:38:26,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1154622.0, ans=0.125 2023-06-22 05:38:37,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1154682.0, ans=0.0 2023-06-22 05:38:54,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1154742.0, ans=0.125 2023-06-22 05:39:09,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1154742.0, ans=0.0 2023-06-22 05:39:11,635 INFO [train.py:996] (1/4) Epoch 7, batch 9500, loss[loss=0.2432, simple_loss=0.3101, pruned_loss=0.08814, over 21823.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3149, pruned_loss=0.08496, over 4259473.24 frames. 
], batch size: 118, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:39:23,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1154802.0, ans=0.125 2023-06-22 05:39:34,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1154862.0, ans=0.2 2023-06-22 05:39:54,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1154922.0, ans=0.125 2023-06-22 05:40:12,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-22 05:40:52,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.257e+02 3.921e+02 4.943e+02 1.018e+03, threshold=7.842e+02, percent-clipped=7.0 2023-06-22 05:40:52,045 INFO [train.py:996] (1/4) Epoch 7, batch 9550, loss[loss=0.2557, simple_loss=0.3499, pruned_loss=0.08068, over 21811.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3188, pruned_loss=0.08701, over 4263337.96 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:41:25,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1155162.0, ans=0.2 2023-06-22 05:41:33,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1155222.0, ans=0.125 2023-06-22 05:42:23,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-22 05:42:26,566 INFO [train.py:996] (1/4) Epoch 7, batch 9600, loss[loss=0.2033, simple_loss=0.2763, pruned_loss=0.06511, over 21293.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3211, pruned_loss=0.08815, over 4267639.77 frames. ], batch size: 176, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:42:36,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-22 05:43:59,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-22 05:44:06,683 INFO [train.py:996] (1/4) Epoch 7, batch 9650, loss[loss=0.2798, simple_loss=0.3472, pruned_loss=0.1062, over 21692.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3214, pruned_loss=0.08927, over 4268001.63 frames. ], batch size: 351, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:44:08,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.176e+02 3.740e+02 4.596e+02 7.915e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-22 05:44:57,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-22 05:45:39,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1155942.0, ans=0.0 2023-06-22 05:45:51,588 INFO [train.py:996] (1/4) Epoch 7, batch 9700, loss[loss=0.2434, simple_loss=0.3129, pruned_loss=0.087, over 21279.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3257, pruned_loss=0.08945, over 4270913.25 frames. 
], batch size: 143, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:46:03,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1156002.0, ans=0.2 2023-06-22 05:46:24,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156062.0, ans=0.1 2023-06-22 05:46:37,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1156122.0, ans=0.125 2023-06-22 05:46:43,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1156122.0, ans=0.125 2023-06-22 05:47:07,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1156182.0, ans=0.125 2023-06-22 05:47:10,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1156242.0, ans=0.125 2023-06-22 05:47:13,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1156242.0, ans=0.125 2023-06-22 05:47:19,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1156242.0, ans=0.0 2023-06-22 05:47:21,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-22 05:47:35,191 INFO [train.py:996] (1/4) Epoch 7, batch 9750, loss[loss=0.2923, simple_loss=0.3727, pruned_loss=0.106, over 21449.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3201, pruned_loss=0.08799, over 4258674.67 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:47:36,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.485e+02 3.111e+02 3.618e+02 4.143e+02 7.836e+02, threshold=7.236e+02, percent-clipped=1.0 2023-06-22 05:47:57,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1156362.0, ans=0.2 2023-06-22 05:48:12,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1156422.0, ans=0.1 2023-06-22 05:48:48,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156542.0, ans=0.1 2023-06-22 05:48:51,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1156542.0, ans=0.0 2023-06-22 05:49:08,231 INFO [train.py:996] (1/4) Epoch 7, batch 9800, loss[loss=0.2603, simple_loss=0.3254, pruned_loss=0.09762, over 21692.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3194, pruned_loss=0.08763, over 4262617.69 frames. ], batch size: 389, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:00,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156722.0, ans=0.1 2023-06-22 05:50:11,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. 
limit=15.0 2023-06-22 05:50:37,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1156842.0, ans=0.0 2023-06-22 05:50:42,016 INFO [train.py:996] (1/4) Epoch 7, batch 9850, loss[loss=0.1923, simple_loss=0.2404, pruned_loss=0.07212, over 20074.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3163, pruned_loss=0.08744, over 4260020.29 frames. ], batch size: 703, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:43,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.145e+02 3.713e+02 4.993e+02 9.640e+02, threshold=7.425e+02, percent-clipped=7.0 2023-06-22 05:50:50,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1156902.0, ans=0.125 2023-06-22 05:51:03,364 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:51:31,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1157022.0, ans=0.0 2023-06-22 05:51:56,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1157082.0, ans=0.2 2023-06-22 05:52:21,250 INFO [train.py:996] (1/4) Epoch 7, batch 9900, loss[loss=0.2161, simple_loss=0.2874, pruned_loss=0.07236, over 21369.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3126, pruned_loss=0.08691, over 4256841.14 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:53:01,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1157262.0, ans=0.125 2023-06-22 05:53:06,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-22 05:53:23,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1157382.0, ans=0.2 2023-06-22 05:53:23,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-22 05:53:28,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1157382.0, ans=0.0 2023-06-22 05:54:06,761 INFO [train.py:996] (1/4) Epoch 7, batch 9950, loss[loss=0.3308, simple_loss=0.4317, pruned_loss=0.1149, over 19783.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3137, pruned_loss=0.08883, over 4258998.59 frames. ], batch size: 702, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:54:08,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.677e+02 3.173e+02 3.693e+02 4.396e+02 6.940e+02, threshold=7.386e+02, percent-clipped=0.0 2023-06-22 05:54:57,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1157622.0, ans=0.1 2023-06-22 05:54:58,191 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.78 vs. 
limit=15.0 2023-06-22 05:55:13,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1157682.0, ans=0.0 2023-06-22 05:55:23,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157682.0, ans=0.1 2023-06-22 05:55:54,217 INFO [train.py:996] (1/4) Epoch 7, batch 10000, loss[loss=0.1999, simple_loss=0.2802, pruned_loss=0.05977, over 21668.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3096, pruned_loss=0.08745, over 4255859.34 frames. ], batch size: 391, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:56:02,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-22 05:56:11,061 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:56:36,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-22 05:57:35,228 INFO [train.py:996] (1/4) Epoch 7, batch 10050, loss[loss=0.2176, simple_loss=0.2789, pruned_loss=0.07813, over 21203.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3117, pruned_loss=0.08805, over 4260940.22 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:57:36,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.023e+02 3.455e+02 4.258e+02 6.801e+02, threshold=6.910e+02, percent-clipped=0.0 2023-06-22 05:58:10,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-22 05:58:36,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1158282.0, ans=0.125 2023-06-22 05:58:51,964 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:59:15,545 INFO [train.py:996] (1/4) Epoch 7, batch 10100, loss[loss=0.1884, simple_loss=0.2731, pruned_loss=0.05188, over 21003.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3109, pruned_loss=0.08659, over 4252105.04 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:59:22,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1158402.0, ans=0.125 2023-06-22 05:59:24,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1158402.0, ans=0.2 2023-06-22 05:59:44,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1158462.0, ans=0.04949747468305833 2023-06-22 05:59:50,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1158462.0, ans=0.1 2023-06-22 06:00:55,656 INFO [train.py:996] (1/4) Epoch 7, batch 10150, loss[loss=0.2498, simple_loss=0.3288, pruned_loss=0.08542, over 21814.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3182, pruned_loss=0.0894, over 4256067.65 frames. 
], batch size: 316, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:00:58,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.244e+02 3.861e+02 4.882e+02 7.298e+02, threshold=7.722e+02, percent-clipped=2.0 2023-06-22 06:01:16,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1158762.0, ans=0.125 2023-06-22 06:01:21,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1158762.0, ans=0.2 2023-06-22 06:01:43,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1158822.0, ans=0.1 2023-06-22 06:02:35,520 INFO [train.py:996] (1/4) Epoch 7, batch 10200, loss[loss=0.2529, simple_loss=0.3376, pruned_loss=0.08411, over 21182.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3172, pruned_loss=0.0873, over 4256164.94 frames. ], batch size: 548, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:02:57,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1159062.0, ans=0.125 2023-06-22 06:03:02,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1159062.0, ans=0.0 2023-06-22 06:03:06,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1159062.0, ans=0.04949747468305833 2023-06-22 06:03:47,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1159182.0, ans=0.125 2023-06-22 06:04:14,098 INFO [train.py:996] (1/4) Epoch 7, batch 10250, loss[loss=0.2809, simple_loss=0.3518, pruned_loss=0.105, over 21406.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.31, pruned_loss=0.08112, over 4265569.12 frames. ], batch size: 131, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:04:17,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.655e+02 3.086e+02 4.104e+02 7.872e+02, threshold=6.172e+02, percent-clipped=2.0 2023-06-22 06:04:54,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1159362.0, ans=0.125 2023-06-22 06:04:55,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1159422.0, ans=0.125 2023-06-22 06:05:22,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1159482.0, ans=0.2 2023-06-22 06:05:39,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1159542.0, ans=0.2 2023-06-22 06:06:01,477 INFO [train.py:996] (1/4) Epoch 7, batch 10300, loss[loss=0.2592, simple_loss=0.3342, pruned_loss=0.09212, over 21418.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3157, pruned_loss=0.0835, over 4265355.88 frames. 
], batch size: 211, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:06:30,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1159662.0, ans=0.125 2023-06-22 06:07:14,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1159782.0, ans=0.125 2023-06-22 06:07:20,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1159842.0, ans=0.125 2023-06-22 06:07:36,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159842.0, ans=0.1 2023-06-22 06:07:43,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.10 vs. limit=15.0 2023-06-22 06:07:44,256 INFO [train.py:996] (1/4) Epoch 7, batch 10350, loss[loss=0.1971, simple_loss=0.2626, pruned_loss=0.06576, over 21508.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3176, pruned_loss=0.08348, over 4267865.29 frames. ], batch size: 195, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:07:47,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.368e+02 3.957e+02 4.921e+02 8.307e+02, threshold=7.914e+02, percent-clipped=7.0 2023-06-22 06:09:03,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-22 06:09:24,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1160142.0, ans=0.125 2023-06-22 06:09:28,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1160142.0, ans=0.125 2023-06-22 06:09:31,087 INFO [train.py:996] (1/4) Epoch 7, batch 10400, loss[loss=0.2065, simple_loss=0.2761, pruned_loss=0.06846, over 21623.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3088, pruned_loss=0.08148, over 4274521.04 frames. ], batch size: 263, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:09:54,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1160262.0, ans=0.125 2023-06-22 06:09:54,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1160262.0, ans=0.125 2023-06-22 06:09:57,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1160262.0, ans=0.0 2023-06-22 06:10:01,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1160262.0, ans=0.5 2023-06-22 06:10:45,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1160382.0, ans=0.2 2023-06-22 06:11:13,843 INFO [train.py:996] (1/4) Epoch 7, batch 10450, loss[loss=0.3401, simple_loss=0.3891, pruned_loss=0.1455, over 21804.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3148, pruned_loss=0.08414, over 4270433.20 frames. 
], batch size: 441, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:11:16,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.616e+02 3.263e+02 3.725e+02 4.769e+02 8.321e+02, threshold=7.450e+02, percent-clipped=2.0 2023-06-22 06:11:19,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-22 06:11:53,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-22 06:11:55,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1160622.0, ans=0.0 2023-06-22 06:12:31,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1160682.0, ans=0.2 2023-06-22 06:12:47,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1160742.0, ans=0.0 2023-06-22 06:12:58,243 INFO [train.py:996] (1/4) Epoch 7, batch 10500, loss[loss=0.1909, simple_loss=0.2608, pruned_loss=0.06047, over 21513.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3147, pruned_loss=0.08294, over 4264972.33 frames. ], batch size: 230, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:13:50,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-22 06:13:53,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.19 vs. limit=10.0 2023-06-22 06:14:27,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161042.0, ans=0.1 2023-06-22 06:14:37,830 INFO [train.py:996] (1/4) Epoch 7, batch 10550, loss[loss=0.2499, simple_loss=0.3037, pruned_loss=0.09799, over 21859.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3078, pruned_loss=0.08238, over 4253228.92 frames. 
], batch size: 107, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:14:40,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.300e+02 2.934e+02 3.554e+02 4.294e+02 7.411e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-22 06:14:47,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1161102.0, ans=0.0 2023-06-22 06:15:13,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1161222.0, ans=0.125 2023-06-22 06:15:40,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1161282.0, ans=0.125 2023-06-22 06:15:40,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161282.0, ans=0.1 2023-06-22 06:15:51,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1161282.0, ans=10.0 2023-06-22 06:15:56,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1161342.0, ans=10.0 2023-06-22 06:16:14,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1161342.0, ans=0.125 2023-06-22 06:16:19,504 INFO [train.py:996] (1/4) Epoch 7, batch 10600, loss[loss=0.1918, simple_loss=0.2603, pruned_loss=0.06163, over 21992.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3034, pruned_loss=0.08055, over 4256710.36 frames. ], batch size: 103, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:17:33,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1161582.0, ans=0.0 2023-06-22 06:17:47,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1161642.0, ans=0.2 2023-06-22 06:18:06,276 INFO [train.py:996] (1/4) Epoch 7, batch 10650, loss[loss=0.1684, simple_loss=0.2543, pruned_loss=0.04126, over 21748.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3066, pruned_loss=0.07994, over 4255550.16 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:18:11,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.046e+02 3.763e+02 4.720e+02 8.386e+02, threshold=7.526e+02, percent-clipped=4.0 2023-06-22 06:18:13,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1161702.0, ans=0.125 2023-06-22 06:18:45,829 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:19:25,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1161942.0, ans=0.1 2023-06-22 06:19:47,783 INFO [train.py:996] (1/4) Epoch 7, batch 10700, loss[loss=0.2285, simple_loss=0.3038, pruned_loss=0.07655, over 21766.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3061, pruned_loss=0.0795, over 4251053.27 frames. 
], batch size: 247, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:20:13,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1162062.0, ans=0.1 2023-06-22 06:20:41,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1162122.0, ans=0.125 2023-06-22 06:21:21,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1162242.0, ans=0.035 2023-06-22 06:21:30,737 INFO [train.py:996] (1/4) Epoch 7, batch 10750, loss[loss=0.2696, simple_loss=0.3566, pruned_loss=0.09126, over 21392.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3198, pruned_loss=0.08467, over 4254900.53 frames. ], batch size: 211, lr: 4.36e-03, grad_scale: 8.0 2023-06-22 06:21:42,681 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 3.648e+02 4.416e+02 6.142e+02 1.061e+03, threshold=8.833e+02, percent-clipped=11.0 2023-06-22 06:21:44,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1162302.0, ans=0.035 2023-06-22 06:22:32,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1162482.0, ans=0.0 2023-06-22 06:22:44,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1162482.0, ans=0.2 2023-06-22 06:22:46,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1162482.0, ans=0.2 2023-06-22 06:22:52,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1162482.0, ans=0.0 2023-06-22 06:22:59,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1162542.0, ans=0.125 2023-06-22 06:23:17,176 INFO [train.py:996] (1/4) Epoch 7, batch 10800, loss[loss=0.3241, simple_loss=0.381, pruned_loss=0.1336, over 21329.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3249, pruned_loss=0.08615, over 4261840.89 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:23:49,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1162662.0, ans=0.2 2023-06-22 06:24:25,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-22 06:24:35,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1162842.0, ans=0.125 2023-06-22 06:24:44,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1162842.0, ans=0.125 2023-06-22 06:24:53,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1162842.0, ans=0.2 2023-06-22 06:24:56,602 INFO [train.py:996] (1/4) Epoch 7, batch 10850, loss[loss=0.2312, simple_loss=0.2989, pruned_loss=0.08177, over 21623.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3225, pruned_loss=0.08569, over 4267684.01 frames. 
], batch size: 415, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:25:07,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.182e+02 4.104e+02 5.003e+02 8.249e+02, threshold=8.208e+02, percent-clipped=0.0 2023-06-22 06:25:26,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1162962.0, ans=0.125 2023-06-22 06:26:11,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163082.0, ans=0.1 2023-06-22 06:26:23,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1163142.0, ans=0.125 2023-06-22 06:26:41,645 INFO [train.py:996] (1/4) Epoch 7, batch 10900, loss[loss=0.3082, simple_loss=0.3864, pruned_loss=0.115, over 21400.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3147, pruned_loss=0.08304, over 4260415.65 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:26:55,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1163202.0, ans=0.125 2023-06-22 06:27:08,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1163262.0, ans=0.125 2023-06-22 06:27:14,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1163262.0, ans=0.125 2023-06-22 06:27:22,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1163322.0, ans=0.125 2023-06-22 06:27:57,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1163382.0, ans=0.125 2023-06-22 06:28:07,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-22 06:28:16,279 INFO [train.py:996] (1/4) Epoch 7, batch 10950, loss[loss=0.22, simple_loss=0.3072, pruned_loss=0.06641, over 19903.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3119, pruned_loss=0.08124, over 4255511.48 frames. ], batch size: 702, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:28:27,458 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 3.282e+02 3.918e+02 4.735e+02 6.803e+02, threshold=7.835e+02, percent-clipped=0.0 2023-06-22 06:28:41,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1163562.0, ans=0.0 2023-06-22 06:29:03,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1163622.0, ans=0.0 2023-06-22 06:29:05,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1163622.0, ans=0.0 2023-06-22 06:29:23,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1163682.0, ans=0.0 2023-06-22 06:29:50,135 INFO [train.py:996] (1/4) Epoch 7, batch 11000, loss[loss=0.2332, simple_loss=0.2963, pruned_loss=0.08502, over 21804.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3089, pruned_loss=0.08143, over 4268012.20 frames. 
], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:30:53,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1163982.0, ans=0.0 2023-06-22 06:31:10,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1163982.0, ans=0.125 2023-06-22 06:31:30,339 INFO [train.py:996] (1/4) Epoch 7, batch 11050, loss[loss=0.2029, simple_loss=0.2653, pruned_loss=0.07023, over 21423.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.306, pruned_loss=0.08265, over 4274929.91 frames. ], batch size: 131, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:31:33,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1164102.0, ans=0.125 2023-06-22 06:31:40,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.144e+02 3.658e+02 4.347e+02 7.948e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-22 06:32:06,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1164162.0, ans=0.1 2023-06-22 06:32:28,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1164282.0, ans=0.04949747468305833 2023-06-22 06:33:02,555 INFO [train.py:996] (1/4) Epoch 7, batch 11100, loss[loss=0.2191, simple_loss=0.3046, pruned_loss=0.06683, over 21409.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3045, pruned_loss=0.08297, over 4267874.44 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:33:09,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1164402.0, ans=0.125 2023-06-22 06:33:52,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-22 06:33:58,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1164522.0, ans=0.1 2023-06-22 06:34:00,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1164522.0, ans=0.0 2023-06-22 06:34:00,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1164522.0, ans=0.0 2023-06-22 06:34:42,431 INFO [train.py:996] (1/4) Epoch 7, batch 11150, loss[loss=0.2197, simple_loss=0.3028, pruned_loss=0.06832, over 21323.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3023, pruned_loss=0.08235, over 4266887.60 frames. 
], batch size: 176, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:34:48,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 2.924e+02 3.298e+02 3.958e+02 6.309e+02, threshold=6.596e+02, percent-clipped=0.0 2023-06-22 06:35:00,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1164702.0, ans=0.125 2023-06-22 06:35:05,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1164762.0, ans=0.125 2023-06-22 06:35:27,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1164822.0, ans=0.125 2023-06-22 06:35:41,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1164882.0, ans=0.125 2023-06-22 06:35:54,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1164942.0, ans=0.125 2023-06-22 06:36:02,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1164942.0, ans=0.125 2023-06-22 06:36:16,933 INFO [train.py:996] (1/4) Epoch 7, batch 11200, loss[loss=0.2056, simple_loss=0.2748, pruned_loss=0.06822, over 21844.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3008, pruned_loss=0.08187, over 4264679.26 frames. ], batch size: 373, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:36:33,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1165002.0, ans=0.125 2023-06-22 06:36:49,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1165062.0, ans=0.2 2023-06-22 06:37:12,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1165122.0, ans=0.1 2023-06-22 06:37:16,033 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-22 06:37:23,306 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:37:29,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1165242.0, ans=0.0 2023-06-22 06:37:42,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165242.0, ans=0.125 2023-06-22 06:37:52,335 INFO [train.py:996] (1/4) Epoch 7, batch 11250, loss[loss=0.2496, simple_loss=0.3297, pruned_loss=0.08478, over 21863.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3016, pruned_loss=0.08281, over 4261795.11 frames. 
], batch size: 124, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:37:54,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1165302.0, ans=0.125 2023-06-22 06:37:57,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1165302.0, ans=0.0 2023-06-22 06:37:58,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.901e+02 3.332e+02 3.824e+02 5.999e+02, threshold=6.664e+02, percent-clipped=0.0 2023-06-22 06:38:40,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1165422.0, ans=0.2 2023-06-22 06:39:31,712 INFO [train.py:996] (1/4) Epoch 7, batch 11300, loss[loss=0.1969, simple_loss=0.27, pruned_loss=0.06186, over 21579.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3028, pruned_loss=0.08309, over 4268462.80 frames. ], batch size: 195, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:39:48,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165602.0, ans=0.125 2023-06-22 06:39:56,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-22 06:40:33,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1165782.0, ans=0.125 2023-06-22 06:40:34,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1165782.0, ans=0.0 2023-06-22 06:41:12,132 INFO [train.py:996] (1/4) Epoch 7, batch 11350, loss[loss=0.2925, simple_loss=0.3633, pruned_loss=0.1109, over 21903.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3049, pruned_loss=0.08208, over 4265022.91 frames. ], batch size: 372, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:41:23,500 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.937e+02 3.595e+02 4.319e+02 9.423e+02, threshold=7.190e+02, percent-clipped=3.0 2023-06-22 06:42:14,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166082.0, ans=0.1 2023-06-22 06:42:15,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166082.0, ans=0.1 2023-06-22 06:42:17,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-22 06:42:59,014 INFO [train.py:996] (1/4) Epoch 7, batch 11400, loss[loss=0.2635, simple_loss=0.3382, pruned_loss=0.09438, over 21340.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3098, pruned_loss=0.0842, over 4268361.79 frames. ], batch size: 549, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:43:06,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. 
limit=15.0 2023-06-22 06:43:41,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1166322.0, ans=0.125 2023-06-22 06:43:46,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1166322.0, ans=0.07 2023-06-22 06:44:40,327 INFO [train.py:996] (1/4) Epoch 7, batch 11450, loss[loss=0.2206, simple_loss=0.2989, pruned_loss=0.0712, over 21695.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3096, pruned_loss=0.08248, over 4263454.82 frames. ], batch size: 247, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:44:52,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.083e+02 3.885e+02 5.108e+02 7.985e+02, threshold=7.771e+02, percent-clipped=2.0 2023-06-22 06:45:00,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1166502.0, ans=0.0 2023-06-22 06:45:26,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1166622.0, ans=0.1 2023-06-22 06:45:43,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=22.5 2023-06-22 06:46:21,793 INFO [train.py:996] (1/4) Epoch 7, batch 11500, loss[loss=0.237, simple_loss=0.3132, pruned_loss=0.08041, over 21159.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3145, pruned_loss=0.08472, over 4266082.05 frames. ], batch size: 159, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:46:41,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1166802.0, ans=0.125 2023-06-22 06:48:14,285 INFO [train.py:996] (1/4) Epoch 7, batch 11550, loss[loss=0.2858, simple_loss=0.388, pruned_loss=0.09174, over 21752.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3188, pruned_loss=0.08423, over 4266185.08 frames. ], batch size: 351, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:48:21,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.086e+02 3.744e+02 4.289e+02 8.491e+02, threshold=7.488e+02, percent-clipped=1.0 2023-06-22 06:49:56,247 INFO [train.py:996] (1/4) Epoch 7, batch 11600, loss[loss=0.2729, simple_loss=0.3564, pruned_loss=0.09473, over 21321.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.334, pruned_loss=0.08684, over 4269902.76 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:50:12,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1167402.0, ans=0.0 2023-06-22 06:50:18,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1167462.0, ans=0.0 2023-06-22 06:51:14,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1167582.0, ans=0.2 2023-06-22 06:51:20,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167642.0, ans=0.1 2023-06-22 06:51:37,749 INFO [train.py:996] (1/4) Epoch 7, batch 11650, loss[loss=0.3092, simple_loss=0.3713, pruned_loss=0.1236, over 21498.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3408, pruned_loss=0.0875, over 4257813.81 frames. 
], batch size: 441, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:51:48,468 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1167702.0, ans=0.125 2023-06-22 06:51:52,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.531e+02 4.483e+02 5.705e+02 9.764e+02, threshold=8.966e+02, percent-clipped=9.0 2023-06-22 06:51:59,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1167762.0, ans=0.0 2023-06-22 06:52:25,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1167822.0, ans=0.125 2023-06-22 06:52:38,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1167882.0, ans=0.125 2023-06-22 06:52:59,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1167882.0, ans=0.0 2023-06-22 06:53:07,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1167942.0, ans=0.125 2023-06-22 06:53:18,244 INFO [train.py:996] (1/4) Epoch 7, batch 11700, loss[loss=0.2259, simple_loss=0.287, pruned_loss=0.08238, over 21985.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3316, pruned_loss=0.08707, over 4264031.47 frames. ], batch size: 119, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:53:38,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1168062.0, ans=0.125 2023-06-22 06:54:16,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1168182.0, ans=0.2 2023-06-22 06:54:27,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1168182.0, ans=0.0 2023-06-22 06:54:56,786 INFO [train.py:996] (1/4) Epoch 7, batch 11750, loss[loss=0.2016, simple_loss=0.2685, pruned_loss=0.06729, over 21781.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3224, pruned_loss=0.08642, over 4261918.01 frames. ], batch size: 112, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:55:11,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.104e+02 3.664e+02 4.523e+02 8.929e+02, threshold=7.328e+02, percent-clipped=0.0 2023-06-22 06:56:44,783 INFO [train.py:996] (1/4) Epoch 7, batch 11800, loss[loss=0.2722, simple_loss=0.3364, pruned_loss=0.104, over 21220.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3245, pruned_loss=0.0885, over 4268541.48 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:56:49,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-22 06:58:10,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1168842.0, ans=0.04949747468305833 2023-06-22 06:58:25,540 INFO [train.py:996] (1/4) Epoch 7, batch 11850, loss[loss=0.2441, simple_loss=0.3138, pruned_loss=0.08717, over 21482.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3246, pruned_loss=0.08683, over 4277102.72 frames. 
], batch size: 211, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:58:39,974 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.224e+02 3.745e+02 4.482e+02 9.714e+02, threshold=7.491e+02, percent-clipped=2.0 2023-06-22 06:58:51,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1168962.0, ans=0.125 2023-06-22 06:58:54,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1168962.0, ans=0.0 2023-06-22 06:59:10,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1169022.0, ans=0.125 2023-06-22 06:59:24,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1169022.0, ans=0.125 2023-06-22 06:59:28,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1169082.0, ans=15.0 2023-06-22 06:59:57,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1169142.0, ans=0.0 2023-06-22 07:00:12,050 INFO [train.py:996] (1/4) Epoch 7, batch 11900, loss[loss=0.1894, simple_loss=0.282, pruned_loss=0.04839, over 21566.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3272, pruned_loss=0.08496, over 4271686.56 frames. ], batch size: 263, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:00:16,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-22 07:00:21,712 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-22 07:00:47,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1169262.0, ans=0.125 2023-06-22 07:00:55,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169322.0, ans=0.1 2023-06-22 07:01:08,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1169382.0, ans=0.0 2023-06-22 07:01:41,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-22 07:01:48,763 INFO [train.py:996] (1/4) Epoch 7, batch 11950, loss[loss=0.2488, simple_loss=0.3414, pruned_loss=0.07816, over 21677.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3294, pruned_loss=0.08235, over 4271268.99 frames. 
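The optim.py entries above ("Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...") summarise gradient-norm statistics over recent batches: five quantiles (min, 25%, 50%, 75%, max), the clipping threshold in force, and how often clipping fired. In these logs the threshold is consistently twice the median, so the sketch below uses threshold = clipping_scale * median; that rule is inferred from the log rather than taken from optim.py, and the class is a stand-in, not the library's API.

# Hypothetical grad-norm tracker producing statistics like the lines above.
from collections import deque
import torch

class GradNormClipperSketch:
    def __init__(self, clipping_scale: float = 2.0, history: int = 1000):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)   # recent global gradient norms
        self.clipped = 0
        self.total = 0

    def clip_(self, parameters):
        grads = [p.grad for p in parameters if p.grad is not None]
        norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
        self.norms.append(norm)
        hist = torch.tensor(list(self.norms))
        quartiles = [torch.quantile(hist, q).item() for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = self.clipping_scale * quartiles[2]   # assumed: scale x median
        self.total += 1
        if norm > threshold:                             # rescale grads in place
            self.clipped += 1
            for g in grads:
                g.mul_(threshold / norm)
        percent_clipped = 100.0 * self.clipped / self.total
        return quartiles, threshold, percent_clipped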
], batch size: 414, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:01:58,130 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.043e+02 3.599e+02 4.818e+02 9.282e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-22 07:02:32,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1169622.0, ans=0.2 2023-06-22 07:02:52,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1169682.0, ans=0.0 2023-06-22 07:03:02,480 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:03:05,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1169742.0, ans=0.125 2023-06-22 07:03:20,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1169742.0, ans=0.0 2023-06-22 07:03:27,372 INFO [train.py:996] (1/4) Epoch 7, batch 12000, loss[loss=0.2211, simple_loss=0.2933, pruned_loss=0.07439, over 21641.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3214, pruned_loss=0.0801, over 4279736.05 frames. ], batch size: 298, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:03:27,373 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 07:03:43,840 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2652, simple_loss=0.3601, pruned_loss=0.08515, over 1796401.00 frames. 2023-06-22 07:03:43,840 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 07:04:30,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1169922.0, ans=0.0 2023-06-22 07:05:04,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1169982.0, ans=0.0 2023-06-22 07:05:23,318 INFO [train.py:996] (1/4) Epoch 7, batch 12050, loss[loss=0.2127, simple_loss=0.283, pruned_loss=0.07117, over 21501.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3168, pruned_loss=0.08238, over 4288226.86 frames. ], batch size: 211, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:05:37,738 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.086e+02 3.580e+02 4.845e+02 1.189e+03, threshold=7.160e+02, percent-clipped=3.0 2023-06-22 07:05:58,258 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:06:14,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1170222.0, ans=0.0 2023-06-22 07:06:21,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1170222.0, ans=0.2 2023-06-22 07:06:22,694 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:06:29,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170282.0, ans=0.1 2023-06-22 07:06:37,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1170282.0, ans=0.2 2023-06-22 07:06:38,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. 
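The "Computing validation loss" / "validation: loss=... over 1796401.00 frames" / "Maximum memory allocated so far" entries above come from a periodic validation pass. A hedged sketch of such a pass is below: it accumulates a frame-weighted loss over the dev loader and reports peak GPU memory; the model and dataloader interfaces are stand-ins, not the actual train.py code.

# Hypothetical validation pass producing log lines like those above.
import torch

@torch.no_grad()
def validate_sketch(model, valid_loader, device: str = "cuda:1") -> float:
    model.eval()
    weighted_loss, frames = 0.0, 0.0
    for batch in valid_loader:
        loss, num_frames = model(batch)      # assumed to return (loss, #frames)
        weighted_loss += loss.item() * num_frames
        frames += num_frames
    model.train()
    avg = weighted_loss / max(frames, 1.0)
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={avg:.4f}, over {frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
    return avg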
limit=15.0 2023-06-22 07:06:43,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1170282.0, ans=0.0 2023-06-22 07:06:47,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-22 07:07:09,361 INFO [train.py:996] (1/4) Epoch 7, batch 12100, loss[loss=0.1945, simple_loss=0.2475, pruned_loss=0.0708, over 20107.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3205, pruned_loss=0.08577, over 4281562.26 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:07:24,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1170402.0, ans=0.125 2023-06-22 07:07:36,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1170462.0, ans=0.0 2023-06-22 07:07:43,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1170462.0, ans=0.125 2023-06-22 07:08:18,894 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:08:57,259 INFO [train.py:996] (1/4) Epoch 7, batch 12150, loss[loss=0.3454, simple_loss=0.4266, pruned_loss=0.1321, over 21455.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3247, pruned_loss=0.08553, over 4282448.61 frames. ], batch size: 507, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:08:57,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1170702.0, ans=0.125 2023-06-22 07:09:06,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1170702.0, ans=0.0 2023-06-22 07:09:07,099 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.401e+02 4.092e+02 5.164e+02 8.690e+02, threshold=8.185e+02, percent-clipped=4.0 2023-06-22 07:09:47,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=22.5 2023-06-22 07:09:49,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-22 07:10:14,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. limit=10.0 2023-06-22 07:10:35,919 INFO [train.py:996] (1/4) Epoch 7, batch 12200, loss[loss=0.2619, simple_loss=0.3068, pruned_loss=0.1085, over 21392.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3223, pruned_loss=0.0842, over 4275105.01 frames. ], batch size: 508, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:10:38,674 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-22 07:10:44,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1171002.0, ans=0.0 2023-06-22 07:11:06,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1171062.0, ans=0.125 2023-06-22 07:11:22,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1171122.0, ans=0.0 2023-06-22 07:11:27,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171122.0, ans=0.1 2023-06-22 07:11:44,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1171182.0, ans=0.1 2023-06-22 07:11:48,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-22 07:12:13,464 INFO [train.py:996] (1/4) Epoch 7, batch 12250, loss[loss=0.175, simple_loss=0.2641, pruned_loss=0.04298, over 21767.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3136, pruned_loss=0.08033, over 4272505.58 frames. ], batch size: 371, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:12:17,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-22 07:12:24,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.345e+02 4.478e+02 6.132e+02 1.246e+03, threshold=8.957e+02, percent-clipped=10.0 2023-06-22 07:12:45,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.94 vs. limit=6.0 2023-06-22 07:13:19,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1171482.0, ans=0.2 2023-06-22 07:13:46,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1171542.0, ans=0.125 2023-06-22 07:13:51,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171602.0, ans=0.1 2023-06-22 07:13:52,904 INFO [train.py:996] (1/4) Epoch 7, batch 12300, loss[loss=0.1802, simple_loss=0.2538, pruned_loss=0.0533, over 21148.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3031, pruned_loss=0.07434, over 4261960.01 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:14:06,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1171602.0, ans=0.125 2023-06-22 07:15:26,388 INFO [train.py:996] (1/4) Epoch 7, batch 12350, loss[loss=0.2056, simple_loss=0.2891, pruned_loss=0.06105, over 21287.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3075, pruned_loss=0.07545, over 4263469.16 frames. 
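The scaling.py "Whitening" entries above compare a per-module metric against a limit (e.g. "metric=12.34 vs. limit=15.0"), i.e. they check how far each group of activations is from having a white (isotropic) covariance. One plausible way to compute such a metric is sketched below: it equals 1 for perfectly white features and grows as the covariance eigenvalues become uneven. This is an illustrative proxy, not necessarily the formula used in scaling.py.

# Hypothetical whiteness metric for activations of shape (frames, channels).
import torch

def whitening_metric_sketch(x: torch.Tensor, num_groups: int = 1) -> float:
    metrics = []
    for g in x.chunk(num_groups, dim=-1):          # split channels into groups
        g = g - g.mean(dim=0, keepdim=True)
        cov = (g.T @ g) / g.shape[0]               # (channels, channels) covariance
        eigs = torch.linalg.eigvalsh(cov)
        n = eigs.numel()
        # n * sum(eig^2) / (sum(eig))^2 == 1 iff all eigenvalues are equal.
        metrics.append((n * (eigs ** 2).sum() / eigs.sum() ** 2).item())
    return sum(metrics) / len(metrics)

x = torch.randn(1000, 256)
print(whitening_metric_sketch(x, num_groups=1))    # close to 1.0 for i.i.d. noise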
], batch size: 176, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:15:37,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.594e+02 3.277e+02 4.549e+02 8.356e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-22 07:15:47,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1171962.0, ans=0.0 2023-06-22 07:16:19,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1172022.0, ans=0.125 2023-06-22 07:16:52,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1172142.0, ans=0.125 2023-06-22 07:17:05,120 INFO [train.py:996] (1/4) Epoch 7, batch 12400, loss[loss=0.2449, simple_loss=0.3102, pruned_loss=0.08981, over 21553.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3106, pruned_loss=0.07891, over 4271776.08 frames. ], batch size: 548, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:17:07,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 07:17:31,060 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:17:32,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1172262.0, ans=0.1 2023-06-22 07:17:56,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1172322.0, ans=0.0 2023-06-22 07:18:44,327 INFO [train.py:996] (1/4) Epoch 7, batch 12450, loss[loss=0.3292, simple_loss=0.3766, pruned_loss=0.141, over 21449.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3158, pruned_loss=0.08282, over 4276037.47 frames. ], batch size: 510, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:19:01,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.229e+02 3.781e+02 4.439e+02 8.175e+02, threshold=7.562e+02, percent-clipped=5.0 2023-06-22 07:19:15,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1172562.0, ans=0.0 2023-06-22 07:19:24,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. limit=15.0 2023-06-22 07:19:35,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1172622.0, ans=0.0 2023-06-22 07:20:14,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1172742.0, ans=0.125 2023-06-22 07:20:32,537 INFO [train.py:996] (1/4) Epoch 7, batch 12500, loss[loss=0.2761, simple_loss=0.3873, pruned_loss=0.08246, over 21653.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3284, pruned_loss=0.08615, over 4273715.63 frames. 
], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:21:20,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1172922.0, ans=0.2 2023-06-22 07:21:23,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1172922.0, ans=0.0 2023-06-22 07:21:28,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1172922.0, ans=0.0 2023-06-22 07:22:03,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1173042.0, ans=0.125 2023-06-22 07:22:14,676 INFO [train.py:996] (1/4) Epoch 7, batch 12550, loss[loss=0.2984, simple_loss=0.3667, pruned_loss=0.115, over 21605.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3318, pruned_loss=0.0884, over 4280539.57 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:22:32,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.175e+02 3.622e+02 4.685e+02 7.876e+02, threshold=7.244e+02, percent-clipped=1.0 2023-06-22 07:22:34,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1173162.0, ans=0.0 2023-06-22 07:24:00,223 INFO [train.py:996] (1/4) Epoch 7, batch 12600, loss[loss=0.2021, simple_loss=0.2927, pruned_loss=0.05572, over 21827.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3313, pruned_loss=0.08658, over 4278641.03 frames. ], batch size: 333, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:24:23,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.99 vs. limit=10.0 2023-06-22 07:24:27,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1173462.0, ans=0.125 2023-06-22 07:24:32,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1173462.0, ans=0.2 2023-06-22 07:24:42,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1173522.0, ans=0.1 2023-06-22 07:25:28,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1173642.0, ans=0.125 2023-06-22 07:25:38,775 INFO [train.py:996] (1/4) Epoch 7, batch 12650, loss[loss=0.3057, simple_loss=0.3499, pruned_loss=0.1308, over 21706.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.324, pruned_loss=0.0832, over 4271550.94 frames. 
], batch size: 507, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:25:47,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1173702.0, ans=0.125 2023-06-22 07:25:51,196 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.154e+02 3.639e+02 4.446e+02 1.064e+03, threshold=7.278e+02, percent-clipped=5.0 2023-06-22 07:26:03,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1173762.0, ans=0.0 2023-06-22 07:26:08,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1173762.0, ans=0.2 2023-06-22 07:26:23,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1173822.0, ans=0.125 2023-06-22 07:27:19,815 INFO [train.py:996] (1/4) Epoch 7, batch 12700, loss[loss=0.2576, simple_loss=0.3373, pruned_loss=0.08897, over 21468.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3222, pruned_loss=0.08479, over 4270909.52 frames. ], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:27:51,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174062.0, ans=0.1 2023-06-22 07:27:53,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-22 07:28:27,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1174182.0, ans=0.125 2023-06-22 07:28:27,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1174182.0, ans=0.125 2023-06-22 07:28:36,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-22 07:28:57,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1174242.0, ans=0.0 2023-06-22 07:29:00,202 INFO [train.py:996] (1/4) Epoch 7, batch 12750, loss[loss=0.2431, simple_loss=0.3336, pruned_loss=0.07633, over 21691.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3244, pruned_loss=0.08612, over 4270027.98 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:29:17,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.566e+02 3.268e+02 3.639e+02 4.556e+02 7.416e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 07:29:27,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1174362.0, ans=0.125 2023-06-22 07:29:31,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. 
limit=22.5 2023-06-22 07:29:34,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174362.0, ans=0.1 2023-06-22 07:30:08,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1174482.0, ans=0.0 2023-06-22 07:30:18,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1174482.0, ans=0.125 2023-06-22 07:30:44,416 INFO [train.py:996] (1/4) Epoch 7, batch 12800, loss[loss=0.2614, simple_loss=0.3278, pruned_loss=0.09752, over 21639.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3241, pruned_loss=0.08642, over 4278595.90 frames. ], batch size: 230, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:30:52,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1174602.0, ans=0.125 2023-06-22 07:32:00,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-22 07:32:16,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1174842.0, ans=0.0 2023-06-22 07:32:25,169 INFO [train.py:996] (1/4) Epoch 7, batch 12850, loss[loss=0.2751, simple_loss=0.3715, pruned_loss=0.08933, over 19901.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3264, pruned_loss=0.08813, over 4276575.62 frames. ], batch size: 704, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:32:39,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.112e+02 3.554e+02 4.407e+02 7.373e+02, threshold=7.108e+02, percent-clipped=1.0 2023-06-22 07:32:52,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-22 07:33:10,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1175022.0, ans=0.125 2023-06-22 07:34:06,270 INFO [train.py:996] (1/4) Epoch 7, batch 12900, loss[loss=0.2742, simple_loss=0.3497, pruned_loss=0.09933, over 21471.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3232, pruned_loss=0.08468, over 4278870.47 frames. ], batch size: 471, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:34:42,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-22 07:34:55,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1175322.0, ans=0.04949747468305833 2023-06-22 07:35:12,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175322.0, ans=0.1 2023-06-22 07:35:14,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1175382.0, ans=0.125 2023-06-22 07:35:25,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=8.0 2023-06-22 07:35:53,275 INFO [train.py:996] (1/4) Epoch 7, batch 12950, loss[loss=0.2504, simple_loss=0.3236, pruned_loss=0.08859, over 21483.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3213, pruned_loss=0.08291, over 4276641.69 frames. 
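The grad_scale field in the batch lines above (16.0, 32.0, occasionally halving) is characteristic of dynamic loss scaling in fp16 training: the scale grows while gradients stay finite and is cut when overflows occur. A minimal sketch of one training step using the standard torch.cuda.amp machinery is below; the actual recipe may wrap or replace this with its own scaler, so treat it only as an illustration of where such a value comes from.

# Minimal mixed-precision step with dynamic loss scaling.
import torch

def train_step_sketch(model, batch, optimizer, scaler: "torch.cuda.amp.GradScaler") -> float:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)              # assume the model returns a scalar loss
    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # unscales grads; skips the step on inf/nan
    scaler.update()                      # grow or shrink the scale dynamically
    return scaler.get_scale()            # the value that would be logged as grad_scale

# scaler = torch.cuda.amp.GradScaler(enabled=True)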
], batch size: 211, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:36:08,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1175502.0, ans=0.0 2023-06-22 07:36:12,628 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.893e+02 3.599e+02 4.715e+02 8.391e+02, threshold=7.198e+02, percent-clipped=5.0 2023-06-22 07:36:16,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1175562.0, ans=0.125 2023-06-22 07:36:23,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-22 07:36:38,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1175622.0, ans=0.125 2023-06-22 07:37:33,428 INFO [train.py:996] (1/4) Epoch 7, batch 13000, loss[loss=0.2162, simple_loss=0.3002, pruned_loss=0.06606, over 21787.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3207, pruned_loss=0.08286, over 4265127.73 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:37:56,076 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-22 07:39:02,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-22 07:39:07,141 INFO [train.py:996] (1/4) Epoch 7, batch 13050, loss[loss=0.2691, simple_loss=0.3293, pruned_loss=0.1044, over 21405.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.318, pruned_loss=0.08089, over 4261751.91 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:39:30,668 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.860e+02 3.531e+02 4.680e+02 1.133e+03, threshold=7.061e+02, percent-clipped=2.0 2023-06-22 07:39:40,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 07:39:52,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1176222.0, ans=0.125 2023-06-22 07:39:58,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1176222.0, ans=0.125 2023-06-22 07:39:58,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1176222.0, ans=0.125 2023-06-22 07:40:03,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.55 vs. limit=10.0 2023-06-22 07:40:48,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1176342.0, ans=0.0 2023-06-22 07:40:56,525 INFO [train.py:996] (1/4) Epoch 7, batch 13100, loss[loss=0.2725, simple_loss=0.3511, pruned_loss=0.09693, over 21755.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.319, pruned_loss=0.08066, over 4272445.04 frames. 
], batch size: 332, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:42:01,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1176582.0, ans=0.0 2023-06-22 07:42:35,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1176642.0, ans=0.1 2023-06-22 07:42:42,753 INFO [train.py:996] (1/4) Epoch 7, batch 13150, loss[loss=0.2228, simple_loss=0.2917, pruned_loss=0.0769, over 21598.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3215, pruned_loss=0.08328, over 4271969.00 frames. ], batch size: 263, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:42:56,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1176702.0, ans=0.125 2023-06-22 07:43:01,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.619e+02 4.512e+02 5.792e+02 9.632e+02, threshold=9.025e+02, percent-clipped=11.0 2023-06-22 07:43:37,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1176882.0, ans=0.125 2023-06-22 07:44:24,039 INFO [train.py:996] (1/4) Epoch 7, batch 13200, loss[loss=0.2695, simple_loss=0.3221, pruned_loss=0.1084, over 20021.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3204, pruned_loss=0.08377, over 4271115.55 frames. ], batch size: 702, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:45:15,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1177122.0, ans=0.125 2023-06-22 07:45:21,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1177182.0, ans=0.125 2023-06-22 07:45:32,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1177182.0, ans=0.125 2023-06-22 07:45:34,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1177182.0, ans=0.125 2023-06-22 07:45:51,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177242.0, ans=0.1 2023-06-22 07:46:09,406 INFO [train.py:996] (1/4) Epoch 7, batch 13250, loss[loss=0.232, simple_loss=0.3067, pruned_loss=0.07866, over 21260.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3194, pruned_loss=0.08546, over 4275235.65 frames. ], batch size: 176, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:46:09,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1177302.0, ans=0.125 2023-06-22 07:46:18,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1177302.0, ans=0.0 2023-06-22 07:46:24,188 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.563e+02 3.281e+02 4.048e+02 5.234e+02 8.486e+02, threshold=8.096e+02, percent-clipped=0.0 2023-06-22 07:47:09,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. 
limit=12.0 2023-06-22 07:47:25,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1177482.0, ans=10.0 2023-06-22 07:47:50,917 INFO [train.py:996] (1/4) Epoch 7, batch 13300, loss[loss=0.2869, simple_loss=0.3635, pruned_loss=0.1052, over 21653.00 frames. ], tot_loss[loss=0.246, simple_loss=0.322, pruned_loss=0.08494, over 4279761.93 frames. ], batch size: 389, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:47:51,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1177602.0, ans=0.1 2023-06-22 07:48:00,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-22 07:48:15,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177662.0, ans=0.1 2023-06-22 07:49:28,844 INFO [train.py:996] (1/4) Epoch 7, batch 13350, loss[loss=0.2824, simple_loss=0.3551, pruned_loss=0.1049, over 21740.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.327, pruned_loss=0.08739, over 4281079.03 frames. ], batch size: 247, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:49:43,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.139e+02 3.531e+02 4.158e+02 7.079e+02, threshold=7.062e+02, percent-clipped=0.0 2023-06-22 07:49:57,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1177962.0, ans=22.5 2023-06-22 07:50:34,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1178082.0, ans=0.125 2023-06-22 07:51:08,300 INFO [train.py:996] (1/4) Epoch 7, batch 13400, loss[loss=0.3235, simple_loss=0.3772, pruned_loss=0.1349, over 21491.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3291, pruned_loss=0.08939, over 4279509.78 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:52:00,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-22 07:52:23,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1178382.0, ans=0.125 2023-06-22 07:52:37,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1178442.0, ans=0.0 2023-06-22 07:52:48,441 INFO [train.py:996] (1/4) Epoch 7, batch 13450, loss[loss=0.2994, simple_loss=0.3634, pruned_loss=0.1177, over 21303.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3295, pruned_loss=0.09161, over 4278024.54 frames. 
], batch size: 143, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:52:52,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1178502.0, ans=0.125 2023-06-22 07:53:11,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1178502.0, ans=0.125 2023-06-22 07:53:12,726 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.492e+02 3.365e+02 3.946e+02 4.575e+02 8.284e+02, threshold=7.892e+02, percent-clipped=1.0 2023-06-22 07:53:37,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1178622.0, ans=0.125 2023-06-22 07:54:28,492 INFO [train.py:996] (1/4) Epoch 7, batch 13500, loss[loss=0.2752, simple_loss=0.3464, pruned_loss=0.102, over 21643.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3181, pruned_loss=0.08828, over 4271718.88 frames. ], batch size: 441, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:54:54,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-22 07:55:47,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.23 vs. limit=10.0 2023-06-22 07:55:58,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1179042.0, ans=0.125 2023-06-22 07:56:15,653 INFO [train.py:996] (1/4) Epoch 7, batch 13550, loss[loss=0.3246, simple_loss=0.4155, pruned_loss=0.1169, over 21564.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3222, pruned_loss=0.08764, over 4272271.33 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:56:30,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1179102.0, ans=0.125 2023-06-22 07:56:36,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 3.442e+02 4.149e+02 5.236e+02 8.278e+02, threshold=8.298e+02, percent-clipped=4.0 2023-06-22 07:57:03,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1179222.0, ans=0.0 2023-06-22 07:57:23,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-22 07:57:28,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=12.0 2023-06-22 07:57:33,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-22 07:57:48,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179342.0, ans=0.125 2023-06-22 07:57:54,928 INFO [train.py:996] (1/4) Epoch 7, batch 13600, loss[loss=0.2387, simple_loss=0.3145, pruned_loss=0.0814, over 21802.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3237, pruned_loss=0.08877, over 4278116.59 frames. 
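Each batch line above reports both a per-batch loss "over N frames" and a running tot_loss "over roughly 4.27 million frames", i.e. an average over many recent batches weighted by the number of frames in each. A minimal sketch of such a frame-weighted running average is below; the real script may decay or periodically reset these statistics, which this sketch does not attempt to reproduce.

# Hypothetical frame-weighted running average, as reported in tot_loss[...].
class RunningLossSketch:
    def __init__(self):
        self.weighted_sum = 0.0
        self.frames = 0.0

    def update(self, loss: float, num_frames: float) -> None:
        self.weighted_sum += loss * num_frames
        self.frames += num_frames

    @property
    def average(self) -> float:
        return self.weighted_sum / max(self.frames, 1.0)

tot = RunningLossSketch()
tot.update(0.2387, 21802.0)    # placeholder numbers, not taken from the log
tot.update(0.2506, 20150.0)
print(f"tot_loss={tot.average:.4f} over {tot.frames:.2f} frames")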
], batch size: 298, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:58:28,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1179462.0, ans=0.125 2023-06-22 07:58:36,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-22 07:58:55,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179582.0, ans=0.1 2023-06-22 07:58:58,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1179582.0, ans=0.2 2023-06-22 07:59:01,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1179582.0, ans=0.0 2023-06-22 07:59:21,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1179642.0, ans=0.125 2023-06-22 07:59:23,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179642.0, ans=0.1 2023-06-22 07:59:26,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1179642.0, ans=0.0 2023-06-22 07:59:34,119 INFO [train.py:996] (1/4) Epoch 7, batch 13650, loss[loss=0.2411, simple_loss=0.2993, pruned_loss=0.0915, over 21319.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3196, pruned_loss=0.08517, over 4277655.18 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:59:54,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.926e+02 3.440e+02 4.459e+02 9.365e+02, threshold=6.879e+02, percent-clipped=1.0 2023-06-22 07:59:59,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1179762.0, ans=0.035 2023-06-22 08:00:14,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1179822.0, ans=0.125 2023-06-22 08:00:32,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1179882.0, ans=0.1 2023-06-22 08:01:09,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1179942.0, ans=0.125 2023-06-22 08:01:13,451 INFO [train.py:996] (1/4) Epoch 7, batch 13700, loss[loss=0.3395, simple_loss=0.4034, pruned_loss=0.1378, over 21517.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3145, pruned_loss=0.08511, over 4278178.36 frames. ], batch size: 508, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:01:14,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1180002.0, ans=0.0 2023-06-22 08:01:27,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-22 08:02:04,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1180122.0, ans=0.1 2023-06-22 08:02:37,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2023-06-22 08:02:45,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1180242.0, ans=0.1 2023-06-22 08:02:58,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1180302.0, ans=0.2 2023-06-22 08:02:59,603 INFO [train.py:996] (1/4) Epoch 7, batch 13750, loss[loss=0.2174, simple_loss=0.2829, pruned_loss=0.07591, over 21240.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3135, pruned_loss=0.08413, over 4268007.31 frames. ], batch size: 176, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:03:02,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-22 08:03:20,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1180362.0, ans=0.2 2023-06-22 08:03:23,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.446e+02 3.304e+02 4.106e+02 4.985e+02 1.123e+03, threshold=8.212e+02, percent-clipped=9.0 2023-06-22 08:04:19,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 08:04:48,544 INFO [train.py:996] (1/4) Epoch 7, batch 13800, loss[loss=0.2694, simple_loss=0.3737, pruned_loss=0.08256, over 21655.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3188, pruned_loss=0.08332, over 4258956.09 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:05:38,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1180722.0, ans=0.125 2023-06-22 08:05:45,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1180722.0, ans=0.125 2023-06-22 08:06:10,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1180842.0, ans=0.0 2023-06-22 08:06:29,565 INFO [train.py:996] (1/4) Epoch 7, batch 13850, loss[loss=0.2967, simple_loss=0.3815, pruned_loss=0.1059, over 21673.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3249, pruned_loss=0.08385, over 4269787.14 frames. ], batch size: 414, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:06:33,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1180902.0, ans=0.125 2023-06-22 08:06:51,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 3.586e+02 4.613e+02 6.020e+02 1.189e+03, threshold=9.227e+02, percent-clipped=5.0 2023-06-22 08:07:53,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1181142.0, ans=0.125 2023-06-22 08:08:03,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1181142.0, ans=0.125 2023-06-22 08:08:08,738 INFO [train.py:996] (1/4) Epoch 7, batch 13900, loss[loss=0.2824, simple_loss=0.3398, pruned_loss=0.1125, over 21773.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3295, pruned_loss=0.08815, over 4277682.07 frames. 
], batch size: 441, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:08:41,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1181262.0, ans=0.0 2023-06-22 08:09:38,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181442.0, ans=0.1 2023-06-22 08:09:40,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1181442.0, ans=0.0 2023-06-22 08:09:49,805 INFO [train.py:996] (1/4) Epoch 7, batch 13950, loss[loss=0.2568, simple_loss=0.3202, pruned_loss=0.09667, over 21781.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3292, pruned_loss=0.08979, over 4280640.38 frames. ], batch size: 247, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:10:14,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1181562.0, ans=0.04949747468305833 2023-06-22 08:10:18,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. limit=12.0 2023-06-22 08:10:18,729 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.399e+02 3.923e+02 4.848e+02 6.986e+02, threshold=7.845e+02, percent-clipped=0.0 2023-06-22 08:10:23,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1181562.0, ans=0.125 2023-06-22 08:10:36,237 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:11:05,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1181682.0, ans=0.0 2023-06-22 08:11:28,478 INFO [train.py:996] (1/4) Epoch 7, batch 14000, loss[loss=0.1849, simple_loss=0.2498, pruned_loss=0.05996, over 21304.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3279, pruned_loss=0.08866, over 4278095.38 frames. ], batch size: 159, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:11:31,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1181802.0, ans=22.5 2023-06-22 08:11:56,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1181862.0, ans=0.0 2023-06-22 08:12:22,641 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:12:52,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1182042.0, ans=0.125 2023-06-22 08:13:00,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1182042.0, ans=0.125 2023-06-22 08:13:10,924 INFO [train.py:996] (1/4) Epoch 7, batch 14050, loss[loss=0.2466, simple_loss=0.2942, pruned_loss=0.09956, over 14790.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3207, pruned_loss=0.08444, over 4262334.37 frames. 
], batch size: 60, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:13:34,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.004e+02 3.495e+02 4.384e+02 1.047e+03, threshold=6.990e+02, percent-clipped=3.0 2023-06-22 08:13:35,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1182162.0, ans=0.125 2023-06-22 08:14:18,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182282.0, ans=0.1 2023-06-22 08:14:49,752 INFO [train.py:996] (1/4) Epoch 7, batch 14100, loss[loss=0.2322, simple_loss=0.2886, pruned_loss=0.08796, over 21709.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3142, pruned_loss=0.08379, over 4254539.97 frames. ], batch size: 282, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:15:00,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1182402.0, ans=0.035 2023-06-22 08:15:14,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1182462.0, ans=0.2 2023-06-22 08:16:21,803 INFO [train.py:996] (1/4) Epoch 7, batch 14150, loss[loss=0.2258, simple_loss=0.3054, pruned_loss=0.07306, over 21845.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3183, pruned_loss=0.08498, over 4254541.34 frames. ], batch size: 102, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:16:29,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1182702.0, ans=0.0 2023-06-22 08:16:44,675 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.878e+02 3.254e+02 3.924e+02 9.436e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-22 08:17:54,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1182942.0, ans=0.0 2023-06-22 08:17:57,564 INFO [train.py:996] (1/4) Epoch 7, batch 14200, loss[loss=0.2152, simple_loss=0.2811, pruned_loss=0.07462, over 21574.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3157, pruned_loss=0.08254, over 4265507.35 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:18:45,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1183122.0, ans=0.125 2023-06-22 08:18:58,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1183182.0, ans=0.2 2023-06-22 08:19:36,416 INFO [train.py:996] (1/4) Epoch 7, batch 14250, loss[loss=0.2496, simple_loss=0.3077, pruned_loss=0.09575, over 21265.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3097, pruned_loss=0.08235, over 4257287.41 frames. 
], batch size: 143, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:19:49,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1183302.0, ans=0.125 2023-06-22 08:19:54,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1183362.0, ans=0.0 2023-06-22 08:19:55,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.870e+02 3.314e+02 3.996e+02 6.865e+02, threshold=6.627e+02, percent-clipped=2.0 2023-06-22 08:20:54,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1183542.0, ans=0.0 2023-06-22 08:21:15,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1183602.0, ans=0.125 2023-06-22 08:21:16,237 INFO [train.py:996] (1/4) Epoch 7, batch 14300, loss[loss=0.3564, simple_loss=0.4313, pruned_loss=0.1407, over 21596.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3103, pruned_loss=0.08205, over 4253017.57 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:22:56,615 INFO [train.py:996] (1/4) Epoch 7, batch 14350, loss[loss=0.2879, simple_loss=0.3747, pruned_loss=0.1006, over 21521.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3171, pruned_loss=0.0833, over 4260338.03 frames. ], batch size: 507, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:23:01,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1183902.0, ans=0.125 2023-06-22 08:23:15,129 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.408e+02 4.555e+02 6.047e+02 1.523e+03, threshold=9.110e+02, percent-clipped=21.0 2023-06-22 08:23:29,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1183962.0, ans=0.125 2023-06-22 08:23:57,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1184082.0, ans=0.125 2023-06-22 08:24:00,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1184082.0, ans=0.0 2023-06-22 08:24:14,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1184142.0, ans=0.125 2023-06-22 08:24:34,929 INFO [train.py:996] (1/4) Epoch 7, batch 14400, loss[loss=0.2718, simple_loss=0.3264, pruned_loss=0.1085, over 21727.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3146, pruned_loss=0.0837, over 4270767.18 frames. ], batch size: 124, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:25:16,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1184322.0, ans=0.125 2023-06-22 08:25:39,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1184382.0, ans=0.125 2023-06-22 08:26:11,495 INFO [train.py:996] (1/4) Epoch 7, batch 14450, loss[loss=0.2357, simple_loss=0.2943, pruned_loss=0.08856, over 21759.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.309, pruned_loss=0.08424, over 4262600.55 frames. 
], batch size: 333, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:26:30,338 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.987e+02 3.327e+02 4.057e+02 7.605e+02, threshold=6.653e+02, percent-clipped=0.0 2023-06-22 08:26:33,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1184562.0, ans=0.1 2023-06-22 08:27:52,131 INFO [train.py:996] (1/4) Epoch 7, batch 14500, loss[loss=0.2735, simple_loss=0.3195, pruned_loss=0.1138, over 21243.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.308, pruned_loss=0.08429, over 4263605.58 frames. ], batch size: 471, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:28:09,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-22 08:28:45,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1184922.0, ans=0.2 2023-06-22 08:29:28,344 INFO [train.py:996] (1/4) Epoch 7, batch 14550, loss[loss=0.2993, simple_loss=0.3663, pruned_loss=0.1162, over 21613.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3142, pruned_loss=0.08663, over 4265061.52 frames. ], batch size: 389, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:29:57,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.217e+02 4.103e+02 5.336e+02 9.308e+02, threshold=8.206e+02, percent-clipped=6.0 2023-06-22 08:30:25,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185222.0, ans=0.1 2023-06-22 08:30:28,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1185222.0, ans=0.2 2023-06-22 08:30:32,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1185282.0, ans=0.2 2023-06-22 08:30:46,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-22 08:30:50,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1185282.0, ans=0.1 2023-06-22 08:31:09,744 INFO [train.py:996] (1/4) Epoch 7, batch 14600, loss[loss=0.2886, simple_loss=0.3527, pruned_loss=0.1123, over 21803.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.322, pruned_loss=0.09031, over 4269082.46 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:31:21,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-22 08:31:57,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1185522.0, ans=0.125 2023-06-22 08:32:08,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1185522.0, ans=0.125 2023-06-22 08:32:18,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1185582.0, ans=0.0 2023-06-22 08:32:30,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. 
limit=15.0 2023-06-22 08:32:48,057 INFO [train.py:996] (1/4) Epoch 7, batch 14650, loss[loss=0.2362, simple_loss=0.3223, pruned_loss=0.07507, over 21747.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3229, pruned_loss=0.08839, over 4271719.07 frames. ], batch size: 332, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:33:22,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.921e+02 3.378e+02 4.532e+02 7.463e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-22 08:33:40,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1185822.0, ans=0.125 2023-06-22 08:34:11,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185942.0, ans=0.1 2023-06-22 08:34:24,127 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:34:28,140 INFO [train.py:996] (1/4) Epoch 7, batch 14700, loss[loss=0.2513, simple_loss=0.3466, pruned_loss=0.07799, over 21692.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3165, pruned_loss=0.08238, over 4261632.28 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:34:41,292 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-22 08:34:52,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1186002.0, ans=0.2 2023-06-22 08:35:05,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1186062.0, ans=0.125 2023-06-22 08:35:13,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1186122.0, ans=0.5 2023-06-22 08:35:40,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-22 08:35:52,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=12.0 2023-06-22 08:35:58,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1186242.0, ans=0.2 2023-06-22 08:36:01,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1186242.0, ans=0.09899494936611666 2023-06-22 08:36:19,345 INFO [train.py:996] (1/4) Epoch 7, batch 14750, loss[loss=0.2895, simple_loss=0.3495, pruned_loss=0.1147, over 21266.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3222, pruned_loss=0.08527, over 4265000.28 frames. 
], batch size: 159, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:36:45,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.126e+02 3.786e+02 4.508e+02 7.747e+02, threshold=7.572e+02, percent-clipped=1.0 2023-06-22 08:36:49,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1186362.0, ans=0.025 2023-06-22 08:36:59,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1186362.0, ans=0.2 2023-06-22 08:37:02,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=15.0 2023-06-22 08:37:10,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1186422.0, ans=0.125 2023-06-22 08:37:14,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186422.0, ans=0.1 2023-06-22 08:37:18,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1186482.0, ans=0.2 2023-06-22 08:37:37,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186542.0, ans=0.1 2023-06-22 08:38:03,846 INFO [train.py:996] (1/4) Epoch 7, batch 14800, loss[loss=0.2176, simple_loss=0.2871, pruned_loss=0.07407, over 21366.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3338, pruned_loss=0.0902, over 4263224.33 frames. ], batch size: 211, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:38:19,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1186662.0, ans=0.09899494936611666 2023-06-22 08:38:57,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1186782.0, ans=0.125 2023-06-22 08:39:45,604 INFO [train.py:996] (1/4) Epoch 7, batch 14850, loss[loss=0.2492, simple_loss=0.303, pruned_loss=0.09771, over 21445.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3278, pruned_loss=0.08992, over 4249289.50 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:39:54,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1186902.0, ans=0.125 2023-06-22 08:40:12,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.436e+02 3.807e+02 4.957e+02 1.167e+03, threshold=7.615e+02, percent-clipped=4.0 2023-06-22 08:40:25,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-22 08:40:48,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=22.5 2023-06-22 08:41:18,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1187142.0, ans=0.125 2023-06-22 08:41:32,007 INFO [train.py:996] (1/4) Epoch 7, batch 14900, loss[loss=0.2965, simple_loss=0.3608, pruned_loss=0.1161, over 21617.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.332, pruned_loss=0.09266, over 4251234.06 frames. 
], batch size: 389, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:41:45,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1187202.0, ans=0.0 2023-06-22 08:41:55,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-22 08:41:56,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1187262.0, ans=0.0 2023-06-22 08:42:45,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1187382.0, ans=0.125 2023-06-22 08:42:45,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1187382.0, ans=0.2 2023-06-22 08:43:05,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1187442.0, ans=0.0 2023-06-22 08:43:12,816 INFO [train.py:996] (1/4) Epoch 7, batch 14950, loss[loss=0.2807, simple_loss=0.3501, pruned_loss=0.1057, over 21437.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3318, pruned_loss=0.0918, over 4258236.66 frames. ], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:43:13,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1187502.0, ans=0.125 2023-06-22 08:43:39,828 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.264e+02 3.667e+02 4.078e+02 7.613e+02, threshold=7.333e+02, percent-clipped=0.0 2023-06-22 08:43:57,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1187622.0, ans=0.0 2023-06-22 08:44:16,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1187682.0, ans=0.0 2023-06-22 08:44:24,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1187682.0, ans=0.2 2023-06-22 08:44:52,921 INFO [train.py:996] (1/4) Epoch 7, batch 15000, loss[loss=0.2573, simple_loss=0.3227, pruned_loss=0.09593, over 15360.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3336, pruned_loss=0.0935, over 4259721.86 frames. ], batch size: 60, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:44:52,921 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 08:45:09,858 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2588, simple_loss=0.3554, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-22 08:45:09,859 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 08:46:04,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1187922.0, ans=10.0 2023-06-22 08:46:32,667 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-22 08:46:40,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1188042.0, ans=0.0 2023-06-22 08:46:56,320 INFO [train.py:996] (1/4) Epoch 7, batch 15050, loss[loss=0.2872, simple_loss=0.3736, pruned_loss=0.1004, over 21677.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3319, pruned_loss=0.09315, over 4266196.26 frames. 
], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:47:27,955 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.352e+02 4.069e+02 4.839e+02 9.529e+02, threshold=8.138e+02, percent-clipped=2.0 2023-06-22 08:47:33,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188162.0, ans=0.0 2023-06-22 08:47:40,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1188222.0, ans=0.2 2023-06-22 08:48:27,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1188342.0, ans=0.2 2023-06-22 08:48:39,541 INFO [train.py:996] (1/4) Epoch 7, batch 15100, loss[loss=0.2526, simple_loss=0.3313, pruned_loss=0.08695, over 21616.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3309, pruned_loss=0.09123, over 4263255.21 frames. ], batch size: 389, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:48:54,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1188402.0, ans=0.125 2023-06-22 08:49:01,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1188462.0, ans=0.125 2023-06-22 08:49:29,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=12.0 2023-06-22 08:49:38,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1188582.0, ans=0.125 2023-06-22 08:50:19,367 INFO [train.py:996] (1/4) Epoch 7, batch 15150, loss[loss=0.2618, simple_loss=0.3058, pruned_loss=0.1089, over 21223.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3266, pruned_loss=0.09149, over 4262244.20 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 8.0 2023-06-22 08:50:40,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-22 08:50:49,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.254e+02 3.801e+02 4.686e+02 8.027e+02, threshold=7.602e+02, percent-clipped=0.0 2023-06-22 08:51:05,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1188822.0, ans=0.0 2023-06-22 08:51:18,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1188882.0, ans=0.0 2023-06-22 08:51:21,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1188882.0, ans=0.0 2023-06-22 08:51:23,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.96 vs. limit=6.0 2023-06-22 08:51:37,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1188942.0, ans=0.0 2023-06-22 08:52:04,649 INFO [train.py:996] (1/4) Epoch 7, batch 15200, loss[loss=0.205, simple_loss=0.2824, pruned_loss=0.06383, over 21392.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3185, pruned_loss=0.08723, over 4257664.37 frames. 
], batch size: 194, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:52:49,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1189122.0, ans=0.125 2023-06-22 08:53:02,129 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-06-22 08:53:23,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1189242.0, ans=0.125 2023-06-22 08:53:44,180 INFO [train.py:996] (1/4) Epoch 7, batch 15250, loss[loss=0.2168, simple_loss=0.283, pruned_loss=0.07527, over 21830.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.313, pruned_loss=0.08634, over 4262320.06 frames. ], batch size: 317, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:53:56,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1189302.0, ans=0.2 2023-06-22 08:54:13,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.041e+02 3.715e+02 4.659e+02 9.808e+02, threshold=7.430e+02, percent-clipped=2.0 2023-06-22 08:54:46,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1189482.0, ans=0.125 2023-06-22 08:55:09,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1189542.0, ans=0.125 2023-06-22 08:55:25,369 INFO [train.py:996] (1/4) Epoch 7, batch 15300, loss[loss=0.2941, simple_loss=0.3546, pruned_loss=0.1168, over 21818.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.318, pruned_loss=0.08881, over 4261294.91 frames. ], batch size: 124, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:55:35,791 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:55:45,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1189662.0, ans=0.125 2023-06-22 08:56:25,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1189782.0, ans=0.125 2023-06-22 08:56:59,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189842.0, ans=0.1 2023-06-22 08:56:59,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-22 08:57:04,788 INFO [train.py:996] (1/4) Epoch 7, batch 15350, loss[loss=0.2472, simple_loss=0.3428, pruned_loss=0.07586, over 21881.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3228, pruned_loss=0.09158, over 4264247.08 frames. 
], batch size: 316, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:57:08,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1189902.0, ans=0.0 2023-06-22 08:57:33,811 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.368e+02 3.940e+02 5.271e+02 1.051e+03, threshold=7.879e+02, percent-clipped=5.0 2023-06-22 08:57:57,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1190082.0, ans=0.0 2023-06-22 08:58:13,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-22 08:58:38,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190142.0, ans=0.1 2023-06-22 08:58:43,329 INFO [train.py:996] (1/4) Epoch 7, batch 15400, loss[loss=0.2614, simple_loss=0.3306, pruned_loss=0.09614, over 21243.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3235, pruned_loss=0.09015, over 4273696.53 frames. ], batch size: 143, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:58:51,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1190202.0, ans=0.125 2023-06-22 08:59:21,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1190322.0, ans=0.0 2023-06-22 08:59:36,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1190382.0, ans=0.04949747468305833 2023-06-22 08:59:36,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-22 09:00:22,618 INFO [train.py:996] (1/4) Epoch 7, batch 15450, loss[loss=0.2219, simple_loss=0.3035, pruned_loss=0.07015, over 21477.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3199, pruned_loss=0.08853, over 4272721.00 frames. ], batch size: 548, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:00:28,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-22 09:00:42,317 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:00:51,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.924e+02 3.383e+02 4.121e+02 7.553e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-22 09:00:55,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-22 09:00:56,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1190562.0, ans=0.0 2023-06-22 09:01:53,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1190742.0, ans=0.0 2023-06-22 09:02:02,980 INFO [train.py:996] (1/4) Epoch 7, batch 15500, loss[loss=0.2575, simple_loss=0.3234, pruned_loss=0.09581, over 21469.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3241, pruned_loss=0.0885, over 4269018.40 frames. 
], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:02:03,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1190802.0, ans=0.125 2023-06-22 09:02:26,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1190862.0, ans=0.2 2023-06-22 09:03:20,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1190982.0, ans=0.125 2023-06-22 09:03:26,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.52 vs. limit=6.0 2023-06-22 09:03:27,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-22 09:03:28,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1191042.0, ans=10.0 2023-06-22 09:03:48,172 INFO [train.py:996] (1/4) Epoch 7, batch 15550, loss[loss=0.2038, simple_loss=0.2895, pruned_loss=0.05909, over 21597.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3215, pruned_loss=0.08567, over 4273004.45 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:03:54,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-22 09:03:54,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1191102.0, ans=0.125 2023-06-22 09:04:00,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-22 09:04:12,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.104e+02 3.542e+02 4.427e+02 7.965e+02, threshold=7.084e+02, percent-clipped=2.0 2023-06-22 09:04:58,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1191282.0, ans=0.125 2023-06-22 09:05:14,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1191342.0, ans=0.0 2023-06-22 09:05:21,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-22 09:05:21,987 INFO [train.py:996] (1/4) Epoch 7, batch 15600, loss[loss=0.3171, simple_loss=0.445, pruned_loss=0.09464, over 19849.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3165, pruned_loss=0.08408, over 4254464.73 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 09:05:39,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-22 09:05:55,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1191462.0, ans=0.125 2023-06-22 09:06:32,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. 
limit=15.0 2023-06-22 09:07:01,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1191702.0, ans=0.0 2023-06-22 09:07:08,496 INFO [train.py:996] (1/4) Epoch 7, batch 15650, loss[loss=0.2476, simple_loss=0.3107, pruned_loss=0.09224, over 21363.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3181, pruned_loss=0.08409, over 4250923.66 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:07:09,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-22 09:07:26,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191762.0, ans=0.1 2023-06-22 09:07:38,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.201e+02 3.774e+02 4.746e+02 8.455e+02, threshold=7.547e+02, percent-clipped=5.0 2023-06-22 09:08:22,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1191882.0, ans=0.0 2023-06-22 09:08:22,879 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:08:47,636 INFO [train.py:996] (1/4) Epoch 7, batch 15700, loss[loss=0.225, simple_loss=0.3076, pruned_loss=0.07123, over 21727.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3142, pruned_loss=0.08351, over 4251802.85 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:08:56,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1192002.0, ans=0.125 2023-06-22 09:10:01,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1192182.0, ans=0.125 2023-06-22 09:10:05,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1192242.0, ans=0.5 2023-06-22 09:10:11,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-22 09:10:27,348 INFO [train.py:996] (1/4) Epoch 7, batch 15750, loss[loss=0.2081, simple_loss=0.274, pruned_loss=0.07115, over 21755.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3092, pruned_loss=0.08334, over 4248477.42 frames. 
], batch size: 112, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:10:27,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1192302.0, ans=0.0 2023-06-22 09:10:38,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1192302.0, ans=0.125 2023-06-22 09:10:56,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.176e+02 3.735e+02 4.754e+02 7.774e+02, threshold=7.471e+02, percent-clipped=1.0 2023-06-22 09:11:35,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1192482.0, ans=0.0 2023-06-22 09:12:01,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192542.0, ans=0.1 2023-06-22 09:12:07,000 INFO [train.py:996] (1/4) Epoch 7, batch 15800, loss[loss=0.2202, simple_loss=0.2869, pruned_loss=0.07678, over 21504.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.305, pruned_loss=0.08324, over 4255509.41 frames. ], batch size: 195, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:12:07,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1192602.0, ans=0.2 2023-06-22 09:12:12,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1192602.0, ans=0.0 2023-06-22 09:12:37,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1192662.0, ans=0.125 2023-06-22 09:12:40,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1192662.0, ans=0.0 2023-06-22 09:13:16,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-22 09:13:45,630 INFO [train.py:996] (1/4) Epoch 7, batch 15850, loss[loss=0.2548, simple_loss=0.3242, pruned_loss=0.09268, over 21563.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3072, pruned_loss=0.08524, over 4251462.12 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:13:58,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1192902.0, ans=0.125 2023-06-22 09:14:15,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.067e+02 3.802e+02 4.626e+02 8.154e+02, threshold=7.604e+02, percent-clipped=3.0 2023-06-22 09:14:32,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1193022.0, ans=0.125 2023-06-22 09:14:56,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1193082.0, ans=0.5 2023-06-22 09:15:26,024 INFO [train.py:996] (1/4) Epoch 7, batch 15900, loss[loss=0.22, simple_loss=0.2868, pruned_loss=0.07662, over 21820.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3058, pruned_loss=0.08589, over 4253514.07 frames. 
], batch size: 107, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:15:58,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1193262.0, ans=0.125 2023-06-22 09:16:13,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 09:16:14,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1193322.0, ans=0.125 2023-06-22 09:16:32,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1193382.0, ans=0.07 2023-06-22 09:16:57,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1193442.0, ans=0.0 2023-06-22 09:17:05,286 INFO [train.py:996] (1/4) Epoch 7, batch 15950, loss[loss=0.1586, simple_loss=0.2451, pruned_loss=0.03602, over 21496.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3053, pruned_loss=0.08268, over 4254733.86 frames. ], batch size: 211, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:17:15,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1193502.0, ans=0.125 2023-06-22 09:17:18,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1193502.0, ans=0.0 2023-06-22 09:17:26,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-22 09:17:28,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1193562.0, ans=0.125 2023-06-22 09:17:31,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.037e+02 3.517e+02 4.251e+02 9.007e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 09:17:38,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1193562.0, ans=0.0 2023-06-22 09:17:43,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1193622.0, ans=0.0 2023-06-22 09:18:12,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1193682.0, ans=0.125 2023-06-22 09:18:46,881 INFO [train.py:996] (1/4) Epoch 7, batch 16000, loss[loss=0.1962, simple_loss=0.2865, pruned_loss=0.05293, over 21399.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3065, pruned_loss=0.08054, over 4249006.25 frames. ], batch size: 211, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:19:04,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 09:19:50,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 09:19:54,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=15.0 2023-06-22 09:20:09,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1194042.0, ans=0.125 2023-06-22 09:20:10,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194042.0, ans=0.1 2023-06-22 09:20:16,544 INFO [train.py:996] (1/4) Epoch 7, batch 16050, loss[loss=0.187, simple_loss=0.2714, pruned_loss=0.05129, over 21438.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3095, pruned_loss=0.0791, over 4259912.45 frames. ], batch size: 194, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:20:47,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.171e+02 3.896e+02 5.247e+02 9.817e+02, threshold=7.791e+02, percent-clipped=4.0 2023-06-22 09:21:24,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1194282.0, ans=0.125 2023-06-22 09:21:55,856 INFO [train.py:996] (1/4) Epoch 7, batch 16100, loss[loss=0.1911, simple_loss=0.2876, pruned_loss=0.04728, over 21674.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3143, pruned_loss=0.08029, over 4270734.25 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:22:16,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-22 09:22:37,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1194522.0, ans=0.0 2023-06-22 09:23:12,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1194582.0, ans=0.125 2023-06-22 09:23:26,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-22 09:23:30,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1194642.0, ans=0.0 2023-06-22 09:23:35,231 INFO [train.py:996] (1/4) Epoch 7, batch 16150, loss[loss=0.2459, simple_loss=0.3097, pruned_loss=0.09098, over 21481.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3152, pruned_loss=0.08303, over 4282494.25 frames. 
], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:23:52,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194702.0, ans=0.1 2023-06-22 09:23:55,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1194762.0, ans=0.125 2023-06-22 09:24:02,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1194762.0, ans=0.07 2023-06-22 09:24:02,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1194762.0, ans=0.125 2023-06-22 09:24:03,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1194762.0, ans=0.125 2023-06-22 09:24:08,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.102e+02 3.921e+02 4.852e+02 9.563e+02, threshold=7.842e+02, percent-clipped=2.0 2023-06-22 09:25:18,421 INFO [train.py:996] (1/4) Epoch 7, batch 16200, loss[loss=0.2887, simple_loss=0.358, pruned_loss=0.1097, over 21453.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3189, pruned_loss=0.08483, over 4280877.01 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:25:18,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195002.0, ans=0.1 2023-06-22 09:25:54,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1195122.0, ans=0.125 2023-06-22 09:26:59,855 INFO [train.py:996] (1/4) Epoch 7, batch 16250, loss[loss=0.2153, simple_loss=0.2876, pruned_loss=0.07153, over 21856.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3183, pruned_loss=0.0851, over 4277795.91 frames. ], batch size: 373, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:27:02,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1195302.0, ans=0.125 2023-06-22 09:27:08,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1195302.0, ans=0.04949747468305833 2023-06-22 09:27:26,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1195362.0, ans=0.125 2023-06-22 09:27:31,797 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.018e+02 3.500e+02 4.433e+02 8.732e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-22 09:28:03,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1195482.0, ans=0.125 2023-06-22 09:28:16,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195482.0, ans=0.1 2023-06-22 09:28:36,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1195542.0, ans=0.0 2023-06-22 09:28:40,770 INFO [train.py:996] (1/4) Epoch 7, batch 16300, loss[loss=0.1845, simple_loss=0.2618, pruned_loss=0.0536, over 21247.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3121, pruned_loss=0.08126, over 4270854.33 frames. 
], batch size: 176, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:29:49,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1195782.0, ans=0.2 2023-06-22 09:30:05,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1195782.0, ans=0.09899494936611666 2023-06-22 09:30:24,333 INFO [train.py:996] (1/4) Epoch 7, batch 16350, loss[loss=0.2066, simple_loss=0.2849, pruned_loss=0.06413, over 21671.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3117, pruned_loss=0.08202, over 4269593.69 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:31:06,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.307e+02 4.109e+02 5.521e+02 1.139e+03, threshold=8.218e+02, percent-clipped=11.0 2023-06-22 09:31:30,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1196082.0, ans=0.0 2023-06-22 09:31:39,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1196082.0, ans=0.2 2023-06-22 09:32:07,016 INFO [train.py:996] (1/4) Epoch 7, batch 16400, loss[loss=0.2363, simple_loss=0.3309, pruned_loss=0.07087, over 21310.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3156, pruned_loss=0.08451, over 4276830.61 frames. ], batch size: 548, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:32:33,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1196262.0, ans=0.0 2023-06-22 09:33:01,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-22 09:33:05,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-22 09:33:08,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1196382.0, ans=0.125 2023-06-22 09:33:31,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1196442.0, ans=0.0 2023-06-22 09:33:47,423 INFO [train.py:996] (1/4) Epoch 7, batch 16450, loss[loss=0.2247, simple_loss=0.2996, pruned_loss=0.07485, over 21419.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3162, pruned_loss=0.08551, over 4285011.99 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:34:15,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1196562.0, ans=0.09899494936611666 2023-06-22 09:34:18,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1196562.0, ans=0.125 2023-06-22 09:34:29,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.059e+02 3.522e+02 4.400e+02 7.364e+02, threshold=7.044e+02, percent-clipped=0.0 2023-06-22 09:34:29,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1196562.0, ans=0.125 2023-06-22 09:34:35,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=15.0 2023-06-22 09:35:28,324 INFO [train.py:996] (1/4) Epoch 7, batch 16500, loss[loss=0.2093, simple_loss=0.2838, pruned_loss=0.0674, over 21759.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3154, pruned_loss=0.08538, over 4285249.94 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:36:17,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1196922.0, ans=0.125 2023-06-22 09:36:19,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1196922.0, ans=0.0 2023-06-22 09:36:49,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-22 09:37:12,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197042.0, ans=0.1 2023-06-22 09:37:16,099 INFO [train.py:996] (1/4) Epoch 7, batch 16550, loss[loss=0.4169, simple_loss=0.5134, pruned_loss=0.1602, over 19792.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3148, pruned_loss=0.08406, over 4269447.98 frames. ], batch size: 702, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:37:32,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1197102.0, ans=0.125 2023-06-22 09:37:48,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1197162.0, ans=0.2 2023-06-22 09:37:53,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.830e+02 4.900e+02 6.619e+02 1.240e+03, threshold=9.800e+02, percent-clipped=18.0 2023-06-22 09:38:23,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1197282.0, ans=0.125 2023-06-22 09:39:08,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.10 vs. limit=15.0 2023-06-22 09:39:08,766 INFO [train.py:996] (1/4) Epoch 7, batch 16600, loss[loss=0.2192, simple_loss=0.3183, pruned_loss=0.06007, over 20722.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3219, pruned_loss=0.08672, over 4270808.69 frames. ], batch size: 607, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:39:22,155 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:39:35,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1197462.0, ans=0.0 2023-06-22 09:40:09,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1197582.0, ans=10.0 2023-06-22 09:40:13,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1197582.0, ans=0.125 2023-06-22 09:40:50,886 INFO [train.py:996] (1/4) Epoch 7, batch 16650, loss[loss=0.2818, simple_loss=0.3622, pruned_loss=0.1007, over 21464.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.331, pruned_loss=0.08814, over 4268347.59 frames. 
], batch size: 131, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:41:14,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1197762.0, ans=0.2 2023-06-22 09:41:23,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1197762.0, ans=0.0 2023-06-22 09:41:26,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.520e+02 3.910e+02 4.811e+02 1.011e+03, threshold=7.820e+02, percent-clipped=1.0 2023-06-22 09:41:45,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-22 09:42:39,971 INFO [train.py:996] (1/4) Epoch 7, batch 16700, loss[loss=0.209, simple_loss=0.2673, pruned_loss=0.07537, over 21184.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3332, pruned_loss=0.08902, over 4266296.62 frames. ], batch size: 143, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:43:30,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-22 09:44:03,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1198182.0, ans=0.2 2023-06-22 09:44:06,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1198242.0, ans=0.125 2023-06-22 09:44:13,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1198242.0, ans=0.125 2023-06-22 09:44:27,977 INFO [train.py:996] (1/4) Epoch 7, batch 16750, loss[loss=0.3066, simple_loss=0.3861, pruned_loss=0.1135, over 21641.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3373, pruned_loss=0.09238, over 4265359.95 frames. ], batch size: 389, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:44:36,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-22 09:45:09,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 3.471e+02 3.936e+02 4.958e+02 1.171e+03, threshold=7.873e+02, percent-clipped=3.0 2023-06-22 09:46:03,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1198542.0, ans=0.0 2023-06-22 09:46:11,338 INFO [train.py:996] (1/4) Epoch 7, batch 16800, loss[loss=0.3184, simple_loss=0.4115, pruned_loss=0.1127, over 21298.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3422, pruned_loss=0.09251, over 4260656.50 frames. ], batch size: 548, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:46:53,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-22 09:47:30,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-22 09:47:51,173 INFO [train.py:996] (1/4) Epoch 7, batch 16850, loss[loss=0.2702, simple_loss=0.3289, pruned_loss=0.1058, over 21765.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3377, pruned_loss=0.09222, over 4267628.23 frames. 
], batch size: 473, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:47:53,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1198902.0, ans=0.0 2023-06-22 09:48:20,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1198962.0, ans=0.05 2023-06-22 09:48:20,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-22 09:48:29,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.467e+02 4.300e+02 5.663e+02 1.182e+03, threshold=8.599e+02, percent-clipped=7.0 2023-06-22 09:48:39,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1199022.0, ans=0.125 2023-06-22 09:49:29,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1199202.0, ans=0.125 2023-06-22 09:49:30,257 INFO [train.py:996] (1/4) Epoch 7, batch 16900, loss[loss=0.2216, simple_loss=0.2871, pruned_loss=0.07801, over 21639.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3314, pruned_loss=0.09052, over 4269078.69 frames. ], batch size: 247, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:50:11,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199322.0, ans=0.1 2023-06-22 09:50:22,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1199322.0, ans=0.0 2023-06-22 09:50:26,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-22 09:50:37,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1199382.0, ans=0.125 2023-06-22 09:51:05,729 INFO [train.py:996] (1/4) Epoch 7, batch 16950, loss[loss=0.2192, simple_loss=0.2928, pruned_loss=0.07275, over 21891.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3228, pruned_loss=0.08763, over 4274712.75 frames. ], batch size: 118, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:51:35,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199562.0, ans=0.1 2023-06-22 09:51:39,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1199562.0, ans=0.125 2023-06-22 09:51:45,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.915e+02 3.202e+02 3.763e+02 5.382e+02, threshold=6.404e+02, percent-clipped=0.0 2023-06-22 09:51:50,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1199622.0, ans=0.0 2023-06-22 09:52:50,022 INFO [train.py:996] (1/4) Epoch 7, batch 17000, loss[loss=0.2331, simple_loss=0.2994, pruned_loss=0.08341, over 21597.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3187, pruned_loss=0.08778, over 4281994.66 frames. 
], batch size: 548, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:53:13,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1199862.0, ans=0.125 2023-06-22 09:53:43,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199922.0, ans=0.1 2023-06-22 09:53:44,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1199922.0, ans=0.125 2023-06-22 09:54:38,330 INFO [train.py:996] (1/4) Epoch 7, batch 17050, loss[loss=0.2452, simple_loss=0.3263, pruned_loss=0.08201, over 21813.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3271, pruned_loss=0.09056, over 4287912.37 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:54:51,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1200102.0, ans=0.125 2023-06-22 09:55:08,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.382e+02 4.158e+02 4.859e+02 8.252e+02, threshold=8.317e+02, percent-clipped=8.0 2023-06-22 09:55:46,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-22 09:55:47,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1200282.0, ans=0.125 2023-06-22 09:56:17,485 INFO [train.py:996] (1/4) Epoch 7, batch 17100, loss[loss=0.2422, simple_loss=0.3115, pruned_loss=0.08643, over 21783.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3264, pruned_loss=0.09108, over 4294134.64 frames. ], batch size: 441, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:56:43,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1200462.0, ans=0.0 2023-06-22 09:57:04,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-22 09:57:16,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1200582.0, ans=0.125 2023-06-22 09:57:26,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-22 09:57:50,022 INFO [train.py:996] (1/4) Epoch 7, batch 17150, loss[loss=0.2162, simple_loss=0.2891, pruned_loss=0.07165, over 21723.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3221, pruned_loss=0.09109, over 4300766.53 frames. ], batch size: 247, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:58:04,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-22 09:58:30,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.032e+02 3.543e+02 4.123e+02 6.537e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-22 09:58:32,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.78 vs. 
limit=5.0 2023-06-22 09:58:33,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1200822.0, ans=0.0 2023-06-22 09:58:42,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-22 09:58:48,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.12 vs. limit=6.0 2023-06-22 09:59:37,116 INFO [train.py:996] (1/4) Epoch 7, batch 17200, loss[loss=0.2579, simple_loss=0.3264, pruned_loss=0.09472, over 21470.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3207, pruned_loss=0.08944, over 4296019.03 frames. ], batch size: 194, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:00:23,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-22 10:01:20,181 INFO [train.py:996] (1/4) Epoch 7, batch 17250, loss[loss=0.2644, simple_loss=0.3404, pruned_loss=0.09416, over 21342.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3252, pruned_loss=0.09206, over 4292589.16 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:01:34,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1201302.0, ans=0.125 2023-06-22 10:01:42,377 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.468e-03 2023-06-22 10:02:01,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.703e+02 3.318e+02 3.860e+02 4.888e+02 8.680e+02, threshold=7.720e+02, percent-clipped=6.0 2023-06-22 10:02:17,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201422.0, ans=0.1 2023-06-22 10:03:07,146 INFO [train.py:996] (1/4) Epoch 7, batch 17300, loss[loss=0.274, simple_loss=0.3534, pruned_loss=0.09727, over 21175.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3341, pruned_loss=0.09554, over 4291638.38 frames. ], batch size: 143, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:03:07,672 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:03:29,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1201662.0, ans=0.125 2023-06-22 10:04:05,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1201722.0, ans=0.0 2023-06-22 10:04:50,684 INFO [train.py:996] (1/4) Epoch 7, batch 17350, loss[loss=0.2047, simple_loss=0.2894, pruned_loss=0.06004, over 21645.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3351, pruned_loss=0.09528, over 4296246.10 frames. 
], batch size: 230, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:04:51,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1201902.0, ans=0.2 2023-06-22 10:05:36,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.363e+02 3.779e+02 4.471e+02 7.201e+02, threshold=7.558e+02, percent-clipped=0.0 2023-06-22 10:05:48,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1202022.0, ans=0.125 2023-06-22 10:05:53,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1202082.0, ans=0.0 2023-06-22 10:06:08,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1202082.0, ans=0.2 2023-06-22 10:06:16,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1202142.0, ans=0.0 2023-06-22 10:06:37,420 INFO [train.py:996] (1/4) Epoch 7, batch 17400, loss[loss=0.363, simple_loss=0.4262, pruned_loss=0.1499, over 21508.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3311, pruned_loss=0.09152, over 4288644.41 frames. ], batch size: 508, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:07:32,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1202322.0, ans=0.125 2023-06-22 10:08:24,864 INFO [train.py:996] (1/4) Epoch 7, batch 17450, loss[loss=0.228, simple_loss=0.3222, pruned_loss=0.06689, over 21701.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3278, pruned_loss=0.08872, over 4284461.36 frames. ], batch size: 414, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:08:37,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1202502.0, ans=0.1 2023-06-22 10:09:02,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.174e+02 3.775e+02 5.488e+02 9.226e+02, threshold=7.551e+02, percent-clipped=5.0 2023-06-22 10:09:08,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-22 10:09:11,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-22 10:09:24,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-22 10:09:37,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1202682.0, ans=0.0 2023-06-22 10:09:51,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1202742.0, ans=0.125 2023-06-22 10:10:06,323 INFO [train.py:996] (1/4) Epoch 7, batch 17500, loss[loss=0.2155, simple_loss=0.2836, pruned_loss=0.07368, over 21138.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3228, pruned_loss=0.0853, over 4282809.89 frames. 
], batch size: 608, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:10:28,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1202862.0, ans=0.125 2023-06-22 10:10:28,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1202862.0, ans=0.0 2023-06-22 10:10:32,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.46 vs. limit=10.0 2023-06-22 10:10:46,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1202922.0, ans=0.1 2023-06-22 10:11:12,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1202982.0, ans=0.0 2023-06-22 10:11:31,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1203042.0, ans=0.125 2023-06-22 10:11:41,910 INFO [train.py:996] (1/4) Epoch 7, batch 17550, loss[loss=0.2137, simple_loss=0.3033, pruned_loss=0.06206, over 21796.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3223, pruned_loss=0.08438, over 4282824.30 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:12:14,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.829e+02 3.350e+02 3.891e+02 7.522e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-22 10:12:44,757 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:13:22,602 INFO [train.py:996] (1/4) Epoch 7, batch 17600, loss[loss=0.2927, simple_loss=0.3535, pruned_loss=0.116, over 21933.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3258, pruned_loss=0.0857, over 4285766.36 frames. ], batch size: 372, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:13:25,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1203402.0, ans=0.125 2023-06-22 10:13:32,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1203402.0, ans=0.07 2023-06-22 10:13:49,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1203462.0, ans=0.025 2023-06-22 10:13:54,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1203522.0, ans=0.125 2023-06-22 10:14:33,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1203582.0, ans=0.0 2023-06-22 10:14:41,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1203642.0, ans=0.2 2023-06-22 10:15:03,732 INFO [train.py:996] (1/4) Epoch 7, batch 17650, loss[loss=0.2064, simple_loss=0.2698, pruned_loss=0.07152, over 21635.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3236, pruned_loss=0.08632, over 4279918.95 frames. ], batch size: 263, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:15:10,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. 
limit=10.0 2023-06-22 10:15:15,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-22 10:15:36,733 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.660e+02 3.234e+02 3.859e+02 4.407e+02 8.519e+02, threshold=7.719e+02, percent-clipped=7.0 2023-06-22 10:16:45,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1204002.0, ans=0.125 2023-06-22 10:16:46,343 INFO [train.py:996] (1/4) Epoch 7, batch 17700, loss[loss=0.2605, simple_loss=0.3432, pruned_loss=0.08891, over 21742.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3166, pruned_loss=0.08291, over 4281261.16 frames. ], batch size: 351, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:16:54,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.57 vs. limit=10.0 2023-06-22 10:17:07,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1204062.0, ans=0.2 2023-06-22 10:17:44,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.13 vs. limit=10.0 2023-06-22 10:17:51,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-22 10:18:29,560 INFO [train.py:996] (1/4) Epoch 7, batch 17750, loss[loss=0.2885, simple_loss=0.3732, pruned_loss=0.1019, over 21844.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3245, pruned_loss=0.08611, over 4280760.30 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:18:31,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204302.0, ans=0.1 2023-06-22 10:18:50,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1204362.0, ans=0.0 2023-06-22 10:19:13,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.318e+02 4.087e+02 5.384e+02 1.002e+03, threshold=8.174e+02, percent-clipped=10.0 2023-06-22 10:20:11,951 INFO [train.py:996] (1/4) Epoch 7, batch 17800, loss[loss=0.2168, simple_loss=0.3022, pruned_loss=0.0657, over 20764.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3226, pruned_loss=0.08493, over 4274193.78 frames. ], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:21:18,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1204782.0, ans=0.09899494936611666 2023-06-22 10:21:38,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1204842.0, ans=0.0 2023-06-22 10:21:39,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1204842.0, ans=0.0 2023-06-22 10:21:55,084 INFO [train.py:996] (1/4) Epoch 7, batch 17850, loss[loss=0.3548, simple_loss=0.4193, pruned_loss=0.1451, over 21500.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3232, pruned_loss=0.08566, over 4271859.75 frames. 
], batch size: 471, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:21:57,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1204902.0, ans=0.125 2023-06-22 10:22:02,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1204902.0, ans=0.125 2023-06-22 10:22:45,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.209e+02 3.990e+02 4.443e+02 8.332e+02, threshold=7.980e+02, percent-clipped=3.0 2023-06-22 10:22:56,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1205022.0, ans=0.1 2023-06-22 10:23:38,648 INFO [train.py:996] (1/4) Epoch 7, batch 17900, loss[loss=0.2231, simple_loss=0.2866, pruned_loss=0.07978, over 20302.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3289, pruned_loss=0.08851, over 4270642.15 frames. ], batch size: 702, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:24:16,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1205262.0, ans=0.125 2023-06-22 10:24:46,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1205382.0, ans=0.125 2023-06-22 10:25:07,548 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:25:24,795 INFO [train.py:996] (1/4) Epoch 7, batch 17950, loss[loss=0.2266, simple_loss=0.3243, pruned_loss=0.06446, over 21596.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3275, pruned_loss=0.08518, over 4266163.12 frames. ], batch size: 441, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:25:50,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1205562.0, ans=0.125 2023-06-22 10:26:08,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.180e+02 3.649e+02 4.821e+02 7.234e+02, threshold=7.298e+02, percent-clipped=0.0 2023-06-22 10:26:58,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1205742.0, ans=0.125 2023-06-22 10:27:10,981 INFO [train.py:996] (1/4) Epoch 7, batch 18000, loss[loss=0.2091, simple_loss=0.2738, pruned_loss=0.07217, over 21787.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3195, pruned_loss=0.08232, over 4252243.26 frames. ], batch size: 118, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:27:10,982 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 10:27:22,416 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5610, 4.1838, 4.0727, 2.5098], device='cuda:1') 2023-06-22 10:27:30,135 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.265, simple_loss=0.3646, pruned_loss=0.08269, over 1796401.00 frames. 2023-06-22 10:27:30,136 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 10:27:56,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1205862.0, ans=0.125 2023-06-22 10:28:23,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. 
limit=15.0 2023-06-22 10:28:30,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-22 10:28:51,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1206042.0, ans=0.125 2023-06-22 10:29:01,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1206042.0, ans=0.0 2023-06-22 10:29:12,740 INFO [train.py:996] (1/4) Epoch 7, batch 18050, loss[loss=0.2618, simple_loss=0.3225, pruned_loss=0.1006, over 21613.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3133, pruned_loss=0.08158, over 4250208.09 frames. ], batch size: 263, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:29:23,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1206102.0, ans=0.2 2023-06-22 10:29:26,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1206102.0, ans=0.2 2023-06-22 10:29:41,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1206162.0, ans=0.2 2023-06-22 10:29:52,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 3.561e+02 4.207e+02 5.144e+02 1.104e+03, threshold=8.414e+02, percent-clipped=10.0 2023-06-22 10:29:55,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.01 vs. limit=15.0 2023-06-22 10:30:55,016 INFO [train.py:996] (1/4) Epoch 7, batch 18100, loss[loss=0.2753, simple_loss=0.3505, pruned_loss=0.1001, over 21289.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3175, pruned_loss=0.08349, over 4247063.52 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:30:55,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-22 10:31:18,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1206462.0, ans=0.125 2023-06-22 10:31:34,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206522.0, ans=0.125 2023-06-22 10:31:37,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1206522.0, ans=0.125 2023-06-22 10:32:13,605 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:32:29,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206642.0, ans=0.125 2023-06-22 10:32:34,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1206702.0, ans=0.125 2023-06-22 10:32:35,108 INFO [train.py:996] (1/4) Epoch 7, batch 18150, loss[loss=0.2283, simple_loss=0.2867, pruned_loss=0.08494, over 21852.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.32, pruned_loss=0.08359, over 4255413.00 frames. 
], batch size: 107, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:33:15,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.134e+02 3.517e+02 4.943e+02 8.965e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 10:33:37,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.94 vs. limit=10.0 2023-06-22 10:33:38,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1206882.0, ans=0.5 2023-06-22 10:33:38,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 10:34:01,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206942.0, ans=0.125 2023-06-22 10:34:13,171 INFO [train.py:996] (1/4) Epoch 7, batch 18200, loss[loss=0.2084, simple_loss=0.2725, pruned_loss=0.07213, over 21392.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3156, pruned_loss=0.08445, over 4255706.70 frames. ], batch size: 144, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:34:15,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207002.0, ans=0.1 2023-06-22 10:34:30,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1207062.0, ans=0.0 2023-06-22 10:34:37,531 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-22 10:35:08,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-22 10:35:29,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-22 10:35:34,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1207242.0, ans=0.0 2023-06-22 10:35:50,313 INFO [train.py:996] (1/4) Epoch 7, batch 18250, loss[loss=0.1746, simple_loss=0.2537, pruned_loss=0.0478, over 21776.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3064, pruned_loss=0.08084, over 4270616.35 frames. ], batch size: 124, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:35:55,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1207302.0, ans=0.0 2023-06-22 10:35:55,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1207302.0, ans=0.0 2023-06-22 10:36:25,628 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.178e+02 4.108e+02 6.214e+02 1.567e+03, threshold=8.215e+02, percent-clipped=16.0 2023-06-22 10:37:03,490 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=22.5 2023-06-22 10:37:29,336 INFO [train.py:996] (1/4) Epoch 7, batch 18300, loss[loss=0.2522, simple_loss=0.361, pruned_loss=0.0717, over 20962.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3055, pruned_loss=0.08105, over 4265917.46 frames. 
], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:37:29,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1207602.0, ans=0.2 2023-06-22 10:37:44,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1207662.0, ans=0.125 2023-06-22 10:37:46,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1207662.0, ans=10.0 2023-06-22 10:38:25,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1207782.0, ans=0.035 2023-06-22 10:38:25,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1207782.0, ans=0.125 2023-06-22 10:38:54,865 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:39:08,705 INFO [train.py:996] (1/4) Epoch 7, batch 18350, loss[loss=0.2215, simple_loss=0.2926, pruned_loss=0.0752, over 21711.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3126, pruned_loss=0.08136, over 4268394.14 frames. ], batch size: 316, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:12,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-22 10:39:12,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-22 10:39:43,993 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.179e+02 3.735e+02 4.992e+02 1.231e+03, threshold=7.469e+02, percent-clipped=7.0 2023-06-22 10:39:46,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1208022.0, ans=0.05 2023-06-22 10:39:46,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1208022.0, ans=0.125 2023-06-22 10:40:49,862 INFO [train.py:996] (1/4) Epoch 7, batch 18400, loss[loss=0.1984, simple_loss=0.2662, pruned_loss=0.06536, over 21394.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3074, pruned_loss=0.08027, over 4258489.65 frames. ], batch size: 131, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:41:05,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1208262.0, ans=0.2 2023-06-22 10:42:29,252 INFO [train.py:996] (1/4) Epoch 7, batch 18450, loss[loss=0.2312, simple_loss=0.3177, pruned_loss=0.07234, over 21556.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3047, pruned_loss=0.07718, over 4252996.13 frames. 
], batch size: 442, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:42:29,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208502.0, ans=0.1 2023-06-22 10:42:31,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1208502.0, ans=0.125 2023-06-22 10:42:51,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1208562.0, ans=0.125 2023-06-22 10:43:04,264 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.170e+02 3.772e+02 5.072e+02 1.044e+03, threshold=7.545e+02, percent-clipped=1.0 2023-06-22 10:43:11,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208622.0, ans=0.1 2023-06-22 10:43:24,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1208682.0, ans=0.125 2023-06-22 10:44:09,106 INFO [train.py:996] (1/4) Epoch 7, batch 18500, loss[loss=0.2229, simple_loss=0.2815, pruned_loss=0.08212, over 21380.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.299, pruned_loss=0.07543, over 4239322.72 frames. ], batch size: 160, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:44:13,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-22 10:44:29,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208862.0, ans=0.1 2023-06-22 10:44:40,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208922.0, ans=0.1 2023-06-22 10:45:39,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-22 10:45:50,062 INFO [train.py:996] (1/4) Epoch 7, batch 18550, loss[loss=0.2179, simple_loss=0.2855, pruned_loss=0.07513, over 21742.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2982, pruned_loss=0.07518, over 4233605.93 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:46:06,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1209162.0, ans=0.125 2023-06-22 10:46:32,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.124e+02 3.693e+02 4.756e+02 1.140e+03, threshold=7.385e+02, percent-clipped=12.0 2023-06-22 10:46:40,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1209222.0, ans=0.125 2023-06-22 10:47:12,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.47 vs. limit=10.0 2023-06-22 10:47:30,160 INFO [train.py:996] (1/4) Epoch 7, batch 18600, loss[loss=0.2123, simple_loss=0.2715, pruned_loss=0.07659, over 20764.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2969, pruned_loss=0.07583, over 4245020.12 frames. 
], batch size: 608, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:47:33,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1209402.0, ans=0.125 2023-06-22 10:49:09,432 INFO [train.py:996] (1/4) Epoch 7, batch 18650, loss[loss=0.21, simple_loss=0.2716, pruned_loss=0.07423, over 21489.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2967, pruned_loss=0.07603, over 4244697.70 frames. ], batch size: 230, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:49:19,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1209702.0, ans=0.125 2023-06-22 10:49:27,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209762.0, ans=0.1 2023-06-22 10:49:41,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.58 vs. limit=6.0 2023-06-22 10:49:45,395 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 3.160e+02 3.578e+02 4.366e+02 8.700e+02, threshold=7.156e+02, percent-clipped=2.0 2023-06-22 10:50:08,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-22 10:50:30,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1209942.0, ans=0.035 2023-06-22 10:50:33,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209942.0, ans=0.1 2023-06-22 10:50:35,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-22 10:50:44,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1209942.0, ans=0.125 2023-06-22 10:50:45,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-22 10:50:47,164 INFO [train.py:996] (1/4) Epoch 7, batch 18700, loss[loss=0.1988, simple_loss=0.2698, pruned_loss=0.06388, over 21690.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2957, pruned_loss=0.07777, over 4243132.19 frames. ], batch size: 264, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:50:59,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1210002.0, ans=0.0 2023-06-22 10:51:07,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1210062.0, ans=0.125 2023-06-22 10:51:59,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1210182.0, ans=0.0 2023-06-22 10:52:20,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-22 10:52:26,879 INFO [train.py:996] (1/4) Epoch 7, batch 18750, loss[loss=0.2357, simple_loss=0.2966, pruned_loss=0.08744, over 21343.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2973, pruned_loss=0.08035, over 4256268.93 frames. 
], batch size: 159, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:52:27,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1210302.0, ans=0.0 2023-06-22 10:52:33,781 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:53:03,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.195e+02 3.885e+02 4.969e+02 1.061e+03, threshold=7.770e+02, percent-clipped=4.0 2023-06-22 10:53:04,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1210422.0, ans=0.0 2023-06-22 10:53:47,595 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:54:01,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1210542.0, ans=0.0 2023-06-22 10:54:05,699 INFO [train.py:996] (1/4) Epoch 7, batch 18800, loss[loss=0.2864, simple_loss=0.3739, pruned_loss=0.0995, over 21625.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3049, pruned_loss=0.08247, over 4265651.61 frames. ], batch size: 389, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:54:40,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-06-22 10:54:57,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-22 10:55:44,308 INFO [train.py:996] (1/4) Epoch 7, batch 18850, loss[loss=0.1816, simple_loss=0.2753, pruned_loss=0.04393, over 21691.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3026, pruned_loss=0.07798, over 4274732.81 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:56:03,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1210962.0, ans=0.125 2023-06-22 10:56:19,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1211022.0, ans=0.125 2023-06-22 10:56:21,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 3.160e+02 3.995e+02 5.299e+02 8.301e+02, threshold=7.991e+02, percent-clipped=3.0 2023-06-22 10:57:24,834 INFO [train.py:996] (1/4) Epoch 7, batch 18900, loss[loss=0.2511, simple_loss=0.3115, pruned_loss=0.09536, over 21856.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2993, pruned_loss=0.07811, over 4270153.86 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:57:38,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-22 10:58:05,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1211322.0, ans=0.0 2023-06-22 10:58:07,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1211322.0, ans=10.0 2023-06-22 10:58:40,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. 
limit=8.0 2023-06-22 10:59:00,583 INFO [train.py:996] (1/4) Epoch 7, batch 18950, loss[loss=0.2461, simple_loss=0.3416, pruned_loss=0.07527, over 21710.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2989, pruned_loss=0.08044, over 4271728.44 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:59:01,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1211502.0, ans=0.0 2023-06-22 10:59:02,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1211502.0, ans=0.0 2023-06-22 10:59:08,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.82 vs. limit=5.0 2023-06-22 10:59:16,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-22 10:59:30,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1211562.0, ans=0.125 2023-06-22 10:59:38,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 3.300e+02 3.868e+02 4.844e+02 6.994e+02, threshold=7.736e+02, percent-clipped=0.0 2023-06-22 10:59:40,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1211622.0, ans=0.05 2023-06-22 10:59:55,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1211622.0, ans=0.07 2023-06-22 11:00:25,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1211742.0, ans=0.125 2023-06-22 11:00:37,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1211802.0, ans=0.125 2023-06-22 11:00:38,070 INFO [train.py:996] (1/4) Epoch 7, batch 19000, loss[loss=0.2639, simple_loss=0.3596, pruned_loss=0.08403, over 21714.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3092, pruned_loss=0.08188, over 4274028.18 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 11:00:41,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1211802.0, ans=0.0 2023-06-22 11:00:50,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1211802.0, ans=0.1 2023-06-22 11:00:51,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1211802.0, ans=0.04949747468305833 2023-06-22 11:01:06,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 11:02:18,618 INFO [train.py:996] (1/4) Epoch 7, batch 19050, loss[loss=0.2869, simple_loss=0.3459, pruned_loss=0.114, over 21719.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3149, pruned_loss=0.08614, over 4278816.72 frames. 
], batch size: 389, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:03:06,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.274e+02 3.680e+02 4.051e+02 6.947e+02, threshold=7.360e+02, percent-clipped=0.0 2023-06-22 11:03:50,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-22 11:03:57,469 INFO [train.py:996] (1/4) Epoch 7, batch 19100, loss[loss=0.223, simple_loss=0.2834, pruned_loss=0.08132, over 21760.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3137, pruned_loss=0.08777, over 4278373.59 frames. ], batch size: 112, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:04:22,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1212462.0, ans=0.0 2023-06-22 11:04:22,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1212462.0, ans=0.2 2023-06-22 11:04:41,414 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:05:34,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1212642.0, ans=0.125 2023-06-22 11:05:34,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1212642.0, ans=0.07 2023-06-22 11:05:39,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1212702.0, ans=0.0 2023-06-22 11:05:40,914 INFO [train.py:996] (1/4) Epoch 7, batch 19150, loss[loss=0.3146, simple_loss=0.404, pruned_loss=0.1126, over 21612.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3157, pruned_loss=0.08832, over 4275485.87 frames. ], batch size: 414, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:06:42,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 3.652e+02 4.521e+02 6.039e+02 1.131e+03, threshold=9.042e+02, percent-clipped=10.0 2023-06-22 11:07:16,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-22 11:07:23,680 INFO [train.py:996] (1/4) Epoch 7, batch 19200, loss[loss=0.2357, simple_loss=0.3338, pruned_loss=0.06879, over 21155.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3238, pruned_loss=0.08838, over 4275201.60 frames. ], batch size: 143, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:08:31,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=15.0 2023-06-22 11:09:03,390 INFO [train.py:996] (1/4) Epoch 7, batch 19250, loss[loss=0.2266, simple_loss=0.2966, pruned_loss=0.07833, over 21478.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3232, pruned_loss=0.0826, over 4269640.98 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:10:04,473 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 3.337e+02 4.181e+02 5.636e+02 1.044e+03, threshold=8.362e+02, percent-clipped=4.0 2023-06-22 11:10:08,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1213422.0, ans=0.0 2023-06-22 11:10:43,624 INFO [train.py:996] (1/4) Epoch 7, batch 19300, loss[loss=0.2517, simple_loss=0.3156, pruned_loss=0.09395, over 21793.00 frames. 
], tot_loss[loss=0.2408, simple_loss=0.3202, pruned_loss=0.08073, over 4271216.48 frames. ], batch size: 112, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:11:27,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1213722.0, ans=0.125 2023-06-22 11:11:39,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1213722.0, ans=0.0 2023-06-22 11:11:55,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1213782.0, ans=0.0 2023-06-22 11:12:02,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1213782.0, ans=0.125 2023-06-22 11:12:31,347 INFO [train.py:996] (1/4) Epoch 7, batch 19350, loss[loss=0.2043, simple_loss=0.2897, pruned_loss=0.05942, over 21687.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3144, pruned_loss=0.07702, over 4277884.36 frames. ], batch size: 247, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:12:31,748 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:12:48,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-22 11:13:20,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.132e+02 3.696e+02 4.468e+02 9.223e+02, threshold=7.391e+02, percent-clipped=2.0 2023-06-22 11:13:21,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1214022.0, ans=0.035 2023-06-22 11:13:32,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.57 vs. limit=10.0 2023-06-22 11:13:36,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1214082.0, ans=0.0 2023-06-22 11:13:39,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1214082.0, ans=0.125 2023-06-22 11:14:04,205 INFO [train.py:996] (1/4) Epoch 7, batch 19400, loss[loss=0.2369, simple_loss=0.305, pruned_loss=0.08443, over 21548.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3119, pruned_loss=0.07625, over 4280981.33 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:15:12,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1214382.0, ans=0.0 2023-06-22 11:15:21,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1214442.0, ans=0.125 2023-06-22 11:15:24,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1214442.0, ans=0.2 2023-06-22 11:15:43,335 INFO [train.py:996] (1/4) Epoch 7, batch 19450, loss[loss=0.2167, simple_loss=0.2852, pruned_loss=0.07413, over 20181.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.31, pruned_loss=0.07862, over 4285483.05 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:15:52,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-06-22 11:15:59,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1214502.0, ans=0.125 2023-06-22 11:16:29,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1214622.0, ans=0.0 2023-06-22 11:16:37,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1214622.0, ans=0.125 2023-06-22 11:16:38,389 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.537e+02 3.055e+02 3.774e+02 4.517e+02 1.086e+03, threshold=7.548e+02, percent-clipped=5.0 2023-06-22 11:16:42,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-22 11:16:43,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1214622.0, ans=0.2 2023-06-22 11:16:59,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1214682.0, ans=0.0 2023-06-22 11:17:06,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1214742.0, ans=0.1 2023-06-22 11:17:12,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1214742.0, ans=0.125 2023-06-22 11:17:23,340 INFO [train.py:996] (1/4) Epoch 7, batch 19500, loss[loss=0.1944, simple_loss=0.2517, pruned_loss=0.06852, over 21157.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3037, pruned_loss=0.07905, over 4285955.15 frames. ], batch size: 143, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:18:07,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1214862.0, ans=0.125 2023-06-22 11:18:07,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1214862.0, ans=0.0 2023-06-22 11:18:07,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1214862.0, ans=0.125 2023-06-22 11:18:20,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1214922.0, ans=0.125 2023-06-22 11:18:38,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1214982.0, ans=0.2 2023-06-22 11:18:38,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1214982.0, ans=0.0 2023-06-22 11:18:39,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1214982.0, ans=0.125 2023-06-22 11:18:52,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1215042.0, ans=12.0 2023-06-22 11:19:05,706 INFO [train.py:996] (1/4) Epoch 7, batch 19550, loss[loss=0.2199, simple_loss=0.3103, pruned_loss=0.06477, over 21376.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3001, pruned_loss=0.07737, over 4273256.96 frames. 
], batch size: 194, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:19:06,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1215102.0, ans=0.5 2023-06-22 11:19:25,701 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:19:38,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1215162.0, ans=0.1 2023-06-22 11:19:39,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1215162.0, ans=0.125 2023-06-22 11:19:55,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 3.073e+02 3.530e+02 4.388e+02 8.690e+02, threshold=7.059e+02, percent-clipped=2.0 2023-06-22 11:20:10,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.68 vs. limit=5.0 2023-06-22 11:20:39,274 INFO [train.py:996] (1/4) Epoch 7, batch 19600, loss[loss=0.2584, simple_loss=0.3225, pruned_loss=0.09714, over 21849.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3023, pruned_loss=0.07838, over 4278228.54 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:21:22,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-22 11:22:22,676 INFO [train.py:996] (1/4) Epoch 7, batch 19650, loss[loss=0.2511, simple_loss=0.3194, pruned_loss=0.09143, over 21888.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3085, pruned_loss=0.08338, over 4282652.51 frames. ], batch size: 371, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:22:23,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1215702.0, ans=0.125 2023-06-22 11:22:23,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-22 11:22:31,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1215702.0, ans=0.0 2023-06-22 11:23:15,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 11:23:15,652 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.627e+02 4.077e+02 5.113e+02 8.180e+02, threshold=8.154e+02, percent-clipped=7.0 2023-06-22 11:24:16,455 INFO [train.py:996] (1/4) Epoch 7, batch 19700, loss[loss=0.1926, simple_loss=0.2756, pruned_loss=0.05486, over 21509.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3094, pruned_loss=0.08328, over 4273458.26 frames. 
], batch size: 212, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:24:22,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1216002.0, ans=0.95 2023-06-22 11:24:25,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1216002.0, ans=0.0 2023-06-22 11:25:15,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1216182.0, ans=0.125 2023-06-22 11:25:34,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-22 11:25:58,837 INFO [train.py:996] (1/4) Epoch 7, batch 19750, loss[loss=0.2188, simple_loss=0.2985, pruned_loss=0.06953, over 21753.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3186, pruned_loss=0.08431, over 4270097.15 frames. ], batch size: 124, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:26:26,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 11:26:44,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.745e+02 3.729e+02 4.611e+02 5.991e+02 1.312e+03, threshold=9.223e+02, percent-clipped=7.0 2023-06-22 11:27:38,677 INFO [train.py:996] (1/4) Epoch 7, batch 19800, loss[loss=0.207, simple_loss=0.2776, pruned_loss=0.06821, over 21826.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3193, pruned_loss=0.0858, over 4282905.04 frames. ], batch size: 247, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:28:17,395 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:28:20,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1216722.0, ans=0.125 2023-06-22 11:28:21,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-22 11:29:00,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1216782.0, ans=0.125 2023-06-22 11:29:05,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216842.0, ans=0.1 2023-06-22 11:29:21,357 INFO [train.py:996] (1/4) Epoch 7, batch 19850, loss[loss=0.1969, simple_loss=0.2653, pruned_loss=0.06428, over 21447.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3125, pruned_loss=0.08165, over 4273428.21 frames. ], batch size: 194, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:29:59,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.28 vs. limit=22.5 2023-06-22 11:30:12,597 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.965e+02 3.588e+02 4.617e+02 1.028e+03, threshold=7.176e+02, percent-clipped=3.0 2023-06-22 11:30:49,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-22 11:31:00,270 INFO [train.py:996] (1/4) Epoch 7, batch 19900, loss[loss=0.2926, simple_loss=0.4032, pruned_loss=0.09097, over 19705.00 frames. 
], tot_loss[loss=0.2364, simple_loss=0.3144, pruned_loss=0.0792, over 4274913.33 frames. ], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:31:10,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1217202.0, ans=0.125 2023-06-22 11:31:28,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1217262.0, ans=0.07 2023-06-22 11:31:46,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.66 vs. limit=10.0 2023-06-22 11:32:35,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-22 11:32:37,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-22 11:32:42,605 INFO [train.py:996] (1/4) Epoch 7, batch 19950, loss[loss=0.2077, simple_loss=0.27, pruned_loss=0.07277, over 21215.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.308, pruned_loss=0.07909, over 4272683.00 frames. ], batch size: 131, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:33:39,617 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.305e+02 4.065e+02 5.440e+02 9.798e+02, threshold=8.130e+02, percent-clipped=10.0 2023-06-22 11:34:02,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1217682.0, ans=0.0 2023-06-22 11:34:22,886 INFO [train.py:996] (1/4) Epoch 7, batch 20000, loss[loss=0.2171, simple_loss=0.2952, pruned_loss=0.06953, over 21818.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3101, pruned_loss=0.07943, over 4277986.02 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:34:37,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1217802.0, ans=0.125 2023-06-22 11:35:04,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.07 vs. limit=12.0 2023-06-22 11:35:59,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1218042.0, ans=0.2 2023-06-22 11:36:02,167 INFO [train.py:996] (1/4) Epoch 7, batch 20050, loss[loss=0.2611, simple_loss=0.3316, pruned_loss=0.09529, over 21868.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3118, pruned_loss=0.08215, over 4282793.96 frames. ], batch size: 414, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:36:25,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1218162.0, ans=0.125 2023-06-22 11:36:45,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1218222.0, ans=0.0 2023-06-22 11:37:00,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.182e+02 3.871e+02 4.475e+02 7.153e+02, threshold=7.741e+02, percent-clipped=0.0 2023-06-22 11:37:31,111 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-22 11:37:49,411 INFO [train.py:996] (1/4) Epoch 7, batch 20100, loss[loss=0.2528, simple_loss=0.3426, pruned_loss=0.08148, over 21784.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3144, pruned_loss=0.08432, over 4287903.48 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:38:41,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1218522.0, ans=0.125 2023-06-22 11:39:26,431 INFO [train.py:996] (1/4) Epoch 7, batch 20150, loss[loss=0.2654, simple_loss=0.3424, pruned_loss=0.09413, over 21675.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3241, pruned_loss=0.08807, over 4288353.82 frames. ], batch size: 351, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:39:27,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218702.0, ans=0.1 2023-06-22 11:39:28,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218702.0, ans=0.1 2023-06-22 11:39:42,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2023-06-22 11:39:56,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1218762.0, ans=0.0 2023-06-22 11:40:28,338 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 4.003e+02 4.787e+02 6.267e+02 1.040e+03, threshold=9.575e+02, percent-clipped=17.0 2023-06-22 11:41:08,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218942.0, ans=0.1 2023-06-22 11:41:12,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1218942.0, ans=0.125 2023-06-22 11:41:15,569 INFO [train.py:996] (1/4) Epoch 7, batch 20200, loss[loss=0.2821, simple_loss=0.3719, pruned_loss=0.09622, over 21825.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3307, pruned_loss=0.09144, over 4290563.22 frames. ], batch size: 371, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:42:22,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1219182.0, ans=0.07 2023-06-22 11:43:00,763 INFO [train.py:996] (1/4) Epoch 7, batch 20250, loss[loss=0.2104, simple_loss=0.2924, pruned_loss=0.06424, over 21011.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3308, pruned_loss=0.08922, over 4284987.86 frames. ], batch size: 607, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:43:54,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.109e+02 3.852e+02 4.558e+02 1.289e+03, threshold=7.704e+02, percent-clipped=1.0 2023-06-22 11:44:06,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1219482.0, ans=0.125 2023-06-22 11:44:22,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1219542.0, ans=0.125 2023-06-22 11:44:40,590 INFO [train.py:996] (1/4) Epoch 7, batch 20300, loss[loss=0.22, simple_loss=0.3167, pruned_loss=0.0617, over 21730.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3284, pruned_loss=0.08611, over 4289006.81 frames. 
], batch size: 351, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:45:24,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1219722.0, ans=0.125 2023-06-22 11:45:38,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1219782.0, ans=0.0 2023-06-22 11:46:14,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1219842.0, ans=0.125 2023-06-22 11:46:18,499 INFO [train.py:996] (1/4) Epoch 7, batch 20350, loss[loss=0.2009, simple_loss=0.2766, pruned_loss=0.06253, over 17320.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3283, pruned_loss=0.08674, over 4277468.84 frames. ], batch size: 65, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:46:52,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1219962.0, ans=0.125 2023-06-22 11:47:01,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1220022.0, ans=0.125 2023-06-22 11:47:11,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.199e+02 3.639e+02 4.659e+02 8.452e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 11:47:41,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1220142.0, ans=10.0 2023-06-22 11:47:58,710 INFO [train.py:996] (1/4) Epoch 7, batch 20400, loss[loss=0.3127, simple_loss=0.3779, pruned_loss=0.1237, over 21779.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3298, pruned_loss=0.08891, over 4255184.42 frames. ], batch size: 441, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:48:14,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-22 11:48:31,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1220262.0, ans=0.0 2023-06-22 11:48:37,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-22 11:48:38,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1220322.0, ans=0.1 2023-06-22 11:49:29,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1220442.0, ans=0.125 2023-06-22 11:49:43,962 INFO [train.py:996] (1/4) Epoch 7, batch 20450, loss[loss=0.2664, simple_loss=0.3275, pruned_loss=0.1027, over 21818.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3314, pruned_loss=0.09224, over 4266263.54 frames. 
], batch size: 441, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:50:04,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1220562.0, ans=0.125 2023-06-22 11:50:30,937 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.380e+02 3.850e+02 4.870e+02 7.513e+02, threshold=7.700e+02, percent-clipped=1.0 2023-06-22 11:50:35,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1220682.0, ans=0.0 2023-06-22 11:50:54,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-22 11:50:55,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1220742.0, ans=0.125 2023-06-22 11:50:58,691 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.79 vs. limit=10.0 2023-06-22 11:51:16,555 INFO [train.py:996] (1/4) Epoch 7, batch 20500, loss[loss=0.2398, simple_loss=0.2982, pruned_loss=0.09073, over 21872.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3264, pruned_loss=0.09187, over 4262541.50 frames. ], batch size: 107, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:52:34,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1221042.0, ans=0.0 2023-06-22 11:52:39,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1221042.0, ans=0.125 2023-06-22 11:53:01,612 INFO [train.py:996] (1/4) Epoch 7, batch 20550, loss[loss=0.2703, simple_loss=0.3489, pruned_loss=0.09591, over 21851.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3175, pruned_loss=0.08919, over 4265647.05 frames. ], batch size: 372, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:53:15,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1221102.0, ans=0.125 2023-06-22 11:53:51,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.265e+02 4.144e+02 5.422e+02 9.318e+02, threshold=8.288e+02, percent-clipped=6.0 2023-06-22 11:54:40,926 INFO [train.py:996] (1/4) Epoch 7, batch 20600, loss[loss=0.2383, simple_loss=0.3306, pruned_loss=0.07301, over 21733.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3183, pruned_loss=0.08704, over 4257327.74 frames. ], batch size: 298, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:54:57,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1221462.0, ans=0.1 2023-06-22 11:55:18,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1221522.0, ans=0.125 2023-06-22 11:55:25,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1221522.0, ans=0.125 2023-06-22 11:55:25,925 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-22 11:55:45,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.79 vs. 
limit=15.0 2023-06-22 11:55:48,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1221582.0, ans=0.0 2023-06-22 11:56:19,437 INFO [train.py:996] (1/4) Epoch 7, batch 20650, loss[loss=0.2258, simple_loss=0.2894, pruned_loss=0.08109, over 21848.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3159, pruned_loss=0.0873, over 4253598.23 frames. ], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:56:31,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1221702.0, ans=0.125 2023-06-22 11:57:08,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.307e+02 4.027e+02 4.834e+02 1.059e+03, threshold=8.054e+02, percent-clipped=3.0 2023-06-22 11:57:22,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1221882.0, ans=0.125 2023-06-22 11:57:37,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1221942.0, ans=0.0 2023-06-22 11:57:59,199 INFO [train.py:996] (1/4) Epoch 7, batch 20700, loss[loss=0.2523, simple_loss=0.3333, pruned_loss=0.08563, over 21722.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.309, pruned_loss=0.08406, over 4246873.52 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:58:13,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1222002.0, ans=0.125 2023-06-22 11:58:34,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1222062.0, ans=0.0 2023-06-22 11:58:42,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1222122.0, ans=0.125 2023-06-22 11:59:01,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1222182.0, ans=0.125 2023-06-22 11:59:10,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1222182.0, ans=0.0 2023-06-22 11:59:29,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-22 11:59:41,100 INFO [train.py:996] (1/4) Epoch 7, batch 20750, loss[loss=0.2684, simple_loss=0.3959, pruned_loss=0.07046, over 20752.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3114, pruned_loss=0.08297, over 4250455.44 frames. ], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:59:53,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.08 vs. 
limit=22.5 2023-06-22 11:59:54,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1222302.0, ans=0.125 2023-06-22 12:00:28,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1222422.0, ans=0.0 2023-06-22 12:00:36,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.483e+02 4.528e+02 6.877e+02 1.317e+03, threshold=9.056e+02, percent-clipped=16.0 2023-06-22 12:01:01,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222482.0, ans=0.1 2023-06-22 12:01:21,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1222602.0, ans=0.0 2023-06-22 12:01:26,505 INFO [train.py:996] (1/4) Epoch 7, batch 20800, loss[loss=0.2353, simple_loss=0.2939, pruned_loss=0.0884, over 21821.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3132, pruned_loss=0.08318, over 4245629.87 frames. ], batch size: 372, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:01:26,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222602.0, ans=0.1 2023-06-22 12:01:37,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-22 12:01:48,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1222662.0, ans=0.07 2023-06-22 12:02:06,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1222722.0, ans=0.0 2023-06-22 12:02:06,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222722.0, ans=0.1 2023-06-22 12:02:10,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-22 12:02:33,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1222782.0, ans=0.95 2023-06-22 12:03:02,326 INFO [train.py:996] (1/4) Epoch 7, batch 20850, loss[loss=0.2334, simple_loss=0.3028, pruned_loss=0.08201, over 21842.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3085, pruned_loss=0.0814, over 4249387.35 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:03:04,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1222902.0, ans=0.025 2023-06-22 12:03:29,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1222962.0, ans=0.125 2023-06-22 12:03:31,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1222962.0, ans=0.0 2023-06-22 12:04:01,381 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.647e+02 5.072e+02 6.568e+02 1.337e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-22 12:04:46,062 INFO [train.py:996] (1/4) Epoch 7, batch 20900, loss[loss=0.2879, simple_loss=0.4119, pruned_loss=0.08195, over 19821.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3093, pruned_loss=0.08213, over 4254773.08 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:05:21,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1223322.0, ans=0.2 2023-06-22 12:05:30,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1223322.0, ans=0.0 2023-06-22 12:06:01,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223382.0, ans=0.1 2023-06-22 12:06:06,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1223442.0, ans=0.0 2023-06-22 12:06:19,603 INFO [train.py:996] (1/4) Epoch 7, batch 20950, loss[loss=0.2333, simple_loss=0.3041, pruned_loss=0.08128, over 21849.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3035, pruned_loss=0.0782, over 4264909.30 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:06:40,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1223562.0, ans=0.125 2023-06-22 12:07:15,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.036e+02 3.516e+02 4.387e+02 8.628e+02, threshold=7.032e+02, percent-clipped=0.0 2023-06-22 12:07:25,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1223682.0, ans=0.0 2023-06-22 12:07:53,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1223742.0, ans=0.125 2023-06-22 12:07:57,799 INFO [train.py:996] (1/4) Epoch 7, batch 21000, loss[loss=0.2639, simple_loss=0.3267, pruned_loss=0.1005, over 21195.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3026, pruned_loss=0.07961, over 4273139.66 frames. ], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:07:57,799 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 12:08:15,950 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2689, simple_loss=0.3672, pruned_loss=0.08525, over 1796401.00 frames. 2023-06-22 12:08:15,951 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 12:08:28,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223802.0, ans=0.1 2023-06-22 12:08:31,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-22 12:09:47,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1224042.0, ans=0.0 2023-06-22 12:09:54,869 INFO [train.py:996] (1/4) Epoch 7, batch 21050, loss[loss=0.2035, simple_loss=0.2745, pruned_loss=0.06623, over 21858.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3013, pruned_loss=0.0804, over 4278059.00 frames. ], batch size: 118, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:10:00,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-22 12:10:04,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1224102.0, ans=0.125 2023-06-22 12:10:07,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1224102.0, ans=0.1 2023-06-22 12:10:49,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.015e+02 3.354e+02 4.094e+02 5.427e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:10:54,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1224282.0, ans=0.125 2023-06-22 12:11:33,501 INFO [train.py:996] (1/4) Epoch 7, batch 21100, loss[loss=0.2422, simple_loss=0.2965, pruned_loss=0.09393, over 21816.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2982, pruned_loss=0.08003, over 4279445.56 frames. ], batch size: 352, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:12:57,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-22 12:13:07,744 INFO [train.py:996] (1/4) Epoch 7, batch 21150, loss[loss=0.2622, simple_loss=0.3084, pruned_loss=0.1079, over 21330.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2956, pruned_loss=0.08097, over 4277477.13 frames. ], batch size: 473, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:13:27,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-22 12:14:08,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.198e+02 3.741e+02 4.699e+02 9.376e+02, threshold=7.483e+02, percent-clipped=8.0 2023-06-22 12:14:08,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1224882.0, ans=0.0 2023-06-22 12:14:46,335 INFO [train.py:996] (1/4) Epoch 7, batch 21200, loss[loss=0.2177, simple_loss=0.2815, pruned_loss=0.07693, over 21269.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2915, pruned_loss=0.07982, over 4269160.94 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:15:03,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1225002.0, ans=0.125 2023-06-22 12:15:06,367 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:16:30,850 INFO [train.py:996] (1/4) Epoch 7, batch 21250, loss[loss=0.2019, simple_loss=0.2675, pruned_loss=0.06812, over 21589.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2905, pruned_loss=0.07987, over 4273327.01 frames. ], batch size: 263, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:16:41,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.02 vs. 
limit=10.0 2023-06-22 12:17:09,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1225422.0, ans=0.0 2023-06-22 12:17:27,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.297e+02 3.945e+02 5.021e+02 1.062e+03, threshold=7.890e+02, percent-clipped=7.0 2023-06-22 12:17:56,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1225542.0, ans=0.125 2023-06-22 12:18:03,972 INFO [train.py:996] (1/4) Epoch 7, batch 21300, loss[loss=0.2107, simple_loss=0.2688, pruned_loss=0.07629, over 21267.00 frames. ], tot_loss[loss=0.232, simple_loss=0.2984, pruned_loss=0.08281, over 4275165.43 frames. ], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:18:25,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1225662.0, ans=0.0 2023-06-22 12:18:45,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1225722.0, ans=0.0 2023-06-22 12:19:10,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1225782.0, ans=0.125 2023-06-22 12:19:47,797 INFO [train.py:996] (1/4) Epoch 7, batch 21350, loss[loss=0.2221, simple_loss=0.288, pruned_loss=0.07806, over 16718.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3015, pruned_loss=0.08299, over 4272711.31 frames. ], batch size: 62, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:19:51,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1225902.0, ans=0.2 2023-06-22 12:20:45,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.191e+02 3.567e+02 4.757e+02 8.464e+02, threshold=7.133e+02, percent-clipped=1.0 2023-06-22 12:21:21,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1226142.0, ans=0.125 2023-06-22 12:21:26,950 INFO [train.py:996] (1/4) Epoch 7, batch 21400, loss[loss=0.291, simple_loss=0.3647, pruned_loss=0.1087, over 21408.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3049, pruned_loss=0.08235, over 4281330.06 frames. ], batch size: 471, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:22:30,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=22.5 2023-06-22 12:22:34,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1226382.0, ans=15.0 2023-06-22 12:22:42,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2023-06-22 12:23:06,020 INFO [train.py:996] (1/4) Epoch 7, batch 21450, loss[loss=0.2194, simple_loss=0.2855, pruned_loss=0.0766, over 21812.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3082, pruned_loss=0.08359, over 4279697.38 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:23:13,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-22 12:23:34,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1226562.0, ans=0.125 2023-06-22 12:24:04,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.251e+02 3.638e+02 4.479e+02 7.872e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 12:24:08,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-22 12:24:19,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1226682.0, ans=0.125 2023-06-22 12:24:45,026 INFO [train.py:996] (1/4) Epoch 7, batch 21500, loss[loss=0.2398, simple_loss=0.2948, pruned_loss=0.09236, over 21881.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3079, pruned_loss=0.08542, over 4260804.52 frames. ], batch size: 373, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:24:46,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-22 12:24:51,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1226802.0, ans=0.0 2023-06-22 12:25:06,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1226862.0, ans=0.125 2023-06-22 12:26:19,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1227042.0, ans=0.07 2023-06-22 12:26:22,145 INFO [train.py:996] (1/4) Epoch 7, batch 21550, loss[loss=0.1882, simple_loss=0.252, pruned_loss=0.06218, over 21474.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3021, pruned_loss=0.08302, over 4261407.05 frames. ], batch size: 212, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:26:22,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1227102.0, ans=0.0 2023-06-22 12:26:28,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1227102.0, ans=0.2 2023-06-22 12:26:33,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1227102.0, ans=0.125 2023-06-22 12:26:53,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1227162.0, ans=0.125 2023-06-22 12:27:18,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-22 12:27:22,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.394e+02 4.273e+02 5.102e+02 8.166e+02, threshold=8.546e+02, percent-clipped=3.0 2023-06-22 12:27:39,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1227282.0, ans=0.2 2023-06-22 12:27:53,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-22 12:28:03,024 INFO [train.py:996] (1/4) Epoch 7, batch 21600, loss[loss=0.2195, simple_loss=0.2785, pruned_loss=0.08023, over 21332.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.3002, pruned_loss=0.08195, over 4255704.15 frames. ], batch size: 144, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:28:06,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1227402.0, ans=0.125 2023-06-22 12:28:39,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-22 12:28:43,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1227462.0, ans=0.125 2023-06-22 12:29:08,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1227582.0, ans=0.1 2023-06-22 12:29:26,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1227582.0, ans=0.2 2023-06-22 12:29:44,723 INFO [train.py:996] (1/4) Epoch 7, batch 21650, loss[loss=0.2241, simple_loss=0.3137, pruned_loss=0.06728, over 21633.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3031, pruned_loss=0.08007, over 4256357.57 frames. ], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:30:38,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1227822.0, ans=0.125 2023-06-22 12:30:46,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1227882.0, ans=0.125 2023-06-22 12:30:48,058 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.272e+02 4.072e+02 5.244e+02 1.561e+03, threshold=8.145e+02, percent-clipped=7.0 2023-06-22 12:30:53,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-22 12:31:22,609 INFO [train.py:996] (1/4) Epoch 7, batch 21700, loss[loss=0.242, simple_loss=0.3046, pruned_loss=0.0897, over 21644.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3021, pruned_loss=0.07764, over 4265379.57 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:31:53,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1228062.0, ans=0.2 2023-06-22 12:32:06,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1228122.0, ans=0.0 2023-06-22 12:32:08,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1228122.0, ans=0.0 2023-06-22 12:32:51,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1228242.0, ans=0.0 2023-06-22 12:33:01,441 INFO [train.py:996] (1/4) Epoch 7, batch 21750, loss[loss=0.2273, simple_loss=0.2766, pruned_loss=0.08905, over 21488.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2975, pruned_loss=0.07783, over 4267178.35 frames. 
], batch size: 212, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:33:07,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1228302.0, ans=0.125 2023-06-22 12:33:15,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1228302.0, ans=0.125 2023-06-22 12:33:46,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1228422.0, ans=0.125 2023-06-22 12:33:57,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1228482.0, ans=0.125 2023-06-22 12:34:00,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.122e+02 3.637e+02 4.917e+02 1.048e+03, threshold=7.274e+02, percent-clipped=3.0 2023-06-22 12:34:21,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1228482.0, ans=0.125 2023-06-22 12:34:40,477 INFO [train.py:996] (1/4) Epoch 7, batch 21800, loss[loss=0.2304, simple_loss=0.296, pruned_loss=0.08242, over 21320.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2955, pruned_loss=0.07857, over 4267825.28 frames. ], batch size: 160, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:34:51,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-22 12:36:19,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1228902.0, ans=0.125 2023-06-22 12:36:20,679 INFO [train.py:996] (1/4) Epoch 7, batch 21850, loss[loss=0.2257, simple_loss=0.2842, pruned_loss=0.08362, over 21596.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2999, pruned_loss=0.07912, over 4259694.08 frames. ], batch size: 263, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:36:56,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1228962.0, ans=0.2 2023-06-22 12:37:04,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1229022.0, ans=0.125 2023-06-22 12:37:05,093 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-22 12:37:24,827 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.283e+02 3.846e+02 4.671e+02 1.030e+03, threshold=7.692e+02, percent-clipped=3.0 2023-06-22 12:37:51,985 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-22 12:38:00,778 INFO [train.py:996] (1/4) Epoch 7, batch 21900, loss[loss=0.2555, simple_loss=0.333, pruned_loss=0.08899, over 20059.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3021, pruned_loss=0.08091, over 4267569.57 frames. ], batch size: 702, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:38:08,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229202.0, ans=0.125 2023-06-22 12:38:38,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. 
limit=15.0 2023-06-22 12:38:52,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. limit=6.0 2023-06-22 12:39:33,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1229442.0, ans=0.125 2023-06-22 12:39:33,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1229442.0, ans=0.125 2023-06-22 12:39:44,774 INFO [train.py:996] (1/4) Epoch 7, batch 21950, loss[loss=0.1747, simple_loss=0.2559, pruned_loss=0.04672, over 21713.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2968, pruned_loss=0.07943, over 4268464.84 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:39:50,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1229502.0, ans=0.125 2023-06-22 12:40:26,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1229622.0, ans=0.125 2023-06-22 12:40:40,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1229682.0, ans=0.125 2023-06-22 12:40:48,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.971e+02 3.645e+02 4.413e+02 9.727e+02, threshold=7.291e+02, percent-clipped=1.0 2023-06-22 12:41:24,801 INFO [train.py:996] (1/4) Epoch 7, batch 22000, loss[loss=0.2195, simple_loss=0.2887, pruned_loss=0.07517, over 21798.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2912, pruned_loss=0.07617, over 4274147.25 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:41:42,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1229802.0, ans=0.0 2023-06-22 12:41:46,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229862.0, ans=0.125 2023-06-22 12:41:49,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1229862.0, ans=0.015 2023-06-22 12:41:49,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229862.0, ans=0.125 2023-06-22 12:41:55,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1229862.0, ans=0.125 2023-06-22 12:42:29,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1229982.0, ans=0.0 2023-06-22 12:42:32,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.34 vs. 
limit=15.0 2023-06-22 12:42:39,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1229982.0, ans=0.0 2023-06-22 12:42:44,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1230042.0, ans=0.125 2023-06-22 12:43:10,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1230102.0, ans=0.0 2023-06-22 12:43:11,556 INFO [train.py:996] (1/4) Epoch 7, batch 22050, loss[loss=0.2792, simple_loss=0.3425, pruned_loss=0.1079, over 21307.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.294, pruned_loss=0.07692, over 4261000.00 frames. ], batch size: 159, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:43:18,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-22 12:43:25,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-22 12:43:40,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1230162.0, ans=0.0 2023-06-22 12:43:45,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1230162.0, ans=0.2 2023-06-22 12:44:02,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1230222.0, ans=0.025 2023-06-22 12:44:14,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.765e+02 5.011e+02 6.386e+02 1.691e+03, threshold=1.002e+03, percent-clipped=17.0 2023-06-22 12:44:31,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-22 12:44:36,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-22 12:44:52,496 INFO [train.py:996] (1/4) Epoch 7, batch 22100, loss[loss=0.3388, simple_loss=0.393, pruned_loss=0.1423, over 21765.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3053, pruned_loss=0.08234, over 4270079.88 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:44:53,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1230402.0, ans=0.5 2023-06-22 12:45:09,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1230462.0, ans=0.1 2023-06-22 12:45:09,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1230462.0, ans=0.125 2023-06-22 12:45:47,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1230522.0, ans=0.0 2023-06-22 12:45:56,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1230582.0, ans=0.125 2023-06-22 12:46:16,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.39 vs. 
limit=15.0 2023-06-22 12:46:25,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-22 12:46:30,172 INFO [train.py:996] (1/4) Epoch 7, batch 22150, loss[loss=0.2253, simple_loss=0.3047, pruned_loss=0.07294, over 21818.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3091, pruned_loss=0.08525, over 4280917.31 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:46:31,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5 2023-06-22 12:46:39,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1230702.0, ans=0.0 2023-06-22 12:47:06,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1230822.0, ans=0.125 2023-06-22 12:47:29,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.676e+02 4.235e+02 5.035e+02 1.205e+03, threshold=8.469e+02, percent-clipped=1.0 2023-06-22 12:48:02,761 INFO [train.py:996] (1/4) Epoch 7, batch 22200, loss[loss=0.2571, simple_loss=0.3462, pruned_loss=0.08404, over 21822.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3112, pruned_loss=0.08592, over 4292291.65 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:48:16,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-22 12:48:33,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1231062.0, ans=0.125 2023-06-22 12:49:42,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-22 12:49:48,304 INFO [train.py:996] (1/4) Epoch 7, batch 22250, loss[loss=0.2736, simple_loss=0.34, pruned_loss=0.1036, over 21496.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3178, pruned_loss=0.08762, over 4286342.45 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:49:52,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-22 12:49:58,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1231302.0, ans=0.125 2023-06-22 12:50:11,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1231362.0, ans=0.2 2023-06-22 12:50:33,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1231422.0, ans=0.0 2023-06-22 12:50:50,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.545e+02 4.219e+02 5.859e+02 1.258e+03, threshold=8.437e+02, percent-clipped=3.0 2023-06-22 12:51:29,308 INFO [train.py:996] (1/4) Epoch 7, batch 22300, loss[loss=0.2258, simple_loss=0.2838, pruned_loss=0.08391, over 21229.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3194, pruned_loss=0.0895, over 4281655.49 frames. 
], batch size: 608, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:51:56,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231662.0, ans=0.1 2023-06-22 12:52:05,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1231662.0, ans=0.125 2023-06-22 12:52:48,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1231842.0, ans=0.0 2023-06-22 12:53:10,183 INFO [train.py:996] (1/4) Epoch 7, batch 22350, loss[loss=0.2456, simple_loss=0.3154, pruned_loss=0.0879, over 21821.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3177, pruned_loss=0.09022, over 4287676.07 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:53:13,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1231902.0, ans=0.0 2023-06-22 12:53:48,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1231962.0, ans=0.5 2023-06-22 12:54:17,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.296e+02 3.742e+02 4.441e+02 8.144e+02, threshold=7.483e+02, percent-clipped=0.0 2023-06-22 12:54:43,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1232142.0, ans=0.125 2023-06-22 12:54:45,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1232142.0, ans=0.125 2023-06-22 12:54:50,725 INFO [train.py:996] (1/4) Epoch 7, batch 22400, loss[loss=0.2556, simple_loss=0.3168, pruned_loss=0.09722, over 21480.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3149, pruned_loss=0.08656, over 4283578.06 frames. ], batch size: 389, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:55:16,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1232262.0, ans=0.0 2023-06-22 12:55:18,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1232262.0, ans=0.1 2023-06-22 12:55:26,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1232262.0, ans=0.0 2023-06-22 12:55:41,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1232322.0, ans=0.125 2023-06-22 12:55:41,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1232322.0, ans=0.0 2023-06-22 12:55:54,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=15.0 2023-06-22 12:56:21,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1232442.0, ans=0.1 2023-06-22 12:56:30,519 INFO [train.py:996] (1/4) Epoch 7, batch 22450, loss[loss=0.2425, simple_loss=0.2891, pruned_loss=0.09798, over 21333.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3085, pruned_loss=0.08556, over 4275994.50 frames. 
], batch size: 473, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:56:45,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1232502.0, ans=0.0 2023-06-22 12:57:01,860 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:57:34,111 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.003e+02 3.355e+02 3.883e+02 5.692e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:58:16,670 INFO [train.py:996] (1/4) Epoch 7, batch 22500, loss[loss=0.2496, simple_loss=0.293, pruned_loss=0.1031, over 16594.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3032, pruned_loss=0.08423, over 4275302.20 frames. ], batch size: 69, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:58:24,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1232802.0, ans=0.2 2023-06-22 12:59:01,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1232922.0, ans=0.125 2023-06-22 12:59:10,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1232982.0, ans=0.0 2023-06-22 12:59:16,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1232982.0, ans=0.0 2023-06-22 13:00:01,059 INFO [train.py:996] (1/4) Epoch 7, batch 22550, loss[loss=0.2487, simple_loss=0.3241, pruned_loss=0.08659, over 22000.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3065, pruned_loss=0.08376, over 4278012.72 frames. ], batch size: 113, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:00:09,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1233102.0, ans=0.015 2023-06-22 13:00:29,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1233162.0, ans=0.125 2023-06-22 13:01:06,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.865e+02 3.475e+02 4.180e+02 5.606e+02 1.235e+03, threshold=8.360e+02, percent-clipped=11.0 2023-06-22 13:01:45,213 INFO [train.py:996] (1/4) Epoch 7, batch 22600, loss[loss=0.198, simple_loss=0.2621, pruned_loss=0.06702, over 21309.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3094, pruned_loss=0.08421, over 4272376.02 frames. ], batch size: 131, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:01:45,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1233402.0, ans=0.125 2023-06-22 13:03:06,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1233582.0, ans=0.07 2023-06-22 13:03:11,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1233642.0, ans=0.0 2023-06-22 13:03:24,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233702.0, ans=0.1 2023-06-22 13:03:25,475 INFO [train.py:996] (1/4) Epoch 7, batch 22650, loss[loss=0.214, simple_loss=0.2767, pruned_loss=0.07562, over 21783.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3073, pruned_loss=0.08377, over 4274631.28 frames. 
], batch size: 118, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:03:29,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.97 vs. limit=6.0 2023-06-22 13:03:50,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1233762.0, ans=0.0 2023-06-22 13:04:11,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233822.0, ans=0.1 2023-06-22 13:04:19,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1233822.0, ans=0.09899494936611666 2023-06-22 13:04:19,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-22 13:04:25,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1233882.0, ans=0.0 2023-06-22 13:04:32,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 3.837e+02 4.775e+02 6.238e+02 8.753e+02, threshold=9.549e+02, percent-clipped=4.0 2023-06-22 13:05:04,940 INFO [train.py:996] (1/4) Epoch 7, batch 22700, loss[loss=0.2634, simple_loss=0.3025, pruned_loss=0.1122, over 21436.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3038, pruned_loss=0.08372, over 4269199.26 frames. ], batch size: 476, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:05:06,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1234002.0, ans=0.09899494936611666 2023-06-22 13:05:14,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234002.0, ans=0.1 2023-06-22 13:05:36,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1234062.0, ans=0.0 2023-06-22 13:05:36,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234062.0, ans=0.1 2023-06-22 13:05:42,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1234122.0, ans=0.0 2023-06-22 13:06:26,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1234182.0, ans=0.125 2023-06-22 13:06:46,576 INFO [train.py:996] (1/4) Epoch 7, batch 22750, loss[loss=0.2172, simple_loss=0.2792, pruned_loss=0.07758, over 21729.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3058, pruned_loss=0.08525, over 4275086.53 frames. 
], batch size: 300, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:07:34,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1234422.0, ans=0.0 2023-06-22 13:07:36,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1234422.0, ans=0.125 2023-06-22 13:07:41,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1234422.0, ans=0.0 2023-06-22 13:07:53,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 3.471e+02 4.141e+02 5.452e+02 1.173e+03, threshold=8.282e+02, percent-clipped=2.0 2023-06-22 13:08:25,196 INFO [train.py:996] (1/4) Epoch 7, batch 22800, loss[loss=0.2648, simple_loss=0.3311, pruned_loss=0.09923, over 21732.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3078, pruned_loss=0.08738, over 4285140.23 frames. ], batch size: 389, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:08:57,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1234662.0, ans=0.95 2023-06-22 13:09:01,284 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:10:00,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-22 13:10:03,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1234902.0, ans=0.125 2023-06-22 13:10:04,266 INFO [train.py:996] (1/4) Epoch 7, batch 22850, loss[loss=0.2199, simple_loss=0.2904, pruned_loss=0.07465, over 21760.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3043, pruned_loss=0.08683, over 4276438.80 frames. ], batch size: 371, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:11:09,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.461e+02 4.069e+02 5.005e+02 9.619e+02, threshold=8.139e+02, percent-clipped=3.0 2023-06-22 13:11:34,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1235142.0, ans=0.1 2023-06-22 13:11:42,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1235142.0, ans=0.125 2023-06-22 13:11:42,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1235142.0, ans=0.125 2023-06-22 13:11:45,236 INFO [train.py:996] (1/4) Epoch 7, batch 22900, loss[loss=0.2301, simple_loss=0.3155, pruned_loss=0.0724, over 21352.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3054, pruned_loss=0.08573, over 4257380.69 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:12:09,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-22 13:13:32,258 INFO [train.py:996] (1/4) Epoch 7, batch 22950, loss[loss=0.2446, simple_loss=0.3448, pruned_loss=0.07223, over 21553.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.32, pruned_loss=0.08467, over 4262202.00 frames. 
], batch size: 195, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:13:38,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1235502.0, ans=0.125 2023-06-22 13:14:04,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1235562.0, ans=0.125 2023-06-22 13:14:14,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1235622.0, ans=0.125 2023-06-22 13:14:42,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.271e+02 4.396e+02 6.484e+02 1.017e+03, threshold=8.792e+02, percent-clipped=10.0 2023-06-22 13:14:52,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1235742.0, ans=0.125 2023-06-22 13:15:07,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1235742.0, ans=0.125 2023-06-22 13:15:11,727 INFO [train.py:996] (1/4) Epoch 7, batch 23000, loss[loss=0.2483, simple_loss=0.3731, pruned_loss=0.06174, over 20770.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3186, pruned_loss=0.08265, over 4262492.73 frames. ], batch size: 607, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:15:12,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1235802.0, ans=0.0 2023-06-22 13:15:12,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1235802.0, ans=0.2 2023-06-22 13:15:13,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1235802.0, ans=0.125 2023-06-22 13:16:44,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1236042.0, ans=0.125 2023-06-22 13:16:51,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1236102.0, ans=0.2 2023-06-22 13:16:52,157 INFO [train.py:996] (1/4) Epoch 7, batch 23050, loss[loss=0.2551, simple_loss=0.3208, pruned_loss=0.09473, over 21369.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3189, pruned_loss=0.08463, over 4264409.18 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:16:56,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1236102.0, ans=0.125 2023-06-22 13:17:19,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1236162.0, ans=0.0 2023-06-22 13:17:27,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. 
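limit=15.0

The recurring [optim.py:471] entries in this log report a five-number summary of recently observed gradient norms together with a clipping threshold and a percent-clipped figure. In the entries shown here the threshold appears to be Clipping_scale times the median of that distribution (e.g. 2.0 x 4.450e+02 = 8.900e+02 in the 13:18:03,106 entry that follows), and percent-clipped appears to report how often recent batches exceeded the threshold. The actual logic lives in icefall's optim.py and is not reproduced in this log; the snippet below is only a rough sketch of such a median-based clipping scheme, and the names (MedianGradClipper, window) are illustrative, not the real API.

```python
import torch
from collections import deque


class MedianGradClipper:
    """Illustrative sketch only (not the actual optim.py code): clip gradients
    against a threshold derived from the median of recent gradient norms."""

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)  # recent per-batch total grad norms
        self.num_clipped = 0
        self.num_seen = 0

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        # Total gradient norm for this batch.
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        self.num_seen += 1

        # Five-number summary of recent norms (min, quartiles, max).
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()  # scale * median

        if norm > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)

        percent_clipped = 100.0 * self.num_clipped / self.num_seen
        print(f"Clipping_scale={self.clipping_scale}, "
              f"grad-norm quartiles {q.tolist()}, "
              f"threshold={threshold:.3e}, percent-clipped={percent_clipped:.1f}")
        return threshold
```

In a training loop such a clipper would be called between loss.backward() and optimizer.step(), e.g. clipper(model.parameters()).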
2023-06-22 13:17:32,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1236222.0, ans=0.125 2023-06-22 13:18:00,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1236282.0, ans=0.0 2023-06-22 13:18:03,106 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.533e+02 3.591e+02 4.450e+02 5.560e+02 1.062e+03, threshold=8.900e+02, percent-clipped=1.0 2023-06-22 13:18:14,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1236282.0, ans=0.125 2023-06-22 13:18:28,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1236342.0, ans=0.125 2023-06-22 13:18:33,444 INFO [train.py:996] (1/4) Epoch 7, batch 23100, loss[loss=0.2219, simple_loss=0.2809, pruned_loss=0.08144, over 21622.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3164, pruned_loss=0.08536, over 4272751.18 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:18:46,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1236402.0, ans=0.035 2023-06-22 13:18:54,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 13:20:03,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1236642.0, ans=0.0 2023-06-22 13:20:11,914 INFO [train.py:996] (1/4) Epoch 7, batch 23150, loss[loss=0.258, simple_loss=0.3127, pruned_loss=0.1017, over 21677.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3121, pruned_loss=0.08521, over 4280163.34 frames. ], batch size: 263, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:20:37,500 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:21:02,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1236822.0, ans=0.125 2023-06-22 13:21:17,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.75 vs. limit=10.0 2023-06-22 13:21:21,230 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.574e+02 4.225e+02 5.615e+02 9.377e+02, threshold=8.449e+02, percent-clipped=1.0 2023-06-22 13:21:26,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1236882.0, ans=0.125 2023-06-22 13:21:28,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1236942.0, ans=0.125 2023-06-22 13:21:50,760 INFO [train.py:996] (1/4) Epoch 7, batch 23200, loss[loss=0.2444, simple_loss=0.3036, pruned_loss=0.09255, over 21588.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3115, pruned_loss=0.08639, over 4286372.51 frames.
], batch size: 548, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:22:28,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1237122.0, ans=0.0 2023-06-22 13:22:52,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1237182.0, ans=0.2 2023-06-22 13:23:10,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-22 13:23:17,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237242.0, ans=0.1 2023-06-22 13:23:30,218 INFO [train.py:996] (1/4) Epoch 7, batch 23250, loss[loss=0.2468, simple_loss=0.3105, pruned_loss=0.09158, over 21826.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.311, pruned_loss=0.0868, over 4285881.49 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:23:34,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1237302.0, ans=0.125 2023-06-22 13:23:37,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1237302.0, ans=0.0 2023-06-22 13:23:54,394 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-22 13:24:46,820 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.587e+02 4.602e+02 6.286e+02 1.178e+03, threshold=9.205e+02, percent-clipped=7.0 2023-06-22 13:24:56,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-22 13:25:16,770 INFO [train.py:996] (1/4) Epoch 7, batch 23300, loss[loss=0.2779, simple_loss=0.3875, pruned_loss=0.08418, over 21252.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3185, pruned_loss=0.08831, over 4286608.52 frames. ], batch size: 548, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:25:25,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-22 13:26:26,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1237782.0, ans=0.2 2023-06-22 13:26:30,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1237782.0, ans=0.125 2023-06-22 13:26:55,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-22 13:26:57,967 INFO [train.py:996] (1/4) Epoch 7, batch 23350, loss[loss=0.2151, simple_loss=0.2727, pruned_loss=0.07871, over 20257.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3225, pruned_loss=0.08806, over 4281037.16 frames. 
], batch size: 702, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:27:03,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1237902.0, ans=0.125 2023-06-22 13:27:22,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1237962.0, ans=0.0 2023-06-22 13:27:42,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1238022.0, ans=0.125 2023-06-22 13:28:09,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.411e+02 4.317e+02 5.452e+02 1.291e+03, threshold=8.634e+02, percent-clipped=4.0 2023-06-22 13:28:21,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.91 vs. limit=15.0 2023-06-22 13:28:24,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238142.0, ans=0.1 2023-06-22 13:28:38,126 INFO [train.py:996] (1/4) Epoch 7, batch 23400, loss[loss=0.2703, simple_loss=0.3293, pruned_loss=0.1056, over 21819.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3144, pruned_loss=0.08369, over 4283322.04 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:29:01,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-22 13:29:35,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1238322.0, ans=0.125 2023-06-22 13:29:52,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1238382.0, ans=0.0 2023-06-22 13:30:16,915 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:30:16,971 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:30:24,430 INFO [train.py:996] (1/4) Epoch 7, batch 23450, loss[loss=0.2474, simple_loss=0.319, pruned_loss=0.08791, over 21702.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.08607, over 4280308.97 frames. ], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:30:31,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1238502.0, ans=0.0 2023-06-22 13:30:40,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.70 vs. limit=22.5 2023-06-22 13:30:45,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. 
limit=22.5 2023-06-22 13:30:51,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238562.0, ans=0.1 2023-06-22 13:30:54,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1238562.0, ans=0.0 2023-06-22 13:30:54,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1238562.0, ans=0.0 2023-06-22 13:31:01,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1238622.0, ans=0.0 2023-06-22 13:31:17,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1238622.0, ans=0.2 2023-06-22 13:31:34,395 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.929e+02 4.982e+02 6.736e+02 9.588e+02, threshold=9.965e+02, percent-clipped=2.0 2023-06-22 13:31:40,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1238742.0, ans=0.2 2023-06-22 13:32:07,441 INFO [train.py:996] (1/4) Epoch 7, batch 23500, loss[loss=0.2636, simple_loss=0.3282, pruned_loss=0.09951, over 21506.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3185, pruned_loss=0.08855, over 4280441.77 frames. ], batch size: 211, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:32:35,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1238862.0, ans=0.0 2023-06-22 13:32:50,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-22 13:33:41,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1239042.0, ans=0.05 2023-06-22 13:33:48,044 INFO [train.py:996] (1/4) Epoch 7, batch 23550, loss[loss=0.2161, simple_loss=0.2821, pruned_loss=0.07511, over 21817.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3135, pruned_loss=0.08853, over 4274788.52 frames. ], batch size: 98, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:34:00,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-22 13:34:11,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1239162.0, ans=0.125 2023-06-22 13:34:13,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1239162.0, ans=0.1 2023-06-22 13:34:13,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1239162.0, ans=0.0 2023-06-22 13:34:30,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1239222.0, ans=0.125 2023-06-22 13:34:30,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. 
limit=12.0 2023-06-22 13:34:51,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1239282.0, ans=0.125 2023-06-22 13:34:55,619 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.824e+02 3.361e+02 3.874e+02 4.874e+02 9.234e+02, threshold=7.748e+02, percent-clipped=0.0 2023-06-22 13:34:57,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1239282.0, ans=0.2 2023-06-22 13:35:23,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1239342.0, ans=0.125 2023-06-22 13:35:29,854 INFO [train.py:996] (1/4) Epoch 7, batch 23600, loss[loss=0.2553, simple_loss=0.3286, pruned_loss=0.09103, over 21790.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3116, pruned_loss=0.08759, over 4268640.28 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:36:29,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1239582.0, ans=0.125 2023-06-22 13:36:47,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-22 13:37:12,167 INFO [train.py:996] (1/4) Epoch 7, batch 23650, loss[loss=0.295, simple_loss=0.3645, pruned_loss=0.1127, over 21451.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.312, pruned_loss=0.08551, over 4275247.65 frames. ], batch size: 471, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:37:16,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239702.0, ans=0.125 2023-06-22 13:37:24,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1239702.0, ans=0.0 2023-06-22 13:38:04,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1239822.0, ans=0.5 2023-06-22 13:38:29,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 3.692e+02 5.064e+02 6.588e+02 1.428e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-22 13:38:43,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1239942.0, ans=0.0 2023-06-22 13:38:47,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1239942.0, ans=0.125 2023-06-22 13:38:53,689 INFO [train.py:996] (1/4) Epoch 7, batch 23700, loss[loss=0.2657, simple_loss=0.338, pruned_loss=0.09677, over 21424.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3139, pruned_loss=0.08465, over 4278859.14 frames. 
], batch size: 131, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:39:05,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240002.0, ans=0.125 2023-06-22 13:39:32,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1240062.0, ans=0.2 2023-06-22 13:39:36,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1240122.0, ans=0.025 2023-06-22 13:40:26,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1240242.0, ans=0.0 2023-06-22 13:40:40,731 INFO [train.py:996] (1/4) Epoch 7, batch 23750, loss[loss=0.1859, simple_loss=0.2807, pruned_loss=0.04553, over 21264.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3158, pruned_loss=0.08442, over 4281800.09 frames. ], batch size: 176, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:40:44,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1240302.0, ans=0.0 2023-06-22 13:40:54,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1240302.0, ans=0.0 2023-06-22 13:41:47,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.292e+02 4.228e+02 5.463e+02 1.067e+03, threshold=8.456e+02, percent-clipped=1.0 2023-06-22 13:41:49,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.94 vs. limit=22.5 2023-06-22 13:42:08,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1240542.0, ans=0.125 2023-06-22 13:42:23,039 INFO [train.py:996] (1/4) Epoch 7, batch 23800, loss[loss=0.291, simple_loss=0.3778, pruned_loss=0.1021, over 21669.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3152, pruned_loss=0.08255, over 4280786.24 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:42:28,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1240602.0, ans=0.0 2023-06-22 13:42:31,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1240602.0, ans=0.0 2023-06-22 13:42:48,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-22 13:43:01,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1240662.0, ans=0.5 2023-06-22 13:43:38,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1240782.0, ans=0.125 2023-06-22 13:44:04,759 INFO [train.py:996] (1/4) Epoch 7, batch 23850, loss[loss=0.3034, simple_loss=0.3653, pruned_loss=0.1207, over 21207.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3232, pruned_loss=0.08447, over 4280406.56 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:44:19,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. 
limit=22.5 2023-06-22 13:44:47,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1241022.0, ans=0.125 2023-06-22 13:45:23,644 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.822e+02 3.704e+02 4.399e+02 5.519e+02 1.068e+03, threshold=8.797e+02, percent-clipped=5.0 2023-06-22 13:45:33,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1241142.0, ans=0.0 2023-06-22 13:45:47,906 INFO [train.py:996] (1/4) Epoch 7, batch 23900, loss[loss=0.2539, simple_loss=0.3268, pruned_loss=0.09055, over 21704.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.329, pruned_loss=0.08701, over 4283106.35 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:46:22,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1241262.0, ans=0.0 2023-06-22 13:46:35,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1241322.0, ans=0.5 2023-06-22 13:47:30,442 INFO [train.py:996] (1/4) Epoch 7, batch 23950, loss[loss=0.2644, simple_loss=0.3346, pruned_loss=0.09705, over 21171.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3239, pruned_loss=0.08678, over 4278081.13 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:47:30,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1241502.0, ans=0.0 2023-06-22 13:47:54,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1241502.0, ans=0.125 2023-06-22 13:48:47,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.707e+02 3.506e+02 4.327e+02 5.535e+02 8.905e+02, threshold=8.653e+02, percent-clipped=1.0 2023-06-22 13:49:02,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1241742.0, ans=0.125 2023-06-22 13:49:06,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-22 13:49:10,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1241742.0, ans=0.2 2023-06-22 13:49:17,848 INFO [train.py:996] (1/4) Epoch 7, batch 24000, loss[loss=0.2478, simple_loss=0.3213, pruned_loss=0.08718, over 21809.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3245, pruned_loss=0.09001, over 4277572.23 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:49:17,849 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 13:49:33,457 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2773, simple_loss=0.3696, pruned_loss=0.09254, over 1796401.00 frames. 2023-06-22 13:49:33,457 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 13:50:19,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1241922.0, ans=0.1 2023-06-22 13:50:59,526 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-22 13:51:16,770 INFO [train.py:996] (1/4) Epoch 7, batch 24050, loss[loss=0.2267, simple_loss=0.3152, pruned_loss=0.06917, over 21630.00 frames. 
], tot_loss[loss=0.2536, simple_loss=0.3264, pruned_loss=0.0904, over 4284045.71 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:51:41,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1242162.0, ans=0.0 2023-06-22 13:51:55,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1242162.0, ans=0.0 2023-06-22 13:52:14,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1242282.0, ans=0.0 2023-06-22 13:52:35,700 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.768e+02 5.092e+02 6.295e+02 1.003e+03, threshold=1.018e+03, percent-clipped=2.0 2023-06-22 13:52:36,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1242282.0, ans=0.0 2023-06-22 13:52:58,080 INFO [train.py:996] (1/4) Epoch 7, batch 24100, loss[loss=0.2581, simple_loss=0.3278, pruned_loss=0.09422, over 21298.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3259, pruned_loss=0.08863, over 4282678.13 frames. ], batch size: 176, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:53:16,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1242462.0, ans=0.09899494936611666 2023-06-22 13:53:34,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 13:53:38,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1242522.0, ans=0.125 2023-06-22 13:54:38,984 INFO [train.py:996] (1/4) Epoch 7, batch 24150, loss[loss=0.2577, simple_loss=0.3201, pruned_loss=0.09763, over 21892.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3261, pruned_loss=0.09002, over 4286672.30 frames. ], batch size: 371, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:54:44,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1242702.0, ans=0.0 2023-06-22 13:54:55,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242702.0, ans=0.1 2023-06-22 13:55:57,970 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.873e+02 3.729e+02 4.537e+02 5.592e+02 8.815e+02, threshold=9.074e+02, percent-clipped=0.0 2023-06-22 13:56:06,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242942.0, ans=0.1 2023-06-22 13:56:18,606 INFO [train.py:996] (1/4) Epoch 7, batch 24200, loss[loss=0.2214, simple_loss=0.3102, pruned_loss=0.06626, over 21805.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3296, pruned_loss=0.09197, over 4294217.54 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:57:30,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1243182.0, ans=22.5 2023-06-22 13:58:03,642 INFO [train.py:996] (1/4) Epoch 7, batch 24250, loss[loss=0.1813, simple_loss=0.2668, pruned_loss=0.04785, over 21202.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3272, pruned_loss=0.08592, over 4288731.05 frames. 
], batch size: 143, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:58:19,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1243362.0, ans=0.125 2023-06-22 13:58:28,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1243362.0, ans=0.125 2023-06-22 13:58:30,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1243362.0, ans=0.2 2023-06-22 13:58:44,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1243422.0, ans=0.1 2023-06-22 13:58:51,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-22 13:58:54,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1243422.0, ans=0.0 2023-06-22 13:59:07,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.091e+02 3.731e+02 4.711e+02 7.099e+02, threshold=7.462e+02, percent-clipped=0.0 2023-06-22 13:59:14,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1243542.0, ans=0.0 2023-06-22 13:59:14,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1243542.0, ans=0.0 2023-06-22 13:59:38,175 INFO [train.py:996] (1/4) Epoch 7, batch 24300, loss[loss=0.2143, simple_loss=0.2839, pruned_loss=0.07233, over 21858.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3201, pruned_loss=0.07961, over 4272706.25 frames. ], batch size: 107, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:00:18,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1243722.0, ans=0.0 2023-06-22 14:00:31,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1243722.0, ans=0.125 2023-06-22 14:00:48,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1243782.0, ans=0.2 2023-06-22 14:01:07,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1243842.0, ans=0.0 2023-06-22 14:01:17,187 INFO [train.py:996] (1/4) Epoch 7, batch 24350, loss[loss=0.2703, simple_loss=0.3333, pruned_loss=0.1037, over 21818.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.313, pruned_loss=0.07812, over 4275670.80 frames. 
], batch size: 247, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:01:25,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1243902.0, ans=0.1 2023-06-22 14:01:52,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1243962.0, ans=0.0 2023-06-22 14:02:31,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.282e+02 3.867e+02 5.054e+02 1.141e+03, threshold=7.734e+02, percent-clipped=5.0 2023-06-22 14:02:54,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1244142.0, ans=0.125 2023-06-22 14:02:56,628 INFO [train.py:996] (1/4) Epoch 7, batch 24400, loss[loss=0.2575, simple_loss=0.3335, pruned_loss=0.09072, over 21680.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3186, pruned_loss=0.08237, over 4274313.56 frames. ], batch size: 332, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:03:49,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1244322.0, ans=0.95 2023-06-22 14:03:49,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1244322.0, ans=0.125 2023-06-22 14:04:21,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1244442.0, ans=0.1 2023-06-22 14:04:42,199 INFO [train.py:996] (1/4) Epoch 7, batch 24450, loss[loss=0.2247, simple_loss=0.3155, pruned_loss=0.0669, over 21657.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3201, pruned_loss=0.08376, over 4260973.08 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:04:42,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1244502.0, ans=0.035 2023-06-22 14:04:45,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0 2023-06-22 14:04:51,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1244502.0, ans=0.035 2023-06-22 14:05:04,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1244562.0, ans=0.1 2023-06-22 14:05:57,132 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.418e+02 4.354e+02 5.626e+02 1.189e+03, threshold=8.709e+02, percent-clipped=8.0 2023-06-22 14:06:22,881 INFO [train.py:996] (1/4) Epoch 7, batch 24500, loss[loss=0.3088, simple_loss=0.3757, pruned_loss=0.1209, over 21501.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.322, pruned_loss=0.08448, over 4271615.38 frames. ], batch size: 507, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:06:42,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. 
limit=12.0 2023-06-22 14:07:19,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1244922.0, ans=0.1 2023-06-22 14:07:50,358 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:07:58,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-22 14:08:10,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1245102.0, ans=0.125 2023-06-22 14:08:11,151 INFO [train.py:996] (1/4) Epoch 7, batch 24550, loss[loss=0.2867, simple_loss=0.3589, pruned_loss=0.1073, over 21796.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3245, pruned_loss=0.08682, over 4276867.30 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:08:22,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1245102.0, ans=0.125 2023-06-22 14:09:25,440 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.383e+02 4.024e+02 4.711e+02 8.418e+02, threshold=8.048e+02, percent-clipped=0.0 2023-06-22 14:09:51,575 INFO [train.py:996] (1/4) Epoch 7, batch 24600, loss[loss=0.2401, simple_loss=0.2927, pruned_loss=0.09377, over 21251.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3221, pruned_loss=0.08879, over 4273575.24 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:11:03,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1245582.0, ans=0.125 2023-06-22 14:11:32,143 INFO [train.py:996] (1/4) Epoch 7, batch 24650, loss[loss=0.2425, simple_loss=0.3792, pruned_loss=0.05284, over 19740.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3152, pruned_loss=0.08723, over 4266415.34 frames. ], batch size: 702, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:11:32,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1245702.0, ans=0.125 2023-06-22 14:11:34,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1245702.0, ans=0.125 2023-06-22 14:12:06,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. 
limit=15.0 2023-06-22 14:12:06,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1245762.0, ans=0.0 2023-06-22 14:12:09,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1245822.0, ans=0.125 2023-06-22 14:12:18,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1245822.0, ans=0.125 2023-06-22 14:12:46,323 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.556e+02 4.332e+02 6.409e+02 1.192e+03, threshold=8.664e+02, percent-clipped=12.0 2023-06-22 14:12:46,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1245882.0, ans=0.125 2023-06-22 14:13:12,584 INFO [train.py:996] (1/4) Epoch 7, batch 24700, loss[loss=0.2447, simple_loss=0.3021, pruned_loss=0.09367, over 21787.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.313, pruned_loss=0.08503, over 4270994.28 frames. ], batch size: 102, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:13:38,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1246062.0, ans=0.125 2023-06-22 14:14:00,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1246122.0, ans=0.125 2023-06-22 14:14:52,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-22 14:14:54,490 INFO [train.py:996] (1/4) Epoch 7, batch 24750, loss[loss=0.2089, simple_loss=0.2739, pruned_loss=0.0719, over 21735.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3067, pruned_loss=0.08252, over 4274765.53 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:14:59,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-22 14:15:18,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246362.0, ans=0.1 2023-06-22 14:15:29,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246422.0, ans=0.1 2023-06-22 14:15:43,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1246422.0, ans=22.5 2023-06-22 14:16:07,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.441e+02 4.025e+02 5.655e+02 9.587e+02, threshold=8.050e+02, percent-clipped=3.0 2023-06-22 14:16:11,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-22 14:16:33,222 INFO [train.py:996] (1/4) Epoch 7, batch 24800, loss[loss=0.2678, simple_loss=0.3294, pruned_loss=0.103, over 21874.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3019, pruned_loss=0.08231, over 4261512.15 frames. 
], batch size: 351, lr: 4.21e-03, grad_scale: 32.0 2023-06-22 14:17:01,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1246662.0, ans=0.0 2023-06-22 14:17:01,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246662.0, ans=0.1 2023-06-22 14:18:02,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1246842.0, ans=0.0 2023-06-22 14:18:02,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1246842.0, ans=0.125 2023-06-22 14:18:14,878 INFO [train.py:996] (1/4) Epoch 7, batch 24850, loss[loss=0.212, simple_loss=0.2855, pruned_loss=0.06924, over 21820.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3025, pruned_loss=0.08338, over 4272926.55 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:18:19,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-22 14:18:20,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1246902.0, ans=0.0 2023-06-22 14:19:10,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1247022.0, ans=0.2 2023-06-22 14:19:31,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 3.674e+02 4.443e+02 5.678e+02 1.334e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-22 14:19:44,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1247142.0, ans=0.0 2023-06-22 14:19:56,853 INFO [train.py:996] (1/4) Epoch 7, batch 24900, loss[loss=0.3448, simple_loss=0.396, pruned_loss=0.1468, over 21399.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.306, pruned_loss=0.08462, over 4273845.35 frames. ], batch size: 471, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:20:15,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1247202.0, ans=0.0 2023-06-22 14:20:22,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.12 vs. limit=10.0 2023-06-22 14:20:32,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=12.0 2023-06-22 14:20:59,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1247382.0, ans=0.125 2023-06-22 14:21:29,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1247442.0, ans=0.2 2023-06-22 14:21:30,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1247442.0, ans=0.1 2023-06-22 14:21:43,162 INFO [train.py:996] (1/4) Epoch 7, batch 24950, loss[loss=0.3044, simple_loss=0.3587, pruned_loss=0.125, over 21527.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3136, pruned_loss=0.08876, over 4281592.45 frames. 
], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:21:49,768 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:23:02,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.624e+02 4.184e+02 5.539e+02 8.778e+02, threshold=8.368e+02, percent-clipped=0.0 2023-06-22 14:23:16,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1247742.0, ans=0.125 2023-06-22 14:23:26,317 INFO [train.py:996] (1/4) Epoch 7, batch 25000, loss[loss=0.2763, simple_loss=0.3535, pruned_loss=0.09958, over 21649.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3203, pruned_loss=0.09079, over 4285602.14 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:23:28,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1247802.0, ans=0.0 2023-06-22 14:24:17,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1247922.0, ans=0.125 2023-06-22 14:24:21,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1247922.0, ans=0.0 2023-06-22 14:24:48,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1248042.0, ans=0.125 2023-06-22 14:24:51,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1248042.0, ans=0.95 2023-06-22 14:25:10,074 INFO [train.py:996] (1/4) Epoch 7, batch 25050, loss[loss=0.2011, simple_loss=0.2579, pruned_loss=0.07215, over 21456.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3151, pruned_loss=0.08986, over 4284983.87 frames. ], batch size: 212, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:25:18,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-22 14:25:27,093 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-22 14:25:59,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1248222.0, ans=0.2 2023-06-22 14:26:28,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 3.310e+02 3.900e+02 4.690e+02 8.099e+02, threshold=7.799e+02, percent-clipped=0.0 2023-06-22 14:26:51,218 INFO [train.py:996] (1/4) Epoch 7, batch 25100, loss[loss=0.2385, simple_loss=0.2978, pruned_loss=0.08957, over 21552.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3094, pruned_loss=0.08822, over 4287342.46 frames. ], batch size: 391, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:26:54,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1248402.0, ans=0.04949747468305833 2023-06-22 14:26:57,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1248402.0, ans=0.0 2023-06-22 14:27:25,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.80 vs. 
limit=15.0 2023-06-22 14:28:04,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1248582.0, ans=0.0 2023-06-22 14:28:10,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1248642.0, ans=0.1 2023-06-22 14:28:25,034 INFO [train.py:996] (1/4) Epoch 7, batch 25150, loss[loss=0.229, simple_loss=0.3111, pruned_loss=0.07345, over 21875.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3119, pruned_loss=0.08571, over 4282855.31 frames. ], batch size: 371, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:28:32,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1248702.0, ans=0.125 2023-06-22 14:28:34,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1248702.0, ans=0.0 2023-06-22 14:28:44,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=22.5 2023-06-22 14:29:02,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1248762.0, ans=0.1 2023-06-22 14:29:42,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1248882.0, ans=0.125 2023-06-22 14:29:48,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.375e+02 4.359e+02 5.711e+02 9.666e+02, threshold=8.717e+02, percent-clipped=5.0 2023-06-22 14:29:50,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 14:30:02,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1248942.0, ans=0.07 2023-06-22 14:30:06,574 INFO [train.py:996] (1/4) Epoch 7, batch 25200, loss[loss=0.2069, simple_loss=0.2955, pruned_loss=0.05918, over 21434.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3109, pruned_loss=0.08338, over 4275073.47 frames. ], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:30:07,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1249002.0, ans=0.125 2023-06-22 14:31:12,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-22 14:31:45,957 INFO [train.py:996] (1/4) Epoch 7, batch 25250, loss[loss=0.2177, simple_loss=0.2687, pruned_loss=0.08341, over 21257.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.309, pruned_loss=0.08181, over 4263201.47 frames. ], batch size: 144, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:33:04,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.268e+02 4.285e+02 6.443e+02 1.274e+03, threshold=8.570e+02, percent-clipped=9.0 2023-06-22 14:33:16,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1249542.0, ans=0.125 2023-06-22 14:33:30,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1249602.0, ans=0.1 2023-06-22 14:33:32,040 INFO [train.py:996] (1/4) Epoch 7, batch 25300, loss[loss=0.1956, simple_loss=0.26, pruned_loss=0.06555, over 21363.00 frames. 
], tot_loss[loss=0.2327, simple_loss=0.3048, pruned_loss=0.08029, over 4245060.02 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:33:56,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1249662.0, ans=0.125 2023-06-22 14:34:13,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-22 14:35:02,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1249842.0, ans=0.125 2023-06-22 14:35:09,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1249842.0, ans=0.2 2023-06-22 14:35:12,517 INFO [train.py:996] (1/4) Epoch 7, batch 25350, loss[loss=0.2253, simple_loss=0.3299, pruned_loss=0.06031, over 20781.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.307, pruned_loss=0.08026, over 4233033.84 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:35:18,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.97 vs. limit=22.5 2023-06-22 14:35:57,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1250022.0, ans=0.0 2023-06-22 14:36:25,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.273e+02 3.948e+02 5.188e+02 1.185e+03, threshold=7.895e+02, percent-clipped=3.0 2023-06-22 14:36:47,888 INFO [train.py:996] (1/4) Epoch 7, batch 25400, loss[loss=0.2168, simple_loss=0.2708, pruned_loss=0.08141, over 21500.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3016, pruned_loss=0.07866, over 4198906.65 frames. ], batch size: 441, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:36:56,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1250202.0, ans=10.0 2023-06-22 14:37:54,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1250382.0, ans=0.125 2023-06-22 14:38:02,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1250382.0, ans=0.0 2023-06-22 14:38:12,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1250442.0, ans=0.125 2023-06-22 14:38:24,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-22 14:38:33,467 INFO [train.py:996] (1/4) Epoch 7, batch 25450, loss[loss=0.2191, simple_loss=0.3185, pruned_loss=0.05981, over 21817.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3036, pruned_loss=0.08165, over 4211011.49 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:38:58,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1250562.0, ans=0.0 2023-06-22 14:39:18,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-22 14:39:38,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.80 vs. 
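limit=15.0

The [scaling.py:182] ScheduledFloat entries throughout this log report the current value (ans) of hyper-parameters such as dropout probabilities, skip rates and bypass scales as a function of the global batch_count. The real ScheduledFloat implementation is in icefall's scaling.py and is not reproduced here; the sketch below only illustrates the general idea of a value that follows a piecewise-linear schedule over batch_count (the class name, breakpoints and values are assumptions for illustration, not the actual API).

```python
import bisect


class PiecewiseLinearSchedule:
    """Illustrative stand-in for a scheduled hyper-parameter: the value is
    linearly interpolated between (batch_count, value) breakpoints and held
    constant outside the first/last breakpoint."""

    def __init__(self, *points):
        # points: e.g. (0, 0.5), (2000, 0.25), (20000, 0.0)
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        frac = (batch_count - x0) / (x1 - x0)
        return y0 + frac * (y1 - y0)


# Example: a skip rate that decays from 0.5 to 0.0 over the first 20k batches
# and then stays at 0.0 (breakpoint values chosen purely for illustration).
skip_rate = PiecewiseLinearSchedule((0, 0.5), (2000, 0.25), (20000, 0.0))
print(skip_rate.value(1250682))  # late in training -> 0.0
```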
2023-06-22 14:39:43,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1250682.0, ans=0.5 2023-06-22 14:39:47,980 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.153e+02 3.590e+02 4.840e+02 1.017e+03, threshold=7.180e+02, percent-clipped=5.0 2023-06-22 14:40:16,028 INFO [train.py:996] (1/4) Epoch 7, batch 25500, loss[loss=0.1802, simple_loss=0.2696, pruned_loss=0.04538, over 21580.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3039, pruned_loss=0.07783, over 4226150.24 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:40:22,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1250802.0, ans=0.125 2023-06-22 14:40:31,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1250802.0, ans=0.2 2023-06-22 14:41:06,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1250922.0, ans=0.09899494936611666 2023-06-22 14:41:16,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1250982.0, ans=0.2 2023-06-22 14:41:56,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1251102.0, ans=0.0 2023-06-22 14:41:57,280 INFO [train.py:996] (1/4) Epoch 7, batch 25550, loss[loss=0.2323, simple_loss=0.3369, pruned_loss=0.06385, over 21633.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.311, pruned_loss=0.07802, over 4239178.19 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:42:14,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251102.0, ans=0.1 2023-06-22 14:42:28,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-22 14:42:29,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251162.0, ans=0.1 2023-06-22 14:42:56,567 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-22 14:43:13,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-22 14:43:16,633 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.498e+02 4.719e+02 6.187e+02 1.035e+03, threshold=9.438e+02, percent-clipped=13.0 2023-06-22 14:43:18,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1251342.0, ans=0.125 2023-06-22 14:43:22,128 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-06-22 14:43:39,340 INFO [train.py:996] (1/4) Epoch 7, batch 25600, loss[loss=0.2945, simple_loss=0.3592, pruned_loss=0.1149, over 21729.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3153, pruned_loss=0.07833, over 4247449.55 frames.
], batch size: 351, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:44:01,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-22 14:44:07,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1251462.0, ans=0.125 2023-06-22 14:44:25,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-06-22 14:45:08,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1251642.0, ans=0.125 2023-06-22 14:45:16,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=22.5 2023-06-22 14:45:20,820 INFO [train.py:996] (1/4) Epoch 7, batch 25650, loss[loss=0.2283, simple_loss=0.2886, pruned_loss=0.08403, over 21631.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3154, pruned_loss=0.08126, over 4258621.09 frames. ], batch size: 282, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:45:42,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1251762.0, ans=0.125 2023-06-22 14:45:58,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1251822.0, ans=0.125 2023-06-22 14:46:10,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1251822.0, ans=0.5 2023-06-22 14:46:10,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1251822.0, ans=0.04949747468305833 2023-06-22 14:46:37,528 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.765e+02 3.700e+02 4.314e+02 5.225e+02 1.015e+03, threshold=8.627e+02, percent-clipped=1.0 2023-06-22 14:46:40,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1251942.0, ans=0.125 2023-06-22 14:46:40,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1251942.0, ans=0.1 2023-06-22 14:47:04,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1252002.0, ans=0.125 2023-06-22 14:47:05,293 INFO [train.py:996] (1/4) Epoch 7, batch 25700, loss[loss=0.2039, simple_loss=0.2843, pruned_loss=0.06175, over 21370.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3129, pruned_loss=0.08252, over 4257382.63 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:47:35,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1252062.0, ans=0.2 2023-06-22 14:48:24,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-22 14:48:29,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1252242.0, ans=0.0 2023-06-22 14:48:47,664 INFO [train.py:996] (1/4) Epoch 7, batch 25750, loss[loss=0.3985, simple_loss=0.4769, pruned_loss=0.1601, over 21486.00 frames. 
], tot_loss[loss=0.2459, simple_loss=0.3191, pruned_loss=0.08638, over 4257120.70 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:49:41,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1252422.0, ans=0.0 2023-06-22 14:49:51,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=15.0 2023-06-22 14:50:13,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.834e+02 4.976e+02 6.309e+02 1.046e+03, threshold=9.952e+02, percent-clipped=8.0 2023-06-22 14:50:31,420 INFO [train.py:996] (1/4) Epoch 7, batch 25800, loss[loss=0.2517, simple_loss=0.3291, pruned_loss=0.08715, over 21741.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3309, pruned_loss=0.09006, over 4262512.24 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:50:35,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1252602.0, ans=0.0 2023-06-22 14:51:58,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-22 14:52:13,840 INFO [train.py:996] (1/4) Epoch 7, batch 25850, loss[loss=0.2584, simple_loss=0.3187, pruned_loss=0.09903, over 21515.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3326, pruned_loss=0.0898, over 4266662.03 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:52:40,045 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-22 14:53:22,852 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.650e-02 2023-06-22 14:53:33,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-22 14:53:39,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.379e+02 4.198e+02 5.392e+02 1.106e+03, threshold=8.396e+02, percent-clipped=2.0 2023-06-22 14:53:48,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1253142.0, ans=0.125 2023-06-22 14:53:49,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1253142.0, ans=0.125 2023-06-22 14:54:05,903 INFO [train.py:996] (1/4) Epoch 7, batch 25900, loss[loss=0.258, simple_loss=0.3453, pruned_loss=0.08538, over 21399.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3341, pruned_loss=0.09119, over 4271232.72 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:54:16,766 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:54:25,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1253202.0, ans=0.0 2023-06-22 14:54:42,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1253262.0, ans=0.125 2023-06-22 14:55:16,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. 
limit=22.5 2023-06-22 14:55:18,008 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.34 vs. limit=15.0 2023-06-22 14:55:28,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1253442.0, ans=0.2 2023-06-22 14:55:52,705 INFO [train.py:996] (1/4) Epoch 7, batch 25950, loss[loss=0.2774, simple_loss=0.3462, pruned_loss=0.1043, over 21574.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3416, pruned_loss=0.09434, over 4269751.53 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:56:15,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1253562.0, ans=0.125 2023-06-22 14:56:19,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1253562.0, ans=0.0 2023-06-22 14:56:29,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1253622.0, ans=0.125 2023-06-22 14:57:03,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1253682.0, ans=0.125 2023-06-22 14:57:10,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1253742.0, ans=0.125 2023-06-22 14:57:11,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.589e+02 4.463e+02 5.427e+02 9.904e+02, threshold=8.927e+02, percent-clipped=3.0 2023-06-22 14:57:37,693 INFO [train.py:996] (1/4) Epoch 7, batch 26000, loss[loss=0.3042, simple_loss=0.361, pruned_loss=0.1237, over 20035.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3402, pruned_loss=0.09264, over 4265343.51 frames. ], batch size: 703, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:58:26,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-22 14:58:49,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1253982.0, ans=0.2 2023-06-22 14:59:12,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1254042.0, ans=0.05 2023-06-22 14:59:18,565 INFO [train.py:996] (1/4) Epoch 7, batch 26050, loss[loss=0.2436, simple_loss=0.3043, pruned_loss=0.09145, over 21684.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3414, pruned_loss=0.09463, over 4266732.54 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:59:35,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=8.0 2023-06-22 14:59:44,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1254162.0, ans=0.125 2023-06-22 14:59:55,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. 
limit=15.0 2023-06-22 15:00:37,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.714e+02 3.622e+02 4.374e+02 5.321e+02 8.486e+02, threshold=8.748e+02, percent-clipped=0.0 2023-06-22 15:00:37,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254342.0, ans=0.1 2023-06-22 15:00:56,284 INFO [train.py:996] (1/4) Epoch 7, batch 26100, loss[loss=0.2406, simple_loss=0.308, pruned_loss=0.08659, over 21878.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3348, pruned_loss=0.09389, over 4275890.88 frames. ], batch size: 391, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:00:58,506 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:01:13,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1254462.0, ans=0.125 2023-06-22 15:01:13,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1254462.0, ans=0.2 2023-06-22 15:01:21,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1254462.0, ans=0.07 2023-06-22 15:01:30,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1254522.0, ans=0.0 2023-06-22 15:02:36,845 INFO [train.py:996] (1/4) Epoch 7, batch 26150, loss[loss=0.2498, simple_loss=0.3159, pruned_loss=0.09192, over 21474.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3308, pruned_loss=0.09281, over 4277307.89 frames. ], batch size: 194, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:03:44,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1254882.0, ans=0.125 2023-06-22 15:03:50,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1254882.0, ans=0.0 2023-06-22 15:04:04,145 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.200e+02 3.625e+02 4.544e+02 6.697e+02, threshold=7.250e+02, percent-clipped=0.0 2023-06-22 15:04:19,704 INFO [train.py:996] (1/4) Epoch 7, batch 26200, loss[loss=0.2102, simple_loss=0.311, pruned_loss=0.05469, over 21626.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3315, pruned_loss=0.09075, over 4281647.79 frames. ], batch size: 230, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:04:23,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1255002.0, ans=0.0 2023-06-22 15:04:31,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255002.0, ans=0.1 2023-06-22 15:04:35,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-22 15:04:55,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1255122.0, ans=0.125 2023-06-22 15:05:23,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. 
limit=15.0 2023-06-22 15:05:26,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255182.0, ans=0.1 2023-06-22 15:05:59,795 INFO [train.py:996] (1/4) Epoch 7, batch 26250, loss[loss=0.2481, simple_loss=0.3216, pruned_loss=0.08731, over 21471.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3333, pruned_loss=0.08907, over 4288047.49 frames. ], batch size: 131, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:06:19,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1255362.0, ans=0.0 2023-06-22 15:07:24,253 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.543e+02 3.568e+02 4.427e+02 6.029e+02 1.421e+03, threshold=8.855e+02, percent-clipped=13.0 2023-06-22 15:07:38,971 INFO [train.py:996] (1/4) Epoch 7, batch 26300, loss[loss=0.2283, simple_loss=0.2932, pruned_loss=0.08171, over 21482.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3294, pruned_loss=0.08971, over 4290931.95 frames. ], batch size: 211, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:07:52,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1255602.0, ans=0.125 2023-06-22 15:08:11,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255662.0, ans=0.1 2023-06-22 15:08:44,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1255722.0, ans=0.125 2023-06-22 15:08:52,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1255782.0, ans=0.125 2023-06-22 15:09:07,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1255842.0, ans=0.0 2023-06-22 15:09:10,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1255842.0, ans=0.125 2023-06-22 15:09:17,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1255842.0, ans=0.0 2023-06-22 15:09:19,836 INFO [train.py:996] (1/4) Epoch 7, batch 26350, loss[loss=0.2592, simple_loss=0.3257, pruned_loss=0.09639, over 21625.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3286, pruned_loss=0.09106, over 4296577.55 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:09:25,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1255902.0, ans=0.125 2023-06-22 15:09:27,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. 
limit=6.0 2023-06-22 15:10:22,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1256022.0, ans=15.0 2023-06-22 15:10:28,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1256082.0, ans=0.125 2023-06-22 15:10:31,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256082.0, ans=0.1 2023-06-22 15:10:39,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1256082.0, ans=0.0 2023-06-22 15:10:45,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 3.544e+02 4.038e+02 5.377e+02 1.137e+03, threshold=8.075e+02, percent-clipped=4.0 2023-06-22 15:10:49,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1256142.0, ans=0.0 2023-06-22 15:11:00,710 INFO [train.py:996] (1/4) Epoch 7, batch 26400, loss[loss=0.209, simple_loss=0.269, pruned_loss=0.07455, over 21271.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3231, pruned_loss=0.09135, over 4284123.02 frames. ], batch size: 549, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:11:14,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1256202.0, ans=0.2 2023-06-22 15:11:37,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1256262.0, ans=0.1 2023-06-22 15:11:51,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1256322.0, ans=0.125 2023-06-22 15:11:52,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1256322.0, ans=0.125 2023-06-22 15:12:02,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1256322.0, ans=0.0 2023-06-22 15:12:08,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-22 15:12:40,494 INFO [train.py:996] (1/4) Epoch 7, batch 26450, loss[loss=0.3391, simple_loss=0.4312, pruned_loss=0.1234, over 21408.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3222, pruned_loss=0.09005, over 4277925.03 frames. ], batch size: 507, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:14:01,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-22 15:14:08,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.688e+02 4.621e+02 8.318e+02 2.033e+03, threshold=9.242e+02, percent-clipped=27.0 2023-06-22 15:14:12,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1256742.0, ans=0.0 2023-06-22 15:14:37,756 INFO [train.py:996] (1/4) Epoch 7, batch 26500, loss[loss=0.1985, simple_loss=0.2643, pruned_loss=0.06639, over 21365.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3237, pruned_loss=0.08784, over 4267648.87 frames. 
], batch size: 176, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:15:02,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-22 15:15:33,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1256982.0, ans=0.07 2023-06-22 15:16:27,025 INFO [train.py:996] (1/4) Epoch 7, batch 26550, loss[loss=0.2603, simple_loss=0.3649, pruned_loss=0.07787, over 19852.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3215, pruned_loss=0.08518, over 4262697.36 frames. ], batch size: 703, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:16:28,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1257102.0, ans=0.07 2023-06-22 15:17:55,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.702e+02 5.125e+02 7.307e+02 1.235e+03, threshold=1.025e+03, percent-clipped=15.0 2023-06-22 15:18:08,973 INFO [train.py:996] (1/4) Epoch 7, batch 26600, loss[loss=0.2533, simple_loss=0.3167, pruned_loss=0.09493, over 21579.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3186, pruned_loss=0.08249, over 4257373.91 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:18:28,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1257462.0, ans=0.125 2023-06-22 15:19:29,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1257642.0, ans=0.0 2023-06-22 15:19:46,588 INFO [train.py:996] (1/4) Epoch 7, batch 26650, loss[loss=0.2063, simple_loss=0.2772, pruned_loss=0.06767, over 21459.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3124, pruned_loss=0.08189, over 4251467.97 frames. ], batch size: 195, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:19:59,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1257702.0, ans=0.125 2023-06-22 15:20:31,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1257822.0, ans=0.125 2023-06-22 15:21:03,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1257882.0, ans=0.2 2023-06-22 15:21:12,601 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.375e+02 4.059e+02 5.279e+02 9.271e+02, threshold=8.118e+02, percent-clipped=0.0 2023-06-22 15:21:14,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1257942.0, ans=0.1 2023-06-22 15:21:25,481 INFO [train.py:996] (1/4) Epoch 7, batch 26700, loss[loss=0.2234, simple_loss=0.2919, pruned_loss=0.07748, over 21574.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3051, pruned_loss=0.07833, over 4261674.66 frames. 
], batch size: 212, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:21:30,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1258002.0, ans=0.1 2023-06-22 15:21:32,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1258002.0, ans=0.2 2023-06-22 15:21:33,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-22 15:21:34,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1258002.0, ans=0.125 2023-06-22 15:22:04,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1258122.0, ans=0.035 2023-06-22 15:22:16,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1258122.0, ans=15.0 2023-06-22 15:22:42,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1258182.0, ans=0.0 2023-06-22 15:23:08,484 INFO [train.py:996] (1/4) Epoch 7, batch 26750, loss[loss=0.2666, simple_loss=0.3483, pruned_loss=0.09248, over 20702.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3054, pruned_loss=0.07749, over 4270565.85 frames. ], batch size: 607, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:23:10,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1258302.0, ans=0.125 2023-06-22 15:23:16,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1258302.0, ans=0.125 2023-06-22 15:23:54,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1258422.0, ans=0.5 2023-06-22 15:24:34,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1258542.0, ans=0.0 2023-06-22 15:24:37,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.345e+02 4.329e+02 5.688e+02 1.116e+03, threshold=8.657e+02, percent-clipped=5.0 2023-06-22 15:24:50,386 INFO [train.py:996] (1/4) Epoch 7, batch 26800, loss[loss=0.3414, simple_loss=0.3945, pruned_loss=0.1441, over 21442.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3138, pruned_loss=0.08259, over 4270153.79 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:25:58,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1258782.0, ans=0.125 2023-06-22 15:26:05,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1258782.0, ans=0.125 2023-06-22 15:26:10,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-22 15:26:37,074 INFO [train.py:996] (1/4) Epoch 7, batch 26850, loss[loss=0.2844, simple_loss=0.3451, pruned_loss=0.1119, over 21548.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3165, pruned_loss=0.08605, over 4257637.71 frames. 
], batch size: 389, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:26:38,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-22 15:26:46,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-22 15:27:29,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1259022.0, ans=0.015 2023-06-22 15:27:59,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 3.440e+02 3.815e+02 4.750e+02 7.886e+02, threshold=7.630e+02, percent-clipped=0.0 2023-06-22 15:28:06,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1259142.0, ans=0.125 2023-06-22 15:28:17,074 INFO [train.py:996] (1/4) Epoch 7, batch 26900, loss[loss=0.2102, simple_loss=0.2705, pruned_loss=0.07498, over 21537.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3079, pruned_loss=0.08441, over 4264910.42 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:28:17,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1259202.0, ans=0.0 2023-06-22 15:29:38,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1259442.0, ans=0.125 2023-06-22 15:29:56,691 INFO [train.py:996] (1/4) Epoch 7, batch 26950, loss[loss=0.26, simple_loss=0.3351, pruned_loss=0.09242, over 21788.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.307, pruned_loss=0.08387, over 4263022.94 frames. ], batch size: 371, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:30:28,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-22 15:30:40,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1259562.0, ans=0.125 2023-06-22 15:30:49,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1259622.0, ans=0.2 2023-06-22 15:31:18,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.421e+02 4.204e+02 5.403e+02 1.152e+03, threshold=8.409e+02, percent-clipped=7.0 2023-06-22 15:31:40,926 INFO [train.py:996] (1/4) Epoch 7, batch 27000, loss[loss=0.1999, simple_loss=0.2526, pruned_loss=0.07364, over 19949.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3076, pruned_loss=0.08202, over 4257853.89 frames. ], batch size: 703, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:31:40,926 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 15:31:59,798 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2427, simple_loss=0.3424, pruned_loss=0.07152, over 1796401.00 frames. 
2023-06-22 15:31:59,798 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 15:32:01,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1259802.0, ans=0.0 2023-06-22 15:32:32,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1259862.0, ans=0.125 2023-06-22 15:33:38,760 INFO [train.py:996] (1/4) Epoch 7, batch 27050, loss[loss=0.2121, simple_loss=0.3054, pruned_loss=0.05941, over 21799.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3088, pruned_loss=0.07864, over 4264909.65 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:34:25,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1260222.0, ans=0.125 2023-06-22 15:35:07,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.078e+02 3.767e+02 4.545e+02 7.806e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-22 15:35:23,474 INFO [train.py:996] (1/4) Epoch 7, batch 27100, loss[loss=0.2356, simple_loss=0.3323, pruned_loss=0.0694, over 21619.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3112, pruned_loss=0.08036, over 4273348.45 frames. ], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:35:27,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1260402.0, ans=0.1 2023-06-22 15:35:37,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1260402.0, ans=0.125 2023-06-22 15:36:27,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1260582.0, ans=0.2 2023-06-22 15:36:37,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1260642.0, ans=0.125 2023-06-22 15:37:02,879 INFO [train.py:996] (1/4) Epoch 7, batch 27150, loss[loss=0.3093, simple_loss=0.4189, pruned_loss=0.0998, over 20770.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3232, pruned_loss=0.08349, over 4272360.21 frames. ], batch size: 607, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:37:11,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1260702.0, ans=0.2 2023-06-22 15:37:13,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1260702.0, ans=0.125 2023-06-22 15:37:21,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1260762.0, ans=0.0 2023-06-22 15:37:27,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1260762.0, ans=0.125 2023-06-22 15:38:25,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1260942.0, ans=0.125 2023-06-22 15:38:31,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 4.019e+02 5.172e+02 7.242e+02 1.500e+03, threshold=1.034e+03, percent-clipped=23.0 2023-06-22 15:38:42,523 INFO [train.py:996] (1/4) Epoch 7, batch 27200, loss[loss=0.2583, simple_loss=0.3288, pruned_loss=0.09386, over 21797.00 frames. 
], tot_loss[loss=0.2528, simple_loss=0.3325, pruned_loss=0.08654, over 4271021.81 frames. ], batch size: 124, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:39:30,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1261122.0, ans=0.0 2023-06-22 15:39:36,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-22 15:40:23,233 INFO [train.py:996] (1/4) Epoch 7, batch 27250, loss[loss=0.2735, simple_loss=0.3396, pruned_loss=0.1037, over 21818.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3357, pruned_loss=0.09124, over 4272344.52 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 32.0 2023-06-22 15:40:24,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1261302.0, ans=0.125 2023-06-22 15:41:52,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1261542.0, ans=0.1 2023-06-22 15:41:54,926 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 3.741e+02 4.375e+02 5.438e+02 9.965e+02, threshold=8.750e+02, percent-clipped=0.0 2023-06-22 15:41:58,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1261542.0, ans=0.2 2023-06-22 15:42:02,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=15.0 2023-06-22 15:42:08,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1261602.0, ans=0.125 2023-06-22 15:42:08,917 INFO [train.py:996] (1/4) Epoch 7, batch 27300, loss[loss=0.2168, simple_loss=0.3353, pruned_loss=0.04914, over 20757.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3374, pruned_loss=0.09194, over 4278920.07 frames. ], batch size: 607, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:42:24,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-22 15:42:26,039 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:42:40,189 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:42:45,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-22 15:43:03,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1261722.0, ans=0.125 2023-06-22 15:43:06,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-22 15:43:16,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1261782.0, ans=0.0 2023-06-22 15:43:40,929 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:43:44,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1261842.0, ans=0.2 2023-06-22 15:43:48,261 INFO [train.py:996] (1/4) Epoch 7, batch 27350, loss[loss=0.267, simple_loss=0.337, pruned_loss=0.0985, over 21210.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3401, pruned_loss=0.09246, over 4274660.22 frames. ], batch size: 143, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:44:36,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-22 15:44:51,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1262082.0, ans=0.0 2023-06-22 15:45:06,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-22 15:45:15,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.868e+02 4.861e+02 6.523e+02 1.170e+03, threshold=9.722e+02, percent-clipped=10.0 2023-06-22 15:45:21,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-22 15:45:25,600 INFO [train.py:996] (1/4) Epoch 7, batch 27400, loss[loss=0.2245, simple_loss=0.2915, pruned_loss=0.07875, over 21400.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3345, pruned_loss=0.09225, over 4279075.53 frames. ], batch size: 131, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:45:44,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-22 15:46:19,936 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-22 15:46:27,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1262382.0, ans=0.125 2023-06-22 15:46:47,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-22 15:47:09,278 INFO [train.py:996] (1/4) Epoch 7, batch 27450, loss[loss=0.2794, simple_loss=0.3415, pruned_loss=0.1086, over 21861.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.327, pruned_loss=0.09017, over 4276083.54 frames. ], batch size: 118, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:47:53,899 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. 
limit=10.0 2023-06-22 15:48:21,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1262742.0, ans=0.0 2023-06-22 15:48:28,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 3.611e+02 4.155e+02 4.904e+02 8.641e+02, threshold=8.310e+02, percent-clipped=0.0 2023-06-22 15:48:30,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1262742.0, ans=0.125 2023-06-22 15:48:40,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-22 15:48:41,027 INFO [train.py:996] (1/4) Epoch 7, batch 27500, loss[loss=0.2294, simple_loss=0.3026, pruned_loss=0.07809, over 21910.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3256, pruned_loss=0.09035, over 4287644.00 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:49:03,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1262862.0, ans=0.2 2023-06-22 15:49:41,269 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 15:50:10,005 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-22 15:50:23,461 INFO [train.py:996] (1/4) Epoch 7, batch 27550, loss[loss=0.2283, simple_loss=0.2963, pruned_loss=0.08011, over 21361.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3202, pruned_loss=0.08665, over 4284579.27 frames. ], batch size: 471, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:50:35,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1263102.0, ans=0.125 2023-06-22 15:51:02,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1263222.0, ans=0.2 2023-06-22 15:51:04,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1263222.0, ans=0.0 2023-06-22 15:51:34,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1263342.0, ans=0.1 2023-06-22 15:51:47,990 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.463e+02 4.160e+02 5.154e+02 1.063e+03, threshold=8.319e+02, percent-clipped=3.0 2023-06-22 15:51:57,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1263342.0, ans=0.125 2023-06-22 15:52:01,061 INFO [train.py:996] (1/4) Epoch 7, batch 27600, loss[loss=0.197, simple_loss=0.2548, pruned_loss=0.06958, over 21158.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3139, pruned_loss=0.08555, over 4280585.50 frames. 
], batch size: 548, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:52:43,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1263522.0, ans=0.0 2023-06-22 15:53:30,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1263642.0, ans=0.1 2023-06-22 15:53:32,828 INFO [train.py:996] (1/4) Epoch 7, batch 27650, loss[loss=0.2566, simple_loss=0.3291, pruned_loss=0.09201, over 21755.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3082, pruned_loss=0.08492, over 4277952.02 frames. ], batch size: 332, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:53:58,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1263762.0, ans=0.125 2023-06-22 15:54:36,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1263882.0, ans=0.07 2023-06-22 15:54:58,115 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.224e+02 3.872e+02 5.377e+02 9.163e+02, threshold=7.744e+02, percent-clipped=1.0 2023-06-22 15:55:10,445 INFO [train.py:996] (1/4) Epoch 7, batch 27700, loss[loss=0.2202, simple_loss=0.2954, pruned_loss=0.07253, over 21461.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3079, pruned_loss=0.08249, over 4281596.72 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:55:24,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1264002.0, ans=0.0 2023-06-22 15:55:24,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1264002.0, ans=0.125 2023-06-22 15:55:32,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1264062.0, ans=10.0 2023-06-22 15:56:53,371 INFO [train.py:996] (1/4) Epoch 7, batch 27750, loss[loss=0.2109, simple_loss=0.295, pruned_loss=0.06336, over 21772.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3124, pruned_loss=0.08257, over 4278922.87 frames. ], batch size: 298, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:57:02,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-22 15:57:11,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=22.5 2023-06-22 15:57:13,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1264362.0, ans=0.125 2023-06-22 15:57:28,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.33 vs. 
limit=10.0 2023-06-22 15:57:39,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1264422.0, ans=0.125 2023-06-22 15:57:48,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1264482.0, ans=0.125 2023-06-22 15:57:53,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1264482.0, ans=0.0 2023-06-22 15:58:17,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.542e+02 4.380e+02 5.826e+02 1.163e+03, threshold=8.759e+02, percent-clipped=13.0 2023-06-22 15:58:26,091 INFO [train.py:996] (1/4) Epoch 7, batch 27800, loss[loss=0.1968, simple_loss=0.2667, pruned_loss=0.06349, over 21207.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3124, pruned_loss=0.08364, over 4284097.94 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:58:36,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1264602.0, ans=0.0 2023-06-22 15:58:45,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1264662.0, ans=0.125 2023-06-22 15:58:51,596 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:58:56,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1264662.0, ans=0.125 2023-06-22 16:00:09,315 INFO [train.py:996] (1/4) Epoch 7, batch 27850, loss[loss=0.238, simple_loss=0.3267, pruned_loss=0.07463, over 21796.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3113, pruned_loss=0.08438, over 4291138.96 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:00:14,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1264902.0, ans=0.0 2023-06-22 16:00:41,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=15.0 2023-06-22 16:01:06,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1265082.0, ans=0.2 2023-06-22 16:01:11,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-22 16:01:12,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1265082.0, ans=0.125 2023-06-22 16:01:44,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.686e+02 3.358e+02 4.006e+02 5.079e+02 1.358e+03, threshold=8.013e+02, percent-clipped=6.0 2023-06-22 16:01:50,960 INFO [train.py:996] (1/4) Epoch 7, batch 27900, loss[loss=0.2249, simple_loss=0.3157, pruned_loss=0.06704, over 21418.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.322, pruned_loss=0.0864, over 4285114.55 frames. ], batch size: 194, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:01:52,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. 
limit=15.0 2023-06-22 16:01:54,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-22 16:02:06,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.33 vs. limit=15.0 2023-06-22 16:02:41,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1265322.0, ans=0.1 2023-06-22 16:03:13,361 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:03:17,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-22 16:03:30,637 INFO [train.py:996] (1/4) Epoch 7, batch 27950, loss[loss=0.2274, simple_loss=0.327, pruned_loss=0.0639, over 21227.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.323, pruned_loss=0.08297, over 4279078.50 frames. ], batch size: 549, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:05:01,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.331e+02 4.215e+02 5.224e+02 1.025e+03, threshold=8.430e+02, percent-clipped=5.0 2023-06-22 16:05:13,121 INFO [train.py:996] (1/4) Epoch 7, batch 28000, loss[loss=0.2659, simple_loss=0.3305, pruned_loss=0.1006, over 21883.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3178, pruned_loss=0.07985, over 4287353.86 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:05:13,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1265802.0, ans=0.04949747468305833 2023-06-22 16:05:34,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1265862.0, ans=0.0 2023-06-22 16:06:24,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1265982.0, ans=0.0 2023-06-22 16:06:52,796 INFO [train.py:996] (1/4) Epoch 7, batch 28050, loss[loss=0.1775, simple_loss=0.2313, pruned_loss=0.0618, over 21283.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3145, pruned_loss=0.08103, over 4294766.34 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:06:59,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1266102.0, ans=0.125 2023-06-22 16:07:29,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. 
limit=22.5 2023-06-22 16:07:35,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1266222.0, ans=0.2 2023-06-22 16:07:47,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1266222.0, ans=0.0 2023-06-22 16:07:49,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1266222.0, ans=0.125 2023-06-22 16:07:53,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1266282.0, ans=0.125 2023-06-22 16:07:54,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=15.0 2023-06-22 16:08:25,038 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.562e+02 4.496e+02 6.296e+02 1.484e+03, threshold=8.993e+02, percent-clipped=8.0 2023-06-22 16:08:31,038 INFO [train.py:996] (1/4) Epoch 7, batch 28100, loss[loss=0.2235, simple_loss=0.2877, pruned_loss=0.0797, over 21445.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3128, pruned_loss=0.08109, over 4288064.31 frames. ], batch size: 389, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:08:48,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1266402.0, ans=0.125 2023-06-22 16:09:30,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1266582.0, ans=0.0 2023-06-22 16:09:54,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1266642.0, ans=0.0 2023-06-22 16:09:56,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1266642.0, ans=0.125 2023-06-22 16:10:07,977 INFO [train.py:996] (1/4) Epoch 7, batch 28150, loss[loss=0.204, simple_loss=0.265, pruned_loss=0.07149, over 21827.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3051, pruned_loss=0.08128, over 4289351.74 frames. ], batch size: 352, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:10:08,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1266702.0, ans=0.1 2023-06-22 16:11:08,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1266882.0, ans=0.125 2023-06-22 16:11:39,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.795e+02 3.363e+02 3.938e+02 4.881e+02 1.435e+03, threshold=7.877e+02, percent-clipped=4.0 2023-06-22 16:11:46,178 INFO [train.py:996] (1/4) Epoch 7, batch 28200, loss[loss=0.2462, simple_loss=0.3128, pruned_loss=0.08982, over 20706.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3041, pruned_loss=0.08254, over 4286764.85 frames. 
], batch size: 607, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:11:59,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1267002.0, ans=0.125 2023-06-22 16:11:59,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1267002.0, ans=0.95 2023-06-22 16:12:05,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-22 16:12:33,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-22 16:13:02,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1267242.0, ans=0.125 2023-06-22 16:13:24,188 INFO [train.py:996] (1/4) Epoch 7, batch 28250, loss[loss=0.2297, simple_loss=0.2929, pruned_loss=0.08319, over 21602.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3098, pruned_loss=0.08577, over 4272432.87 frames. ], batch size: 415, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:13:27,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267302.0, ans=0.125 2023-06-22 16:13:48,132 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:14:40,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1267482.0, ans=0.0 2023-06-22 16:14:56,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.948e+02 3.937e+02 4.825e+02 6.344e+02 1.441e+03, threshold=9.649e+02, percent-clipped=9.0 2023-06-22 16:15:06,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1267602.0, ans=0.1 2023-06-22 16:15:07,530 INFO [train.py:996] (1/4) Epoch 7, batch 28300, loss[loss=0.2121, simple_loss=0.2742, pruned_loss=0.07502, over 21819.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.307, pruned_loss=0.08369, over 4274213.25 frames. ], batch size: 102, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:15:15,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-22 16:15:50,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1267722.0, ans=0.04949747468305833 2023-06-22 16:16:00,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1267722.0, ans=0.1 2023-06-22 16:16:23,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-22 16:16:28,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1267842.0, ans=0.035 2023-06-22 16:16:51,647 INFO [train.py:996] (1/4) Epoch 7, batch 28350, loss[loss=0.2267, simple_loss=0.3116, pruned_loss=0.07092, over 21533.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3049, pruned_loss=0.0774, over 4274023.62 frames. 
], batch size: 389, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:17:07,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267962.0, ans=0.125 2023-06-22 16:17:32,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268022.0, ans=0.0 2023-06-22 16:17:54,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-22 16:17:57,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-22 16:18:09,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268142.0, ans=0.125 2023-06-22 16:18:16,790 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.369e+02 4.514e+02 6.821e+02 1.544e+03, threshold=9.028e+02, percent-clipped=6.0 2023-06-22 16:18:25,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1268142.0, ans=0.0 2023-06-22 16:18:26,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-22 16:18:28,286 INFO [train.py:996] (1/4) Epoch 7, batch 28400, loss[loss=0.2874, simple_loss=0.3552, pruned_loss=0.1098, over 21803.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3011, pruned_loss=0.07747, over 4271531.24 frames. ], batch size: 118, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:19:08,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1268322.0, ans=0.0 2023-06-22 16:19:43,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=8.0 2023-06-22 16:20:07,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-22 16:20:10,455 INFO [train.py:996] (1/4) Epoch 7, batch 28450, loss[loss=0.3368, simple_loss=0.3739, pruned_loss=0.1498, over 21682.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3064, pruned_loss=0.08172, over 4273957.64 frames. ], batch size: 508, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:20:28,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1268562.0, ans=0.125 2023-06-22 16:20:40,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1268562.0, ans=0.125 2023-06-22 16:21:44,202 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.754e+02 4.591e+02 6.046e+02 1.139e+03, threshold=9.182e+02, percent-clipped=3.0 2023-06-22 16:21:49,027 INFO [train.py:996] (1/4) Epoch 7, batch 28500, loss[loss=0.2296, simple_loss=0.3074, pruned_loss=0.07592, over 21947.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3076, pruned_loss=0.08365, over 4278414.18 frames. ], batch size: 316, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:22:07,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. 
limit=15.0 2023-06-22 16:22:25,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-22 16:22:43,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268922.0, ans=0.125 2023-06-22 16:23:33,164 INFO [train.py:996] (1/4) Epoch 7, batch 28550, loss[loss=0.2587, simple_loss=0.3538, pruned_loss=0.08181, over 21619.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3156, pruned_loss=0.08608, over 4277855.09 frames. ], batch size: 230, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:23:49,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1269162.0, ans=0.0 2023-06-22 16:24:29,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1269282.0, ans=0.2 2023-06-22 16:24:39,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269282.0, ans=0.1 2023-06-22 16:24:48,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269342.0, ans=0.1 2023-06-22 16:25:07,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.553e+02 4.352e+02 5.602e+02 1.151e+03, threshold=8.703e+02, percent-clipped=1.0 2023-06-22 16:25:10,386 INFO [train.py:996] (1/4) Epoch 7, batch 28600, loss[loss=0.2394, simple_loss=0.3169, pruned_loss=0.081, over 21857.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3208, pruned_loss=0.08772, over 4279575.17 frames. ], batch size: 372, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:25:26,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1269402.0, ans=0.125 2023-06-22 16:25:51,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269522.0, ans=0.1 2023-06-22 16:26:15,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1269582.0, ans=0.125 2023-06-22 16:26:22,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269582.0, ans=0.1 2023-06-22 16:26:40,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1269642.0, ans=0.125 2023-06-22 16:26:48,966 INFO [train.py:996] (1/4) Epoch 7, batch 28650, loss[loss=0.2116, simple_loss=0.2746, pruned_loss=0.07426, over 21678.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3163, pruned_loss=0.0874, over 4269291.68 frames. 
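The Whitening lines compare a per-module statistic (metric) against a limit; the message is only emitted when the metric exceeds the limit, i.e. when the module's output covariance has drifted far from white. The metric sketched below (1.0 for a perfectly isotropic covariance, up to num_channels when all variance collapses into one direction) matches the range of the logged values, but the exact scaling.py definition may differ, so treat it as an illustration.

```python
# Illustrative whiteness metric in the spirit of "Whitening: ... metric=... vs. limit=...":
# ~1.0 for an isotropic (white) covariance, up to num_channels for a rank-1 covariance.
# The exact scaling.py definition may differ.
import torch

def whiteness_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]              # (C, C) feature covariance
    c = x.shape[1]
    return (c * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

x_white = torch.randn(1000, 256)                    # roughly isotropic features
x_collapsed = torch.randn(1000, 1).repeat(1, 256)   # rank-1: every channel identical
print(whiteness_metric(x_white))      # close to 1
print(whiteness_metric(x_collapsed))  # close to 256, like the worst offenders in the log
```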
], batch size: 417, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:26:58,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1269702.0, ans=0.07 2023-06-22 16:27:03,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1269702.0, ans=0.0 2023-06-22 16:27:24,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1269762.0, ans=0.04949747468305833 2023-06-22 16:27:55,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-22 16:27:59,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1269882.0, ans=0.0 2023-06-22 16:28:23,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1269942.0, ans=0.125 2023-06-22 16:28:24,458 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 4.036e+02 5.079e+02 7.151e+02 1.408e+03, threshold=1.016e+03, percent-clipped=11.0 2023-06-22 16:28:32,062 INFO [train.py:996] (1/4) Epoch 7, batch 28700, loss[loss=0.2514, simple_loss=0.3219, pruned_loss=0.09041, over 21350.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3157, pruned_loss=0.0887, over 4261184.39 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:28:57,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1270062.0, ans=0.0 2023-06-22 16:30:09,831 INFO [train.py:996] (1/4) Epoch 7, batch 28750, loss[loss=0.1958, simple_loss=0.29, pruned_loss=0.05076, over 21644.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3167, pruned_loss=0.08927, over 4268528.84 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:30:24,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1270302.0, ans=0.0 2023-06-22 16:31:01,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1270422.0, ans=0.95 2023-06-22 16:31:17,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-22 16:31:24,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1270482.0, ans=0.1 2023-06-22 16:31:26,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1270482.0, ans=0.0 2023-06-22 16:31:45,884 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 3.311e+02 3.906e+02 4.905e+02 1.219e+03, threshold=7.811e+02, percent-clipped=5.0 2023-06-22 16:31:49,114 INFO [train.py:996] (1/4) Epoch 7, batch 28800, loss[loss=0.2705, simple_loss=0.3396, pruned_loss=0.1007, over 21571.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3198, pruned_loss=0.08957, over 4272562.11 frames. 
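In every per-batch loss entry the three figures are consistent with the reported loss being the pruned RNN-T loss plus half of the simple loss, e.g. 0.5 * 0.3219 + 0.0904 ≈ 0.2514 in the batch 28700 entry above. A small sketch of that combination; the function name is a placeholder and the 0.5 weight is inferred from the logged numbers rather than read from the training script.

```python
# The logged losses are consistent with: loss = 0.5 * simple_loss + pruned_loss.
# Names are placeholders; the 0.5 weight is inferred from the numbers in the log.
def combine_rnnt_losses(simple_loss: float, pruned_loss: float,
                        simple_loss_scale: float = 0.5) -> float:
    return simple_loss_scale * simple_loss + pruned_loss

# batch 28700 above: loss=0.2514, simple_loss=0.3219, pruned_loss=0.09041
print(round(combine_rnnt_losses(0.3219, 0.09041), 4))  # -> 0.2514
```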
], batch size: 263, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:32:11,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1270662.0, ans=0.0 2023-06-22 16:32:48,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-22 16:33:30,951 INFO [train.py:996] (1/4) Epoch 7, batch 28850, loss[loss=0.2764, simple_loss=0.3354, pruned_loss=0.1087, over 21478.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3209, pruned_loss=0.09058, over 4279643.56 frames. ], batch size: 131, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:33:53,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1270962.0, ans=0.125 2023-06-22 16:34:51,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-22 16:34:55,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1271142.0, ans=0.0 2023-06-22 16:35:03,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 3.986e+02 4.811e+02 7.732e+02 1.672e+03, threshold=9.622e+02, percent-clipped=22.0 2023-06-22 16:35:03,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1271142.0, ans=0.125 2023-06-22 16:35:06,509 INFO [train.py:996] (1/4) Epoch 7, batch 28900, loss[loss=0.2472, simple_loss=0.3116, pruned_loss=0.09136, over 21481.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3243, pruned_loss=0.09266, over 4283216.43 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:35:10,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1271202.0, ans=0.2 2023-06-22 16:35:28,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1271262.0, ans=0.125 2023-06-22 16:36:16,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1271382.0, ans=0.2 2023-06-22 16:36:42,231 INFO [train.py:996] (1/4) Epoch 7, batch 28950, loss[loss=0.2156, simple_loss=0.2927, pruned_loss=0.06928, over 21712.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3222, pruned_loss=0.09117, over 4285129.03 frames. ], batch size: 247, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:36:54,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1271502.0, ans=0.0 2023-06-22 16:37:08,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.42 vs. 
limit=10.0 2023-06-22 16:37:16,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1271562.0, ans=0.125 2023-06-22 16:38:20,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271742.0, ans=0.1 2023-06-22 16:38:23,464 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.694e+02 3.738e+02 4.662e+02 6.144e+02 1.231e+03, threshold=9.324e+02, percent-clipped=1.0 2023-06-22 16:38:31,537 INFO [train.py:996] (1/4) Epoch 7, batch 29000, loss[loss=0.282, simple_loss=0.3524, pruned_loss=0.1057, over 21831.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3264, pruned_loss=0.09074, over 4281396.95 frames. ], batch size: 124, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:38:48,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1271862.0, ans=0.0 2023-06-22 16:39:01,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1271862.0, ans=0.125 2023-06-22 16:39:09,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1271922.0, ans=0.125 2023-06-22 16:40:05,727 INFO [train.py:996] (1/4) Epoch 7, batch 29050, loss[loss=0.2303, simple_loss=0.301, pruned_loss=0.07977, over 21771.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3267, pruned_loss=0.09198, over 4291905.36 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:40:07,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1272102.0, ans=0.2 2023-06-22 16:40:11,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1272102.0, ans=0.0 2023-06-22 16:40:15,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1272102.0, ans=0.0 2023-06-22 16:40:26,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1272162.0, ans=0.125 2023-06-22 16:40:47,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1272222.0, ans=0.125 2023-06-22 16:40:55,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1272222.0, ans=0.125 2023-06-22 16:41:01,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272282.0, ans=0.1 2023-06-22 16:41:39,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.571e+02 4.314e+02 5.526e+02 9.643e+02, threshold=8.628e+02, percent-clipped=2.0 2023-06-22 16:41:42,556 INFO [train.py:996] (1/4) Epoch 7, batch 29100, loss[loss=0.2163, simple_loss=0.2792, pruned_loss=0.0767, over 21629.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3169, pruned_loss=0.0889, over 4288619.12 frames. ], batch size: 416, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:41:50,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1272402.0, ans=0.125 2023-06-22 16:42:06,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. 
limit=22.5 2023-06-22 16:42:17,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-22 16:42:29,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272522.0, ans=0.1 2023-06-22 16:42:32,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1272522.0, ans=0.0 2023-06-22 16:42:52,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1272642.0, ans=0.0 2023-06-22 16:43:19,107 INFO [train.py:996] (1/4) Epoch 7, batch 29150, loss[loss=0.1973, simple_loss=0.254, pruned_loss=0.07025, over 20760.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3157, pruned_loss=0.08706, over 4290324.89 frames. ], batch size: 608, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:44:11,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1272822.0, ans=0.1 2023-06-22 16:44:24,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1272882.0, ans=0.125 2023-06-22 16:44:52,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.751e+02 3.598e+02 4.101e+02 5.513e+02 1.275e+03, threshold=8.201e+02, percent-clipped=6.0 2023-06-22 16:44:55,717 INFO [train.py:996] (1/4) Epoch 7, batch 29200, loss[loss=0.192, simple_loss=0.2605, pruned_loss=0.06178, over 21222.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3124, pruned_loss=0.08656, over 4281773.50 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:45:13,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1273062.0, ans=0.125 2023-06-22 16:45:31,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-06-22 16:45:44,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1273122.0, ans=0.0 2023-06-22 16:45:53,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-22 16:45:55,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273182.0, ans=0.1 2023-06-22 16:46:35,594 INFO [train.py:996] (1/4) Epoch 7, batch 29250, loss[loss=0.2203, simple_loss=0.3107, pruned_loss=0.06493, over 21738.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.309, pruned_loss=0.08316, over 4286753.03 frames. 
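Each tot_loss[... over N frames] figure is a decayed running average rather than the current batch: the frame count hovers around 4.28M, roughly 200 times the typical per-batch frame count, which is consistent with the accumulated totals being scaled by (1 - 1/200) before each new batch is added. A minimal sketch of that accumulator follows; the decay interval of 200 is inferred from the logged frame counts, not taken from the script.

```python
# Sketch of the decayed running average behind "tot_loss[... over N frames]":
# each batch the accumulated (loss_sum, frames) pair is scaled by (1 - 1/reset_interval)
# before the new batch is added, so frames settle near reset_interval * frames_per_batch
# (~200 * 21k ~= 4.2M, matching the log).  reset_interval=200 is inferred, not quoted.
class RunningLoss:
    def __init__(self, reset_interval: int = 200):
        self.decay = 1.0 - 1.0 / reset_interval
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_per_frame: float, batch_frames: float) -> float:
        self.loss_sum = self.loss_sum * self.decay + batch_loss_per_frame * batch_frames
        self.frames = self.frames * self.decay + batch_frames
        return self.loss_sum / self.frames   # the value printed as tot_loss

tot = RunningLoss()
for _ in range(1000):
    tot.update(0.23, 21500.0)
print(round(tot.frames))  # settles near 200 * 21500 ~= 4.3M frames
```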
], batch size: 282, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:46:56,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1273362.0, ans=0.125 2023-06-22 16:47:20,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1273422.0, ans=0.125 2023-06-22 16:48:10,638 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 3.452e+02 4.218e+02 5.407e+02 1.140e+03, threshold=8.437e+02, percent-clipped=6.0 2023-06-22 16:48:13,957 INFO [train.py:996] (1/4) Epoch 7, batch 29300, loss[loss=0.2109, simple_loss=0.2756, pruned_loss=0.07311, over 21703.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.312, pruned_loss=0.08286, over 4287432.99 frames. ], batch size: 112, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:48:29,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273602.0, ans=0.1 2023-06-22 16:49:01,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-22 16:49:52,322 INFO [train.py:996] (1/4) Epoch 7, batch 29350, loss[loss=0.2196, simple_loss=0.2817, pruned_loss=0.07875, over 21639.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3087, pruned_loss=0.08241, over 4281127.28 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:50:02,023 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:50:13,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1273902.0, ans=0.0 2023-06-22 16:50:43,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1274022.0, ans=0.125 2023-06-22 16:51:29,120 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.690e+02 4.735e+02 5.967e+02 1.066e+03, threshold=9.470e+02, percent-clipped=8.0 2023-06-22 16:51:30,509 INFO [train.py:996] (1/4) Epoch 7, batch 29400, loss[loss=0.2173, simple_loss=0.3103, pruned_loss=0.06218, over 21392.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3079, pruned_loss=0.08058, over 4280283.31 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:51:53,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1274262.0, ans=0.0 2023-06-22 16:52:04,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1274262.0, ans=0.125 2023-06-22 16:52:07,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1274322.0, ans=0.5 2023-06-22 16:52:20,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1274322.0, ans=0.125 2023-06-22 16:53:09,817 INFO [train.py:996] (1/4) Epoch 7, batch 29450, loss[loss=0.2773, simple_loss=0.3704, pruned_loss=0.0921, over 19731.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3051, pruned_loss=0.07968, over 4261597.83 frames. 
], batch size: 703, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:53:12,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274502.0, ans=0.1 2023-06-22 16:54:09,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-22 16:54:26,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1274742.0, ans=0.125 2023-06-22 16:54:41,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 4.158e+02 5.429e+02 7.361e+02 1.574e+03, threshold=1.086e+03, percent-clipped=7.0 2023-06-22 16:54:43,572 INFO [train.py:996] (1/4) Epoch 7, batch 29500, loss[loss=0.2451, simple_loss=0.2997, pruned_loss=0.09522, over 21515.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3117, pruned_loss=0.08378, over 4269234.22 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:54:48,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1274802.0, ans=0.125 2023-06-22 16:55:42,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1274982.0, ans=0.125 2023-06-22 16:55:56,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1275042.0, ans=0.015 2023-06-22 16:56:21,931 INFO [train.py:996] (1/4) Epoch 7, batch 29550, loss[loss=0.2161, simple_loss=0.2821, pruned_loss=0.07506, over 21901.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3115, pruned_loss=0.08575, over 4285832.46 frames. ], batch size: 316, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:56:48,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1275162.0, ans=0.0 2023-06-22 16:57:15,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1275222.0, ans=0.04949747468305833 2023-06-22 16:57:52,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-22 16:57:59,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.724e+02 4.638e+02 7.025e+02 1.809e+03, threshold=9.276e+02, percent-clipped=5.0 2023-06-22 16:58:01,074 INFO [train.py:996] (1/4) Epoch 7, batch 29600, loss[loss=0.2705, simple_loss=0.3393, pruned_loss=0.1009, over 20142.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3174, pruned_loss=0.08849, over 4281231.81 frames. ], batch size: 702, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:59:37,957 INFO [train.py:996] (1/4) Epoch 7, batch 29650, loss[loss=0.2105, simple_loss=0.2764, pruned_loss=0.07228, over 21754.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3146, pruned_loss=0.08439, over 4276387.53 frames. 
], batch size: 247, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:59:47,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1275702.0, ans=0.125 2023-06-22 16:59:52,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1275762.0, ans=0.125 2023-06-22 16:59:57,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1275762.0, ans=0.2 2023-06-22 17:01:16,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.826e+02 4.944e+02 6.202e+02 1.000e+03, threshold=9.888e+02, percent-clipped=1.0 2023-06-22 17:01:16,772 INFO [train.py:996] (1/4) Epoch 7, batch 29700, loss[loss=0.2454, simple_loss=0.3486, pruned_loss=0.07113, over 21436.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3174, pruned_loss=0.08432, over 4267896.33 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:02:41,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1276242.0, ans=0.125 2023-06-22 17:02:55,351 INFO [train.py:996] (1/4) Epoch 7, batch 29750, loss[loss=0.2314, simple_loss=0.3267, pruned_loss=0.06806, over 21688.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.322, pruned_loss=0.0843, over 4274932.82 frames. ], batch size: 263, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:03:00,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1276302.0, ans=0.2 2023-06-22 17:03:14,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1276362.0, ans=0.125 2023-06-22 17:04:32,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.419e+02 3.988e+02 5.118e+02 1.049e+03, threshold=7.976e+02, percent-clipped=2.0 2023-06-22 17:04:32,193 INFO [train.py:996] (1/4) Epoch 7, batch 29800, loss[loss=0.2396, simple_loss=0.3014, pruned_loss=0.08892, over 21533.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3228, pruned_loss=0.08513, over 4281273.52 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:04:42,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1276602.0, ans=0.0 2023-06-22 17:04:47,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1276662.0, ans=0.125 2023-06-22 17:05:39,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-22 17:06:10,689 INFO [train.py:996] (1/4) Epoch 7, batch 29850, loss[loss=0.2036, simple_loss=0.2851, pruned_loss=0.06109, over 21395.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3179, pruned_loss=0.08255, over 4279174.35 frames. 
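The grad_scale value printed with each loss line moves in powers of two (32.0, 16.0 and 8.0 in the nearby batches): with fp16 training the loss is multiplied by this scale before backward, the scale is halved when non-finite gradients appear, and it grows back after a run of clean steps. Below is a minimal AMP-style training step using PyTorch's GradScaler; whether the recipe uses exactly this class (and this init_scale) is an assumption.

```python
# Minimal fp16 training step showing where a grad_scale like the one in the log
# comes from.  Whether the recipe uses exactly this class is an assumption.
import torch

model = torch.nn.Linear(80, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=4.16e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

features = torch.randn(8, 80, device="cuda")
targets = torch.randn(8, 512, device="cuda")

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(features), targets)

scaler.scale(loss).backward()   # backward on the scaled loss
scaler.step(optimizer)          # unscales grads, skips the step if any are non-finite
scaler.update()                 # halve on overflow, otherwise slowly grow the scale
print(scaler.get_scale())       # the number logged as grad_scale
```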
], batch size: 194, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:01,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1277022.0, ans=0.125 2023-06-22 17:07:01,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1277022.0, ans=0.09899494936611666 2023-06-22 17:07:02,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1277022.0, ans=0.0 2023-06-22 17:07:11,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1277082.0, ans=0.0 2023-06-22 17:07:48,880 INFO [train.py:996] (1/4) Epoch 7, batch 29900, loss[loss=0.2632, simple_loss=0.3293, pruned_loss=0.09854, over 21487.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3174, pruned_loss=0.08494, over 4282649.11 frames. ], batch size: 194, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:50,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 3.325e+02 3.983e+02 5.006e+02 1.426e+03, threshold=7.966e+02, percent-clipped=5.0 2023-06-22 17:07:54,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277202.0, ans=0.125 2023-06-22 17:08:06,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1277202.0, ans=0.2 2023-06-22 17:08:53,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1277322.0, ans=0.2 2023-06-22 17:09:17,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1277442.0, ans=0.125 2023-06-22 17:09:33,578 INFO [train.py:996] (1/4) Epoch 7, batch 29950, loss[loss=0.3183, simple_loss=0.3734, pruned_loss=0.1316, over 21306.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.323, pruned_loss=0.08921, over 4278408.92 frames. ], batch size: 143, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:09:42,014 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:09:58,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1277562.0, ans=0.0 2023-06-22 17:09:58,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1277562.0, ans=0.0 2023-06-22 17:10:00,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1277562.0, ans=0.125 2023-06-22 17:10:41,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1277682.0, ans=0.025 2023-06-22 17:11:07,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1277742.0, ans=0.0 2023-06-22 17:11:13,182 INFO [train.py:996] (1/4) Epoch 7, batch 30000, loss[loss=0.2226, simple_loss=0.3083, pruned_loss=0.06849, over 21406.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3235, pruned_loss=0.08826, over 4270845.87 frames. 
], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:11:13,183 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 17:11:34,229 INFO [train.py:1028] (1/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3461, pruned_loss=0.0743, over 1796401.00 frames. 2023-06-22 17:11:34,230 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 17:11:36,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.810e+02 4.424e+02 5.666e+02 1.321e+03, threshold=8.847e+02, percent-clipped=8.0 2023-06-22 17:11:49,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1277802.0, ans=0.07 2023-06-22 17:12:08,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277862.0, ans=0.1 2023-06-22 17:12:13,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1277862.0, ans=0.2 2023-06-22 17:12:13,775 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:12:47,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1277982.0, ans=0.1 2023-06-22 17:12:58,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1278042.0, ans=0.0 2023-06-22 17:13:17,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1278042.0, ans=0.125 2023-06-22 17:13:26,189 INFO [train.py:996] (1/4) Epoch 7, batch 30050, loss[loss=0.2674, simple_loss=0.3659, pruned_loss=0.08445, over 21691.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3265, pruned_loss=0.08547, over 4269100.44 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:13:33,416 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 17:14:38,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278282.0, ans=0.1 2023-06-22 17:15:03,323 INFO [train.py:996] (1/4) Epoch 7, batch 30100, loss[loss=0.2093, simple_loss=0.2736, pruned_loss=0.0725, over 21828.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3233, pruned_loss=0.08433, over 4270719.17 frames. 
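At regular intervals (batch 30000 above, and again at the first batch of epoch 8 further down) training pauses to compute a validation loss over the dev set and then reports the peak GPU memory seen so far. A bare-bones version of that bookkeeping follows; model, valid_dl and compute_loss are placeholders for the recipe's own objects, only torch.cuda.max_memory_allocated is a real PyTorch call.

```python
# Bare-bones validation pass plus the memory report printed after it.
# model, valid_dl and compute_loss are placeholders, not the recipe's functions.
import torch

def validate(model, valid_dl, compute_loss, device="cuda:1"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)   # per-batch loss sum and frames
            tot_loss += loss
            tot_frames += num_frames
    model.train()
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {mb}MB")
```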
], batch size: 107, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:15:04,879 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.663e+02 4.877e+02 6.196e+02 1.469e+03, threshold=9.754e+02, percent-clipped=9.0 2023-06-22 17:15:15,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1278402.0, ans=0.125 2023-06-22 17:15:42,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278522.0, ans=0.1 2023-06-22 17:15:47,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1278522.0, ans=0.125 2023-06-22 17:15:48,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1278522.0, ans=0.125 2023-06-22 17:16:16,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1278582.0, ans=0.0 2023-06-22 17:16:41,802 INFO [train.py:996] (1/4) Epoch 7, batch 30150, loss[loss=0.2573, simple_loss=0.3394, pruned_loss=0.08763, over 21494.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3194, pruned_loss=0.08566, over 4272742.94 frames. ], batch size: 131, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:16:54,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1278702.0, ans=0.125 2023-06-22 17:18:24,063 INFO [train.py:996] (1/4) Epoch 7, batch 30200, loss[loss=0.2501, simple_loss=0.3476, pruned_loss=0.07624, over 21307.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3213, pruned_loss=0.08399, over 4271951.21 frames. ], batch size: 549, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:18:25,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.500e+02 4.327e+02 6.195e+02 1.104e+03, threshold=8.654e+02, percent-clipped=5.0 2023-06-22 17:18:37,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1279002.0, ans=0.04949747468305833 2023-06-22 17:18:44,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279062.0, ans=0.1 2023-06-22 17:18:55,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=12.0 2023-06-22 17:19:21,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1279122.0, ans=0.125 2023-06-22 17:20:04,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1279242.0, ans=0.125 2023-06-22 17:20:09,060 INFO [train.py:996] (1/4) Epoch 7, batch 30250, loss[loss=0.263, simple_loss=0.3664, pruned_loss=0.07979, over 21810.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3293, pruned_loss=0.08687, over 4270234.59 frames. 
], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:21:00,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279422.0, ans=0.1 2023-06-22 17:21:14,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1279482.0, ans=0.125 2023-06-22 17:21:16,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1279482.0, ans=0.125 2023-06-22 17:21:42,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279542.0, ans=0.0 2023-06-22 17:21:48,236 INFO [train.py:996] (1/4) Epoch 7, batch 30300, loss[loss=0.2593, simple_loss=0.3106, pruned_loss=0.104, over 21501.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3282, pruned_loss=0.0877, over 4269601.38 frames. ], batch size: 441, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:21:49,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 4.197e+02 5.234e+02 7.319e+02 1.495e+03, threshold=1.047e+03, percent-clipped=13.0 2023-06-22 17:21:56,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1279602.0, ans=0.0 2023-06-22 17:22:00,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1279602.0, ans=0.0 2023-06-22 17:22:19,913 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-06-22 17:22:56,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1279782.0, ans=0.125 2023-06-22 17:23:33,746 INFO [train.py:996] (1/4) Epoch 7, batch 30350, loss[loss=0.2931, simple_loss=0.3875, pruned_loss=0.09934, over 21222.00 frames. ], tot_loss[loss=0.254, simple_loss=0.33, pruned_loss=0.08902, over 4254130.26 frames. ], batch size: 549, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:23:43,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1279902.0, ans=0.0 2023-06-22 17:23:43,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1279902.0, ans=0.2 2023-06-22 17:24:09,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-22 17:24:56,731 INFO [train.py:996] (1/4) Epoch 7, batch 30400, loss[loss=0.2303, simple_loss=0.2732, pruned_loss=0.09368, over 20338.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3201, pruned_loss=0.08612, over 4238216.10 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:24:58,187 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.001e+02 6.030e+02 8.810e+02 1.556e+03, threshold=1.206e+03, percent-clipped=18.0 2023-06-22 17:25:10,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. 
limit=12.0 2023-06-22 17:25:21,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1280262.0, ans=0.2 2023-06-22 17:25:27,649 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:26:10,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1280442.0, ans=0.2 2023-06-22 17:26:15,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-22 17:26:21,159 INFO [train.py:996] (1/4) Epoch 7, batch 30450, loss[loss=0.2669, simple_loss=0.3881, pruned_loss=0.07284, over 19936.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3224, pruned_loss=0.08631, over 4184840.82 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:26:29,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1280502.0, ans=0.05 2023-06-22 17:26:32,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1280502.0, ans=0.0 2023-06-22 17:26:38,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1280562.0, ans=0.125 2023-06-22 17:27:08,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.44 vs. limit=15.0 2023-06-22 17:27:17,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1280682.0, ans=0.125 2023-06-22 17:29:06,178 INFO [train.py:996] (1/4) Epoch 8, batch 0, loss[loss=0.2475, simple_loss=0.3061, pruned_loss=0.09443, over 21350.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3061, pruned_loss=0.09443, over 21350.00 frames. ], batch size: 473, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:29:06,179 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 17:29:21,694 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2437, simple_loss=0.3524, pruned_loss=0.06749, over 1796401.00 frames. 2023-06-22 17:29:21,695 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 17:29:28,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1280772.0, ans=0.125 2023-06-22 17:29:30,715 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.318e+02 7.236e+02 1.078e+03 1.767e+03 4.535e+03, threshold=2.157e+03, percent-clipped=44.0 2023-06-22 17:30:36,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280952.0, ans=0.1 2023-06-22 17:31:00,103 INFO [train.py:996] (1/4) Epoch 8, batch 50, loss[loss=0.2481, simple_loss=0.3391, pruned_loss=0.07861, over 19903.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3298, pruned_loss=0.08828, over 970473.12 frames. 
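The learning rate drops from 4.15e-03 in the last epoch-7 batches to 3.86e-03 at the start of epoch 8 because the schedule depends on both the global batch count and the epoch number. The Eden-style rule sketched below reproduces values of this size; the constants used here (0.045, 7500, 1.5) and the formula itself are assumptions, checked only against the lr values printed in the log.

```python
# Eden-style learning-rate rule that reproduces the magnitude of the logged lr values
# (4.15e-03 late in epoch 7, 3.86e-03 at the start of epoch 8).  The constants and the
# exact formula are assumptions checked only against the log.
def eden_lr(batch: int, epoch: int, base_lr: float = 0.045,
            lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

print(f"{eden_lr(185_000, 7):.2e}")  # ~4.15e-03, matching late epoch 7
print(f"{eden_lr(188_000, 8):.2e}")  # ~3.86e-03, matching the start of epoch 8
```

The per-step factor shrinks slowly with the batch count, while the per-epoch factor gives the visible step change at each epoch boundary.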
], batch size: 703, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:31:05,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1281072.0, ans=0.2 2023-06-22 17:31:17,769 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:31:55,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-22 17:32:33,435 INFO [train.py:996] (1/4) Epoch 8, batch 100, loss[loss=0.2911, simple_loss=0.3722, pruned_loss=0.105, over 21462.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3472, pruned_loss=0.09166, over 1703849.45 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:32:44,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.771e+02 3.609e+02 4.818e+02 6.662e+02 2.202e+03, threshold=9.637e+02, percent-clipped=1.0 2023-06-22 17:32:51,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1281432.0, ans=0.125 2023-06-22 17:32:59,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1281432.0, ans=0.2 2023-06-22 17:33:35,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1281552.0, ans=0.0 2023-06-22 17:33:59,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1281612.0, ans=0.0 2023-06-22 17:34:06,767 INFO [train.py:996] (1/4) Epoch 8, batch 150, loss[loss=0.2523, simple_loss=0.3558, pruned_loss=0.07435, over 21864.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3502, pruned_loss=0.09129, over 2272830.39 frames. ], batch size: 371, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:34:49,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. limit=10.0 2023-06-22 17:34:52,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-22 17:34:55,265 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:35:09,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1281852.0, ans=0.125 2023-06-22 17:35:11,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-22 17:35:31,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1281912.0, ans=10.0 2023-06-22 17:35:39,239 INFO [train.py:996] (1/4) Epoch 8, batch 200, loss[loss=0.3089, simple_loss=0.3609, pruned_loss=0.1284, over 21429.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.344, pruned_loss=0.08852, over 2719578.66 frames. 
], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:35:47,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1281972.0, ans=10.0 2023-06-22 17:35:49,891 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.926e+02 4.076e+02 5.203e+02 6.716e+02 1.490e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-22 17:35:50,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1281972.0, ans=0.2 2023-06-22 17:35:57,431 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-22 17:36:18,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1282032.0, ans=0.125 2023-06-22 17:36:48,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-22 17:37:18,896 INFO [train.py:996] (1/4) Epoch 8, batch 250, loss[loss=0.2377, simple_loss=0.2996, pruned_loss=0.08788, over 21581.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3377, pruned_loss=0.08923, over 3072383.62 frames. ], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:38:25,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1282452.0, ans=0.125 2023-06-22 17:38:26,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1282452.0, ans=0.125 2023-06-22 17:38:34,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1282512.0, ans=0.0 2023-06-22 17:38:48,962 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:38:54,780 INFO [train.py:996] (1/4) Epoch 8, batch 300, loss[loss=0.242, simple_loss=0.3164, pruned_loss=0.08382, over 21236.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3311, pruned_loss=0.08689, over 3337767.91 frames. ], batch size: 176, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:38:55,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1282572.0, ans=0.035 2023-06-22 17:38:57,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1282572.0, ans=0.125 2023-06-22 17:39:06,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 4.058e+02 5.447e+02 7.307e+02 1.512e+03, threshold=1.089e+03, percent-clipped=7.0 2023-06-22 17:40:35,720 INFO [train.py:996] (1/4) Epoch 8, batch 350, loss[loss=0.2072, simple_loss=0.2759, pruned_loss=0.06928, over 21194.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3246, pruned_loss=0.08546, over 3543533.65 frames. ], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:40:48,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. 
limit=22.5 2023-06-22 17:41:45,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1283052.0, ans=0.125 2023-06-22 17:42:12,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1283112.0, ans=0.0 2023-06-22 17:42:12,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1283112.0, ans=0.0 2023-06-22 17:42:14,821 INFO [train.py:996] (1/4) Epoch 8, batch 400, loss[loss=0.2317, simple_loss=0.2871, pruned_loss=0.08812, over 21317.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.317, pruned_loss=0.08374, over 3707588.20 frames. ], batch size: 473, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:42:26,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.922e+02 3.737e+02 4.920e+02 6.486e+02 1.177e+03, threshold=9.840e+02, percent-clipped=3.0 2023-06-22 17:43:27,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1283352.0, ans=0.125 2023-06-22 17:43:37,116 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:43:40,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1283412.0, ans=0.125 2023-06-22 17:43:54,192 INFO [train.py:996] (1/4) Epoch 8, batch 450, loss[loss=0.2236, simple_loss=0.2803, pruned_loss=0.08343, over 21780.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.314, pruned_loss=0.08153, over 3836107.06 frames. ], batch size: 352, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:32,281 INFO [train.py:996] (1/4) Epoch 8, batch 500, loss[loss=0.2172, simple_loss=0.2674, pruned_loss=0.08348, over 20734.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3184, pruned_loss=0.08209, over 3930596.40 frames. ], batch size: 609, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:41,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.00 vs. limit=22.5 2023-06-22 17:45:59,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.923e+02 5.554e+02 7.720e+02 1.831e+03, threshold=1.111e+03, percent-clipped=13.0 2023-06-22 17:46:01,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283832.0, ans=0.125 2023-06-22 17:46:29,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1283892.0, ans=0.0 2023-06-22 17:46:49,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283952.0, ans=0.1 2023-06-22 17:47:02,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1284012.0, ans=0.1 2023-06-22 17:47:02,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1284012.0, ans=0.125 2023-06-22 17:47:15,027 INFO [train.py:996] (1/4) Epoch 8, batch 550, loss[loss=0.2655, simple_loss=0.3767, pruned_loss=0.07718, over 21677.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3236, pruned_loss=0.08238, over 4007818.74 frames. 
], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:47:16,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1284072.0, ans=0.125 2023-06-22 17:47:30,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-22 17:48:07,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-22 17:48:23,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1284252.0, ans=0.0 2023-06-22 17:48:37,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1284312.0, ans=0.0 2023-06-22 17:48:46,263 INFO [train.py:996] (1/4) Epoch 8, batch 600, loss[loss=0.2487, simple_loss=0.3204, pruned_loss=0.08844, over 21510.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3226, pruned_loss=0.08189, over 4066642.28 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:49:08,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.844e+02 4.934e+02 7.871e+02 2.167e+03, threshold=9.868e+02, percent-clipped=19.0 2023-06-22 17:49:20,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1284432.0, ans=0.125 2023-06-22 17:50:23,344 INFO [train.py:996] (1/4) Epoch 8, batch 650, loss[loss=0.2575, simple_loss=0.3181, pruned_loss=0.09844, over 15431.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3226, pruned_loss=0.08205, over 4107444.76 frames. ], batch size: 63, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:50:29,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-22 17:50:30,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1284672.0, ans=0.125 2023-06-22 17:50:50,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284732.0, ans=0.1 2023-06-22 17:51:00,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0 2023-06-22 17:51:07,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1284792.0, ans=0.125 2023-06-22 17:51:24,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1284852.0, ans=0.125 2023-06-22 17:51:34,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1284852.0, ans=0.0 2023-06-22 17:51:52,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-22 17:51:56,374 INFO [train.py:996] (1/4) Epoch 8, batch 700, loss[loss=0.2279, simple_loss=0.3554, pruned_loss=0.05022, over 20836.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3251, pruned_loss=0.08321, over 4148613.51 frames. 
], batch size: 608, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:52:03,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1284972.0, ans=0.125 2023-06-22 17:52:20,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.941e+02 4.186e+02 5.348e+02 7.319e+02 1.415e+03, threshold=1.070e+03, percent-clipped=6.0 2023-06-22 17:52:25,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1285032.0, ans=0.2 2023-06-22 17:52:38,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1285092.0, ans=0.125 2023-06-22 17:53:29,831 INFO [train.py:996] (1/4) Epoch 8, batch 750, loss[loss=0.2414, simple_loss=0.304, pruned_loss=0.08937, over 21280.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3225, pruned_loss=0.08397, over 4173449.16 frames. ], batch size: 143, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:54:02,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.26 vs. limit=6.0 2023-06-22 17:54:06,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1285332.0, ans=0.035 2023-06-22 17:54:19,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1285392.0, ans=0.2 2023-06-22 17:55:07,706 INFO [train.py:996] (1/4) Epoch 8, batch 800, loss[loss=0.2949, simple_loss=0.3949, pruned_loss=0.09744, over 20910.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3193, pruned_loss=0.08446, over 4194123.52 frames. ], batch size: 608, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:55:26,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1285572.0, ans=0.1 2023-06-22 17:55:35,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 4.052e+02 4.658e+02 6.687e+02 1.387e+03, threshold=9.317e+02, percent-clipped=3.0 2023-06-22 17:56:17,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1285752.0, ans=0.2 2023-06-22 17:56:54,349 INFO [train.py:996] (1/4) Epoch 8, batch 850, loss[loss=0.2125, simple_loss=0.2905, pruned_loss=0.06723, over 21667.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3166, pruned_loss=0.08459, over 4221299.90 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:57:01,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-22 17:57:48,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1286052.0, ans=0.2 2023-06-22 17:58:07,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1286112.0, ans=0.125 2023-06-22 17:58:14,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-22 17:58:26,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. 
limit=15.0 2023-06-22 17:58:33,078 INFO [train.py:996] (1/4) Epoch 8, batch 900, loss[loss=0.2428, simple_loss=0.3095, pruned_loss=0.08802, over 21926.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3131, pruned_loss=0.08378, over 4238062.28 frames. ], batch size: 316, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:58:38,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1286172.0, ans=0.2 2023-06-22 17:58:41,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286172.0, ans=0.1 2023-06-22 17:58:43,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1286172.0, ans=0.0 2023-06-22 17:58:47,804 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 3.949e+02 5.086e+02 6.787e+02 1.769e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-22 17:59:18,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1286292.0, ans=0.2 2023-06-22 17:59:21,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1286292.0, ans=0.0 2023-06-22 17:59:21,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1286292.0, ans=0.125 2023-06-22 17:59:49,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1286412.0, ans=0.125 2023-06-22 17:59:51,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1286412.0, ans=0.0 2023-06-22 18:00:07,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286412.0, ans=0.1 2023-06-22 18:00:12,919 INFO [train.py:996] (1/4) Epoch 8, batch 950, loss[loss=0.243, simple_loss=0.3035, pruned_loss=0.09129, over 21294.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3112, pruned_loss=0.08324, over 4252826.66 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:00:29,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1286532.0, ans=0.125 2023-06-22 18:01:07,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1286652.0, ans=0.2 2023-06-22 18:01:09,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-22 18:01:51,067 INFO [train.py:996] (1/4) Epoch 8, batch 1000, loss[loss=0.3583, simple_loss=0.4597, pruned_loss=0.1285, over 19775.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3127, pruned_loss=0.08435, over 4266035.84 frames. 
], batch size: 703, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:01:55,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1286772.0, ans=0.0 2023-06-22 18:02:05,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 3.565e+02 4.375e+02 6.228e+02 1.305e+03, threshold=8.750e+02, percent-clipped=2.0 2023-06-22 18:03:26,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1287012.0, ans=0.05 2023-06-22 18:03:30,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1287012.0, ans=0.1 2023-06-22 18:03:31,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1287072.0, ans=0.125 2023-06-22 18:03:32,524 INFO [train.py:996] (1/4) Epoch 8, batch 1050, loss[loss=0.199, simple_loss=0.2623, pruned_loss=0.06787, over 21232.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3119, pruned_loss=0.08421, over 4273843.74 frames. ], batch size: 143, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:03:49,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1287132.0, ans=0.0 2023-06-22 18:03:52,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1287132.0, ans=0.2 2023-06-22 18:04:04,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 18:04:27,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1287252.0, ans=0.125 2023-06-22 18:04:29,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1287252.0, ans=0.125 2023-06-22 18:05:07,894 INFO [train.py:996] (1/4) Epoch 8, batch 1100, loss[loss=0.2821, simple_loss=0.344, pruned_loss=0.1101, over 21547.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3129, pruned_loss=0.0848, over 4279290.91 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:05:21,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.476e+02 5.815e+02 7.362e+02 1.371e+03, threshold=1.163e+03, percent-clipped=15.0 2023-06-22 18:05:34,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1287432.0, ans=0.125 2023-06-22 18:05:48,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1287492.0, ans=0.125 2023-06-22 18:06:18,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287552.0, ans=0.125 2023-06-22 18:06:33,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1287612.0, ans=0.125 2023-06-22 18:06:43,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1287612.0, ans=0.0 2023-06-22 18:06:48,171 INFO [train.py:996] (1/4) Epoch 8, batch 1150, loss[loss=0.235, simple_loss=0.3227, pruned_loss=0.07365, over 21782.00 frames. 
], tot_loss[loss=0.2417, simple_loss=0.314, pruned_loss=0.08474, over 4286714.74 frames. ], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:06:50,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=12.0 2023-06-22 18:06:52,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1287672.0, ans=0.0 2023-06-22 18:07:03,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287732.0, ans=0.125 2023-06-22 18:07:18,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1287732.0, ans=0.125 2023-06-22 18:07:36,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1287792.0, ans=0.125 2023-06-22 18:07:37,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1287792.0, ans=0.125 2023-06-22 18:07:39,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1287792.0, ans=0.0 2023-06-22 18:08:24,635 INFO [train.py:996] (1/4) Epoch 8, batch 1200, loss[loss=0.3013, simple_loss=0.3699, pruned_loss=0.1164, over 21558.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3164, pruned_loss=0.08506, over 4284762.15 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:08:26,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287972.0, ans=0.1 2023-06-22 18:08:43,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.897e+02 4.987e+02 7.014e+02 1.089e+03, threshold=9.974e+02, percent-clipped=0.0 2023-06-22 18:09:28,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288152.0, ans=0.125 2023-06-22 18:09:36,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1288152.0, ans=0.125 2023-06-22 18:09:41,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1288152.0, ans=0.0 2023-06-22 18:09:47,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1288212.0, ans=0.125 2023-06-22 18:09:51,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1288212.0, ans=0.0 2023-06-22 18:10:03,646 INFO [train.py:996] (1/4) Epoch 8, batch 1250, loss[loss=0.2154, simple_loss=0.2947, pruned_loss=0.06808, over 21271.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3174, pruned_loss=0.08501, over 4288929.30 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:10:04,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.87 vs. 
limit=22.5 2023-06-22 18:10:35,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288332.0, ans=0.125 2023-06-22 18:11:14,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-22 18:11:18,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1288452.0, ans=0.125 2023-06-22 18:11:44,063 INFO [train.py:996] (1/4) Epoch 8, batch 1300, loss[loss=0.2179, simple_loss=0.2941, pruned_loss=0.07089, over 21819.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3175, pruned_loss=0.08472, over 4289556.85 frames. ], batch size: 282, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:11:54,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.80 vs. limit=15.0 2023-06-22 18:12:04,933 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.208e+02 5.615e+02 7.044e+02 1.517e+03, threshold=1.123e+03, percent-clipped=9.0 2023-06-22 18:12:19,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1288692.0, ans=0.2 2023-06-22 18:13:07,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-22 18:13:13,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288812.0, ans=0.125 2023-06-22 18:13:19,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1288812.0, ans=15.0 2023-06-22 18:13:22,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1288812.0, ans=0.2 2023-06-22 18:13:24,629 INFO [train.py:996] (1/4) Epoch 8, batch 1350, loss[loss=0.2527, simple_loss=0.3461, pruned_loss=0.07962, over 21689.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.321, pruned_loss=0.08609, over 4290276.27 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:13:51,051 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-22 18:14:00,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1288932.0, ans=0.125 2023-06-22 18:14:06,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1288992.0, ans=0.1 2023-06-22 18:14:07,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288992.0, ans=0.125 2023-06-22 18:14:35,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1289052.0, ans=0.0 2023-06-22 18:15:05,870 INFO [train.py:996] (1/4) Epoch 8, batch 1400, loss[loss=0.2128, simple_loss=0.2802, pruned_loss=0.07273, over 21687.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.318, pruned_loss=0.08509, over 4295347.45 frames. 
], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:15:14,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1289172.0, ans=0.025 2023-06-22 18:15:16,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-22 18:15:26,934 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.742e+02 3.782e+02 4.959e+02 6.793e+02 1.586e+03, threshold=9.917e+02, percent-clipped=6.0 2023-06-22 18:16:41,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1289412.0, ans=0.125 2023-06-22 18:16:45,945 INFO [train.py:996] (1/4) Epoch 8, batch 1450, loss[loss=0.2664, simple_loss=0.3341, pruned_loss=0.09934, over 21911.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3169, pruned_loss=0.08585, over 4296846.38 frames. ], batch size: 316, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:16:47,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1289472.0, ans=0.125 2023-06-22 18:18:22,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1289712.0, ans=0.1 2023-06-22 18:18:25,455 INFO [train.py:996] (1/4) Epoch 8, batch 1500, loss[loss=0.2726, simple_loss=0.342, pruned_loss=0.1016, over 21333.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3186, pruned_loss=0.08724, over 4297434.11 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:18:46,217 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 3.712e+02 4.836e+02 6.899e+02 1.421e+03, threshold=9.672e+02, percent-clipped=7.0 2023-06-22 18:19:17,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1289892.0, ans=0.0 2023-06-22 18:20:07,043 INFO [train.py:996] (1/4) Epoch 8, batch 1550, loss[loss=0.2201, simple_loss=0.3095, pruned_loss=0.06538, over 21570.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3157, pruned_loss=0.08575, over 4289925.21 frames. ], batch size: 389, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:20:20,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1290072.0, ans=0.125 2023-06-22 18:21:05,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1290192.0, ans=0.5 2023-06-22 18:21:06,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1290192.0, ans=0.2 2023-06-22 18:21:43,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1290312.0, ans=0.2 2023-06-22 18:21:48,304 INFO [train.py:996] (1/4) Epoch 8, batch 1600, loss[loss=0.2644, simple_loss=0.3254, pruned_loss=0.1017, over 21605.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.316, pruned_loss=0.08548, over 4277115.35 frames. 
], batch size: 471, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:22:16,349 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.959e+02 3.911e+02 5.598e+02 7.259e+02 1.641e+03, threshold=1.120e+03, percent-clipped=8.0 2023-06-22 18:22:33,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1290432.0, ans=0.07 2023-06-22 18:23:06,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1290552.0, ans=0.2 2023-06-22 18:23:36,626 INFO [train.py:996] (1/4) Epoch 8, batch 1650, loss[loss=0.2557, simple_loss=0.3408, pruned_loss=0.08527, over 20683.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3152, pruned_loss=0.08402, over 4274986.45 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:23:58,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1290672.0, ans=0.1 2023-06-22 18:24:35,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1290852.0, ans=0.1 2023-06-22 18:24:54,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-22 18:25:17,562 INFO [train.py:996] (1/4) Epoch 8, batch 1700, loss[loss=0.2103, simple_loss=0.2806, pruned_loss=0.06997, over 21448.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3182, pruned_loss=0.08514, over 4275403.73 frames. ], batch size: 195, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:25:44,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=12.0 2023-06-22 18:25:45,115 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 3.944e+02 4.852e+02 6.481e+02 1.409e+03, threshold=9.704e+02, percent-clipped=2.0 2023-06-22 18:25:49,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1291032.0, ans=0.125 2023-06-22 18:26:00,629 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:26:05,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1291092.0, ans=0.2 2023-06-22 18:26:08,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1291092.0, ans=0.125 2023-06-22 18:26:46,566 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:27:04,041 INFO [train.py:996] (1/4) Epoch 8, batch 1750, loss[loss=0.3139, simple_loss=0.3826, pruned_loss=0.1226, over 21382.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3167, pruned_loss=0.08267, over 4272444.06 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:28:30,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1291512.0, ans=0.125 2023-06-22 18:28:39,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.82 vs. 
limit=15.0 2023-06-22 18:28:46,007 INFO [train.py:996] (1/4) Epoch 8, batch 1800, loss[loss=0.2139, simple_loss=0.2748, pruned_loss=0.07651, over 21492.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3166, pruned_loss=0.08107, over 4269311.50 frames. ], batch size: 212, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:28:49,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1291572.0, ans=0.125 2023-06-22 18:28:53,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1291572.0, ans=0.125 2023-06-22 18:29:04,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 3.886e+02 4.989e+02 8.763e+02 2.376e+03, threshold=9.977e+02, percent-clipped=20.0 2023-06-22 18:30:18,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1291812.0, ans=0.2 2023-06-22 18:30:26,423 INFO [train.py:996] (1/4) Epoch 8, batch 1850, loss[loss=0.1915, simple_loss=0.2613, pruned_loss=0.06079, over 21265.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3155, pruned_loss=0.07933, over 4268654.00 frames. ], batch size: 176, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:30:30,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1291872.0, ans=0.125 2023-06-22 18:30:44,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=8.0 2023-06-22 18:30:55,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1291932.0, ans=0.1 2023-06-22 18:31:55,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-22 18:32:06,611 INFO [train.py:996] (1/4) Epoch 8, batch 1900, loss[loss=0.2089, simple_loss=0.2694, pruned_loss=0.0742, over 21196.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3168, pruned_loss=0.08118, over 4268314.04 frames. ], batch size: 144, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:32:10,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1292172.0, ans=0.125 2023-06-22 18:32:15,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1292172.0, ans=0.0 2023-06-22 18:32:25,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.836e+02 4.981e+02 6.397e+02 1.530e+03, threshold=9.962e+02, percent-clipped=6.0 2023-06-22 18:32:38,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1292232.0, ans=0.2 2023-06-22 18:33:13,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-22 18:33:48,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1292472.0, ans=0.125 2023-06-22 18:33:49,924 INFO [train.py:996] (1/4) Epoch 8, batch 1950, loss[loss=0.2903, simple_loss=0.3221, pruned_loss=0.1293, over 21362.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3117, pruned_loss=0.081, over 4276595.17 frames. 
], batch size: 507, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:33:55,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1292472.0, ans=0.0 2023-06-22 18:34:36,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1292592.0, ans=0.1 2023-06-22 18:34:57,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=22.5 2023-06-22 18:35:12,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1292652.0, ans=0.125 2023-06-22 18:35:31,204 INFO [train.py:996] (1/4) Epoch 8, batch 2000, loss[loss=0.2401, simple_loss=0.3288, pruned_loss=0.07573, over 21581.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3081, pruned_loss=0.07781, over 4259716.56 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:35:38,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1292772.0, ans=0.125 2023-06-22 18:35:51,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-22 18:35:54,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 4.321e+02 6.258e+02 9.587e+02 1.701e+03, threshold=1.252e+03, percent-clipped=22.0 2023-06-22 18:35:58,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1292832.0, ans=0.125 2023-06-22 18:36:05,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-22 18:37:13,665 INFO [train.py:996] (1/4) Epoch 8, batch 2050, loss[loss=0.2662, simple_loss=0.3453, pruned_loss=0.09353, over 21869.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3111, pruned_loss=0.07825, over 4263508.65 frames. ], batch size: 371, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:37:21,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1293072.0, ans=0.0 2023-06-22 18:37:37,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1293132.0, ans=0.125 2023-06-22 18:37:53,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293192.0, ans=0.125 2023-06-22 18:37:55,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1293192.0, ans=0.0 2023-06-22 18:37:57,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-22 18:38:18,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. 
limit=22.5 2023-06-22 18:38:20,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293252.0, ans=0.1 2023-06-22 18:38:43,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293312.0, ans=0.1 2023-06-22 18:38:47,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-22 18:38:53,107 INFO [train.py:996] (1/4) Epoch 8, batch 2100, loss[loss=0.2662, simple_loss=0.3328, pruned_loss=0.09982, over 21635.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3134, pruned_loss=0.07969, over 4273810.52 frames. ], batch size: 471, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:38:58,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1293372.0, ans=0.0 2023-06-22 18:39:02,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1293372.0, ans=0.125 2023-06-22 18:39:17,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 4.056e+02 5.317e+02 7.512e+02 1.644e+03, threshold=1.063e+03, percent-clipped=5.0 2023-06-22 18:39:43,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1293492.0, ans=0.07 2023-06-22 18:40:05,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-22 18:40:31,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1293612.0, ans=0.125 2023-06-22 18:40:33,842 INFO [train.py:996] (1/4) Epoch 8, batch 2150, loss[loss=0.2017, simple_loss=0.2791, pruned_loss=0.06218, over 21595.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3115, pruned_loss=0.08104, over 4280785.44 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:40:37,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1293672.0, ans=0.125 2023-06-22 18:40:57,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1293732.0, ans=0.2 2023-06-22 18:41:23,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293792.0, ans=0.1 2023-06-22 18:42:10,363 INFO [train.py:996] (1/4) Epoch 8, batch 2200, loss[loss=0.2012, simple_loss=0.2834, pruned_loss=0.05948, over 21764.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3153, pruned_loss=0.08203, over 4284785.42 frames. ], batch size: 247, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:42:13,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1293972.0, ans=0.125 2023-06-22 18:42:33,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.009e+02 3.898e+02 4.994e+02 6.578e+02 1.550e+03, threshold=9.987e+02, percent-clipped=10.0 2023-06-22 18:43:28,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=22.5 2023-06-22 18:43:37,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1294212.0, ans=0.0 2023-06-22 18:43:50,446 INFO [train.py:996] (1/4) Epoch 8, batch 2250, loss[loss=0.2154, simple_loss=0.2751, pruned_loss=0.07779, over 21353.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.311, pruned_loss=0.08018, over 4282854.57 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:44:14,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-22 18:44:24,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1294332.0, ans=0.125 2023-06-22 18:44:52,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-22 18:44:52,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1294392.0, ans=0.1 2023-06-22 18:45:07,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=15.0 2023-06-22 18:45:07,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1294452.0, ans=0.0 2023-06-22 18:45:09,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1294452.0, ans=0.0 2023-06-22 18:45:17,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1294512.0, ans=0.125 2023-06-22 18:45:29,779 INFO [train.py:996] (1/4) Epoch 8, batch 2300, loss[loss=0.2177, simple_loss=0.2856, pruned_loss=0.07494, over 21849.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3057, pruned_loss=0.07965, over 4269774.34 frames. ], batch size: 107, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:45:53,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 4.015e+02 5.277e+02 7.353e+02 1.540e+03, threshold=1.055e+03, percent-clipped=5.0 2023-06-22 18:45:55,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1294632.0, ans=0.0 2023-06-22 18:46:39,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-22 18:47:01,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1294812.0, ans=0.07 2023-06-22 18:47:11,796 INFO [train.py:996] (1/4) Epoch 8, batch 2350, loss[loss=0.2039, simple_loss=0.2891, pruned_loss=0.05937, over 21704.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3056, pruned_loss=0.08013, over 4268131.20 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:47:34,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1294932.0, ans=0.125 2023-06-22 18:48:53,778 INFO [train.py:996] (1/4) Epoch 8, batch 2400, loss[loss=0.2489, simple_loss=0.3277, pruned_loss=0.08499, over 21367.00 frames. 
], tot_loss[loss=0.2358, simple_loss=0.3073, pruned_loss=0.08217, over 4268132.87 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:49:18,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 4.939e+02 6.884e+02 8.991e+02 1.831e+03, threshold=1.377e+03, percent-clipped=16.0 2023-06-22 18:49:33,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1295232.0, ans=0.125 2023-06-22 18:50:01,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1295352.0, ans=0.035 2023-06-22 18:50:10,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295352.0, ans=0.1 2023-06-22 18:50:17,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1295412.0, ans=0.2 2023-06-22 18:50:35,320 INFO [train.py:996] (1/4) Epoch 8, batch 2450, loss[loss=0.1934, simple_loss=0.2626, pruned_loss=0.06213, over 21630.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3114, pruned_loss=0.08437, over 4267781.51 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:50:37,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1295472.0, ans=0.0 2023-06-22 18:50:40,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=15.0 2023-06-22 18:51:10,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1295532.0, ans=0.125 2023-06-22 18:52:03,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1295712.0, ans=0.2 2023-06-22 18:52:15,532 INFO [train.py:996] (1/4) Epoch 8, batch 2500, loss[loss=0.2183, simple_loss=0.2863, pruned_loss=0.07511, over 21828.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3109, pruned_loss=0.08392, over 4262688.26 frames. ], batch size: 107, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:52:19,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-22 18:52:30,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295832.0, ans=0.1 2023-06-22 18:52:34,801 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.409e+02 5.815e+02 8.522e+02 2.143e+03, threshold=1.163e+03, percent-clipped=4.0 2023-06-22 18:53:14,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1295892.0, ans=0.125 2023-06-22 18:53:42,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1296012.0, ans=0.125 2023-06-22 18:53:52,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1296012.0, ans=0.0 2023-06-22 18:53:57,214 INFO [train.py:996] (1/4) Epoch 8, batch 2550, loss[loss=0.1994, simple_loss=0.2619, pruned_loss=0.0684, over 21551.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3089, pruned_loss=0.08353, over 4259420.33 frames. 
], batch size: 263, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:53:57,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1296072.0, ans=0.95 2023-06-22 18:54:17,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1296132.0, ans=0.2 2023-06-22 18:54:27,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1296132.0, ans=0.2 2023-06-22 18:55:00,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1296252.0, ans=0.0 2023-06-22 18:55:01,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1296252.0, ans=0.125 2023-06-22 18:55:37,972 INFO [train.py:996] (1/4) Epoch 8, batch 2600, loss[loss=0.2432, simple_loss=0.3077, pruned_loss=0.08933, over 19926.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3106, pruned_loss=0.08574, over 4267831.93 frames. ], batch size: 702, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:55:47,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-22 18:55:57,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.832e+02 4.000e+02 4.999e+02 6.903e+02 1.017e+03, threshold=9.998e+02, percent-clipped=0.0 2023-06-22 18:56:32,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296492.0, ans=0.1 2023-06-22 18:56:32,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1296492.0, ans=0.125 2023-06-22 18:56:36,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-22 18:56:48,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296552.0, ans=0.1 2023-06-22 18:57:03,291 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:57:14,610 INFO [train.py:996] (1/4) Epoch 8, batch 2650, loss[loss=0.2743, simple_loss=0.3358, pruned_loss=0.1065, over 21859.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3137, pruned_loss=0.08772, over 4270788.17 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:57:16,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-22 18:58:07,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1296792.0, ans=0.0 2023-06-22 18:58:15,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1296792.0, ans=0.0 2023-06-22 18:58:41,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-22 18:58:55,300 INFO [train.py:996] (1/4) Epoch 8, batch 2700, loss[loss=0.2229, simple_loss=0.3029, pruned_loss=0.07143, over 21817.00 frames. 
], tot_loss[loss=0.2413, simple_loss=0.3112, pruned_loss=0.08571, over 4273850.54 frames. ], batch size: 371, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:58:58,737 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:59:14,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.148e+02 4.384e+02 5.256e+02 7.143e+02 1.333e+03, threshold=1.051e+03, percent-clipped=8.0 2023-06-22 19:00:37,150 INFO [train.py:996] (1/4) Epoch 8, batch 2750, loss[loss=0.2461, simple_loss=0.3321, pruned_loss=0.08005, over 21847.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3119, pruned_loss=0.08527, over 4273543.03 frames. ], batch size: 371, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 19:00:39,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1297272.0, ans=0.0 2023-06-22 19:01:19,219 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1297332.0, ans=0.2 2023-06-22 19:01:49,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1297452.0, ans=0.2 2023-06-22 19:02:14,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1297512.0, ans=0.05 2023-06-22 19:02:18,172 INFO [train.py:996] (1/4) Epoch 8, batch 2800, loss[loss=0.219, simple_loss=0.2837, pruned_loss=0.07712, over 21259.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3164, pruned_loss=0.08696, over 4266315.38 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:02:26,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1297572.0, ans=0.0 2023-06-22 19:02:53,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1297632.0, ans=0.125 2023-06-22 19:02:56,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.531e+02 5.975e+02 9.124e+02 1.757e+03, threshold=1.195e+03, percent-clipped=17.0 2023-06-22 19:03:04,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1297632.0, ans=15.0 2023-06-22 19:03:17,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1297692.0, ans=0.0 2023-06-22 19:03:24,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1297752.0, ans=0.0 2023-06-22 19:04:01,707 INFO [train.py:996] (1/4) Epoch 8, batch 2850, loss[loss=0.2251, simple_loss=0.3055, pruned_loss=0.07237, over 21797.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3162, pruned_loss=0.08731, over 4257733.24 frames. ], batch size: 371, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:04:21,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. 
limit=15.0 2023-06-22 19:05:06,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1298052.0, ans=0.125 2023-06-22 19:05:22,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1298112.0, ans=0.125 2023-06-22 19:05:34,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1298112.0, ans=0.0 2023-06-22 19:05:35,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1298172.0, ans=0.2 2023-06-22 19:05:36,711 INFO [train.py:996] (1/4) Epoch 8, batch 2900, loss[loss=0.2437, simple_loss=0.304, pruned_loss=0.09173, over 21694.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3126, pruned_loss=0.08651, over 4265777.37 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:06:07,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1298232.0, ans=0.125 2023-06-22 19:06:12,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 4.241e+02 6.077e+02 8.455e+02 1.821e+03, threshold=1.215e+03, percent-clipped=6.0 2023-06-22 19:06:16,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1298232.0, ans=0.1 2023-06-22 19:06:52,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1298352.0, ans=0.2 2023-06-22 19:06:54,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1298352.0, ans=0.125 2023-06-22 19:07:15,534 INFO [train.py:996] (1/4) Epoch 8, batch 2950, loss[loss=0.2525, simple_loss=0.3513, pruned_loss=0.07685, over 20817.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3164, pruned_loss=0.08671, over 4280533.16 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:07:51,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1298532.0, ans=0.125 2023-06-22 19:08:01,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.62 vs. limit=15.0 2023-06-22 19:08:56,503 INFO [train.py:996] (1/4) Epoch 8, batch 3000, loss[loss=0.2683, simple_loss=0.339, pruned_loss=0.09877, over 21913.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3213, pruned_loss=0.08745, over 4286113.26 frames. ], batch size: 316, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:08:56,504 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 19:09:17,901 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2518, simple_loss=0.3464, pruned_loss=0.0786, over 1796401.00 frames. 
2023-06-22 19:09:17,902 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 19:09:40,432 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.528e+02 5.625e+02 8.163e+02 1.642e+03, threshold=1.125e+03, percent-clipped=6.0 2023-06-22 19:09:45,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1298832.0, ans=0.125 2023-06-22 19:10:19,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1298952.0, ans=0.125 2023-06-22 19:10:58,781 INFO [train.py:996] (1/4) Epoch 8, batch 3050, loss[loss=0.2096, simple_loss=0.2854, pruned_loss=0.06694, over 21653.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3194, pruned_loss=0.08477, over 4290696.91 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:10:59,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1299072.0, ans=0.125 2023-06-22 19:11:15,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-22 19:11:35,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299192.0, ans=0.1 2023-06-22 19:11:42,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1299192.0, ans=0.0 2023-06-22 19:12:13,033 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-22 19:12:34,315 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:12:38,572 INFO [train.py:996] (1/4) Epoch 8, batch 3100, loss[loss=0.2492, simple_loss=0.3428, pruned_loss=0.07773, over 21664.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3213, pruned_loss=0.08449, over 4292038.39 frames. ], batch size: 414, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:13:03,130 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:13:05,466 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 3.934e+02 5.608e+02 7.913e+02 1.726e+03, threshold=1.122e+03, percent-clipped=9.0 2023-06-22 19:14:18,791 INFO [train.py:996] (1/4) Epoch 8, batch 3150, loss[loss=0.2552, simple_loss=0.3313, pruned_loss=0.08949, over 21655.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3232, pruned_loss=0.08484, over 4285864.44 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:14:27,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=15.0 2023-06-22 19:14:37,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1299672.0, ans=0.2 2023-06-22 19:15:00,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1299792.0, ans=0.1 2023-06-22 19:15:51,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. 
limit=15.0 2023-06-22 19:16:06,918 INFO [train.py:996] (1/4) Epoch 8, batch 3200, loss[loss=0.2125, simple_loss=0.2873, pruned_loss=0.06884, over 21319.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3229, pruned_loss=0.0844, over 4284212.56 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:16:21,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-22 19:16:29,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 3.882e+02 4.322e+02 5.833e+02 1.816e+03, threshold=8.643e+02, percent-clipped=1.0 2023-06-22 19:16:30,904 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-22 19:17:05,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1300092.0, ans=0.125 2023-06-22 19:17:46,241 INFO [train.py:996] (1/4) Epoch 8, batch 3250, loss[loss=0.2669, simple_loss=0.3273, pruned_loss=0.1033, over 21202.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3243, pruned_loss=0.0863, over 4282384.59 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:18:39,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1300392.0, ans=0.0 2023-06-22 19:19:25,823 INFO [train.py:996] (1/4) Epoch 8, batch 3300, loss[loss=0.2298, simple_loss=0.3015, pruned_loss=0.07903, over 21169.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3192, pruned_loss=0.08559, over 4278810.89 frames. ], batch size: 159, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:19:33,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1300572.0, ans=0.5 2023-06-22 19:19:35,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1300572.0, ans=0.2 2023-06-22 19:19:48,439 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.529e+02 6.033e+02 9.657e+02 1.783e+03, threshold=1.207e+03, percent-clipped=28.0 2023-06-22 19:19:49,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1300632.0, ans=0.025 2023-06-22 19:21:01,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-22 19:21:02,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-22 19:21:04,791 INFO [train.py:996] (1/4) Epoch 8, batch 3350, loss[loss=0.2537, simple_loss=0.314, pruned_loss=0.09665, over 21360.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3202, pruned_loss=0.08611, over 4280061.67 frames. 
], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:21:08,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1300872.0, ans=0.2 2023-06-22 19:21:23,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1300932.0, ans=0.125 2023-06-22 19:22:00,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1300992.0, ans=0.0 2023-06-22 19:22:35,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1301112.0, ans=0.0 2023-06-22 19:22:43,292 INFO [train.py:996] (1/4) Epoch 8, batch 3400, loss[loss=0.2315, simple_loss=0.3205, pruned_loss=0.07121, over 21711.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3201, pruned_loss=0.08641, over 4283341.71 frames. ], batch size: 414, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:22:50,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1301172.0, ans=0.125 2023-06-22 19:22:54,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-22 19:23:16,078 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 4.257e+02 5.465e+02 6.871e+02 1.586e+03, threshold=1.093e+03, percent-clipped=5.0 2023-06-22 19:23:43,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1301292.0, ans=0.125 2023-06-22 19:24:24,348 INFO [train.py:996] (1/4) Epoch 8, batch 3450, loss[loss=0.3186, simple_loss=0.3919, pruned_loss=0.1227, over 21827.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3174, pruned_loss=0.0862, over 4283177.85 frames. ], batch size: 317, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:24:41,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1301472.0, ans=0.125 2023-06-22 19:25:09,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1301532.0, ans=0.0 2023-06-22 19:25:28,587 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=22.5 2023-06-22 19:26:09,215 INFO [train.py:996] (1/4) Epoch 8, batch 3500, loss[loss=0.2175, simple_loss=0.312, pruned_loss=0.06153, over 21431.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3262, pruned_loss=0.08997, over 4284355.27 frames. 
], batch size: 211, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:26:27,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1301772.0, ans=0.125 2023-06-22 19:26:34,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1301832.0, ans=0.2 2023-06-22 19:26:36,796 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.220e+02 4.832e+02 6.635e+02 8.517e+02 1.814e+03, threshold=1.327e+03, percent-clipped=16.0 2023-06-22 19:26:44,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1301892.0, ans=0.125 2023-06-22 19:26:45,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-22 19:26:57,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1301892.0, ans=0.1 2023-06-22 19:27:42,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-22 19:27:42,650 INFO [train.py:996] (1/4) Epoch 8, batch 3550, loss[loss=0.1942, simple_loss=0.2678, pruned_loss=0.06032, over 19908.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3292, pruned_loss=0.09107, over 4280084.15 frames. ], batch size: 703, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:28:04,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1302132.0, ans=0.125 2023-06-22 19:28:11,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1302132.0, ans=0.0 2023-06-22 19:28:53,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1302252.0, ans=0.2 2023-06-22 19:29:21,144 INFO [train.py:996] (1/4) Epoch 8, batch 3600, loss[loss=0.2411, simple_loss=0.3084, pruned_loss=0.08688, over 21265.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.09093, over 4275789.07 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:29:48,395 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.388e+02 6.270e+02 8.797e+02 1.377e+03, threshold=1.254e+03, percent-clipped=1.0 2023-06-22 19:30:05,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1302492.0, ans=0.2 2023-06-22 19:30:11,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1302492.0, ans=0.0 2023-06-22 19:30:36,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1302612.0, ans=0.0 2023-06-22 19:30:37,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1302612.0, ans=0.2 2023-06-22 19:30:59,821 INFO [train.py:996] (1/4) Epoch 8, batch 3650, loss[loss=0.2491, simple_loss=0.3086, pruned_loss=0.09478, over 21336.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.324, pruned_loss=0.09092, over 4278044.92 frames. 
], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:31:03,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-22 19:31:29,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302732.0, ans=0.125 2023-06-22 19:31:54,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-22 19:32:06,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1302852.0, ans=0.125 2023-06-22 19:32:37,366 INFO [train.py:996] (1/4) Epoch 8, batch 3700, loss[loss=0.2692, simple_loss=0.3382, pruned_loss=0.1001, over 21609.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3228, pruned_loss=0.09036, over 4273963.71 frames. ], batch size: 471, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:33:05,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.107e+02 5.215e+02 7.517e+02 1.439e+03, threshold=1.043e+03, percent-clipped=3.0 2023-06-22 19:33:19,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1303092.0, ans=0.125 2023-06-22 19:33:28,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.36 vs. limit=12.0 2023-06-22 19:34:11,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1303212.0, ans=0.125 2023-06-22 19:34:16,537 INFO [train.py:996] (1/4) Epoch 8, batch 3750, loss[loss=0.1841, simple_loss=0.2664, pruned_loss=0.05094, over 21629.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3215, pruned_loss=0.089, over 4278347.52 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:35:01,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1303392.0, ans=0.0 2023-06-22 19:35:26,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1303452.0, ans=0.125 2023-06-22 19:35:49,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1303512.0, ans=0.125 2023-06-22 19:35:51,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1303512.0, ans=0.125 2023-06-22 19:35:56,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1303512.0, ans=0.0 2023-06-22 19:36:00,331 INFO [train.py:996] (1/4) Epoch 8, batch 3800, loss[loss=0.2872, simple_loss=0.3943, pruned_loss=0.09004, over 19870.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3185, pruned_loss=0.08623, over 4275323.14 frames. 
], batch size: 702, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:36:19,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1303632.0, ans=0.125 2023-06-22 19:36:25,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1303632.0, ans=0.125 2023-06-22 19:36:28,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.734e+02 6.125e+02 7.875e+02 1.546e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-22 19:36:45,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1303692.0, ans=10.0 2023-06-22 19:37:03,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.82 vs. limit=22.5 2023-06-22 19:37:18,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1303812.0, ans=0.0 2023-06-22 19:37:36,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1303872.0, ans=0.0 2023-06-22 19:37:37,344 INFO [train.py:996] (1/4) Epoch 8, batch 3850, loss[loss=0.2299, simple_loss=0.2919, pruned_loss=0.08394, over 21424.00 frames. ], tot_loss[loss=0.245, simple_loss=0.316, pruned_loss=0.08696, over 4276256.74 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:38:04,627 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:38:40,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1304052.0, ans=0.125 2023-06-22 19:39:01,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1304112.0, ans=0.125 2023-06-22 19:39:16,045 INFO [train.py:996] (1/4) Epoch 8, batch 3900, loss[loss=0.2438, simple_loss=0.3107, pruned_loss=0.08848, over 21883.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3128, pruned_loss=0.08655, over 4280749.36 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:39:45,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.973e+02 4.645e+02 5.915e+02 7.788e+02 1.896e+03, threshold=1.183e+03, percent-clipped=6.0 2023-06-22 19:40:01,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-22 19:40:48,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-22 19:40:56,655 INFO [train.py:996] (1/4) Epoch 8, batch 3950, loss[loss=0.2034, simple_loss=0.287, pruned_loss=0.0599, over 21638.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3146, pruned_loss=0.08626, over 4284156.74 frames. ], batch size: 263, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:41:23,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. limit=6.0 2023-06-22 19:42:36,369 INFO [train.py:996] (1/4) Epoch 8, batch 4000, loss[loss=0.2347, simple_loss=0.2948, pruned_loss=0.08728, over 21442.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3076, pruned_loss=0.08229, over 4284309.06 frames. 
], batch size: 389, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:43:04,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1304832.0, ans=0.0 2023-06-22 19:43:05,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.102e+02 5.710e+02 7.605e+02 1.219e+03, threshold=1.142e+03, percent-clipped=1.0 2023-06-22 19:44:21,123 INFO [train.py:996] (1/4) Epoch 8, batch 4050, loss[loss=0.2331, simple_loss=0.3053, pruned_loss=0.08046, over 21387.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3081, pruned_loss=0.08101, over 4278681.89 frames. ], batch size: 194, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:44:26,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1305072.0, ans=0.125 2023-06-22 19:45:04,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1305192.0, ans=0.125 2023-06-22 19:45:11,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1305252.0, ans=0.125 2023-06-22 19:45:24,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.14 vs. limit=15.0 2023-06-22 19:45:41,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1305312.0, ans=0.125 2023-06-22 19:46:00,758 INFO [train.py:996] (1/4) Epoch 8, batch 4100, loss[loss=0.1982, simple_loss=0.2791, pruned_loss=0.05868, over 21251.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3076, pruned_loss=0.08137, over 4280464.61 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:46:06,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1305372.0, ans=0.2 2023-06-22 19:46:15,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305432.0, ans=0.125 2023-06-22 19:46:26,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.586e+02 4.759e+02 6.058e+02 1.628e+03, threshold=9.517e+02, percent-clipped=6.0 2023-06-22 19:47:37,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305612.0, ans=0.1 2023-06-22 19:47:40,280 INFO [train.py:996] (1/4) Epoch 8, batch 4150, loss[loss=0.1871, simple_loss=0.2731, pruned_loss=0.05059, over 21463.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3078, pruned_loss=0.07818, over 4281408.47 frames. ], batch size: 195, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:48:37,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-22 19:48:42,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-22 19:48:58,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1305852.0, ans=0.2 2023-06-22 19:49:23,603 INFO [train.py:996] (1/4) Epoch 8, batch 4200, loss[loss=0.275, simple_loss=0.3802, pruned_loss=0.0849, over 21558.00 frames. 
], tot_loss[loss=0.233, simple_loss=0.309, pruned_loss=0.07845, over 4276784.21 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:49:56,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1306032.0, ans=0.125 2023-06-22 19:49:56,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1306032.0, ans=0.125 2023-06-22 19:49:57,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.597e+02 4.423e+02 6.282e+02 9.309e+02 2.210e+03, threshold=1.256e+03, percent-clipped=22.0 2023-06-22 19:51:06,578 INFO [train.py:996] (1/4) Epoch 8, batch 4250, loss[loss=0.2805, simple_loss=0.3498, pruned_loss=0.1056, over 21746.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3151, pruned_loss=0.07923, over 4278709.56 frames. ], batch size: 247, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:51:49,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1306392.0, ans=0.125 2023-06-22 19:52:13,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1306392.0, ans=0.025 2023-06-22 19:52:27,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1306452.0, ans=0.125 2023-06-22 19:52:35,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1306512.0, ans=0.0 2023-06-22 19:52:55,195 INFO [train.py:996] (1/4) Epoch 8, batch 4300, loss[loss=0.2358, simple_loss=0.3259, pruned_loss=0.07288, over 21828.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.322, pruned_loss=0.08209, over 4280611.95 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:53:19,243 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:53:38,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.490e+02 6.473e+02 1.024e+03 2.368e+03, threshold=1.295e+03, percent-clipped=12.0 2023-06-22 19:53:45,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1306692.0, ans=0.0 2023-06-22 19:54:35,544 INFO [train.py:996] (1/4) Epoch 8, batch 4350, loss[loss=0.2724, simple_loss=0.3379, pruned_loss=0.1034, over 21327.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3232, pruned_loss=0.08253, over 4278717.84 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:55:15,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1306932.0, ans=0.2 2023-06-22 19:55:40,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307052.0, ans=0.1 2023-06-22 19:56:08,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1307112.0, ans=0.05 2023-06-22 19:56:15,891 INFO [train.py:996] (1/4) Epoch 8, batch 4400, loss[loss=0.2446, simple_loss=0.3365, pruned_loss=0.07635, over 21724.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3173, pruned_loss=0.08236, over 4273114.65 frames. 
], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:56:35,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-06-22 19:56:53,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.391e+02 5.978e+02 7.745e+02 1.639e+03, threshold=1.196e+03, percent-clipped=7.0 2023-06-22 19:57:09,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-22 19:57:28,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1307352.0, ans=0.0 2023-06-22 19:57:32,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307352.0, ans=0.1 2023-06-22 19:58:02,010 INFO [train.py:996] (1/4) Epoch 8, batch 4450, loss[loss=0.2929, simple_loss=0.3841, pruned_loss=0.1009, over 21733.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3261, pruned_loss=0.08425, over 4276237.19 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:58:13,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1307472.0, ans=0.0 2023-06-22 19:58:13,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1307472.0, ans=0.125 2023-06-22 19:58:22,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1307532.0, ans=0.0 2023-06-22 19:59:48,286 INFO [train.py:996] (1/4) Epoch 8, batch 4500, loss[loss=0.2184, simple_loss=0.3208, pruned_loss=0.05793, over 20856.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3268, pruned_loss=0.08604, over 4281848.44 frames. ], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:59:48,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1307772.0, ans=0.5 2023-06-22 20:00:14,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-22 20:00:14,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.895e+02 4.151e+02 5.436e+02 7.450e+02 1.876e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-22 20:00:38,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1307892.0, ans=0.125 2023-06-22 20:01:28,225 INFO [train.py:996] (1/4) Epoch 8, batch 4550, loss[loss=0.3209, simple_loss=0.3878, pruned_loss=0.1271, over 21818.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3272, pruned_loss=0.0856, over 4284637.38 frames. 
], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:01:33,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1308072.0, ans=0.125 2023-06-22 20:02:14,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308192.0, ans=0.1 2023-06-22 20:02:35,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1308252.0, ans=0.125 2023-06-22 20:02:53,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1308312.0, ans=0.07 2023-06-22 20:03:08,601 INFO [train.py:996] (1/4) Epoch 8, batch 4600, loss[loss=0.241, simple_loss=0.3206, pruned_loss=0.08071, over 21485.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.331, pruned_loss=0.08813, over 4280080.64 frames. ], batch size: 131, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:03:40,833 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.710e+02 4.165e+02 5.279e+02 6.740e+02 1.716e+03, threshold=1.056e+03, percent-clipped=6.0 2023-06-22 20:03:42,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1308432.0, ans=0.125 2023-06-22 20:03:50,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1308492.0, ans=0.125 2023-06-22 20:03:53,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1308492.0, ans=0.125 2023-06-22 20:03:58,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1308492.0, ans=0.0 2023-06-22 20:04:07,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1308552.0, ans=0.125 2023-06-22 20:04:25,486 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:04:34,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1308612.0, ans=0.0 2023-06-22 20:04:34,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1308612.0, ans=0.125 2023-06-22 20:04:47,311 INFO [train.py:996] (1/4) Epoch 8, batch 4650, loss[loss=0.2551, simple_loss=0.316, pruned_loss=0.09708, over 21755.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3261, pruned_loss=0.08645, over 4275149.09 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:05:05,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1308732.0, ans=0.1 2023-06-22 20:06:22,490 INFO [train.py:996] (1/4) Epoch 8, batch 4700, loss[loss=0.1985, simple_loss=0.2619, pruned_loss=0.06752, over 21324.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3155, pruned_loss=0.08294, over 4275016.13 frames. 
], batch size: 131, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:06:43,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1309032.0, ans=0.125 2023-06-22 20:06:52,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.92 vs. limit=22.5 2023-06-22 20:06:54,234 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.523e+02 4.183e+02 5.885e+02 1.412e+03, threshold=8.365e+02, percent-clipped=3.0 2023-06-22 20:06:57,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309092.0, ans=0.1 2023-06-22 20:07:20,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1309152.0, ans=0.125 2023-06-22 20:07:45,251 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:08:02,204 INFO [train.py:996] (1/4) Epoch 8, batch 4750, loss[loss=0.218, simple_loss=0.2786, pruned_loss=0.07868, over 21537.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3093, pruned_loss=0.08301, over 4282634.37 frames. ], batch size: 212, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:08:17,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309332.0, ans=0.1 2023-06-22 20:08:35,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1309332.0, ans=0.125 2023-06-22 20:09:06,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-22 20:09:36,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1309512.0, ans=0.0 2023-06-22 20:09:42,412 INFO [train.py:996] (1/4) Epoch 8, batch 4800, loss[loss=0.2612, simple_loss=0.3729, pruned_loss=0.07476, over 21208.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3091, pruned_loss=0.08336, over 4285148.62 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:10:14,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.179e+02 5.198e+02 6.996e+02 1.429e+03, threshold=1.040e+03, percent-clipped=10.0 2023-06-22 20:10:18,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1309692.0, ans=0.0 2023-06-22 20:10:37,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-22 20:10:54,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1309752.0, ans=0.0 2023-06-22 20:11:13,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1309812.0, ans=0.125 2023-06-22 20:11:21,065 INFO [train.py:996] (1/4) Epoch 8, batch 4850, loss[loss=0.211, simple_loss=0.2871, pruned_loss=0.06743, over 20326.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3098, pruned_loss=0.08299, over 4289293.81 frames. 
], batch size: 703, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:11:23,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1309872.0, ans=0.125 2023-06-22 20:11:48,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1309932.0, ans=0.125 2023-06-22 20:11:58,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.11 vs. limit=22.5 2023-06-22 20:12:14,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1309992.0, ans=0.0 2023-06-22 20:12:31,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1310052.0, ans=0.125 2023-06-22 20:13:00,622 INFO [train.py:996] (1/4) Epoch 8, batch 4900, loss[loss=0.2581, simple_loss=0.3474, pruned_loss=0.08446, over 21716.00 frames. ], tot_loss[loss=0.24, simple_loss=0.312, pruned_loss=0.08402, over 4285445.35 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:13:22,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-22 20:13:29,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1310232.0, ans=0.0 2023-06-22 20:13:30,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1310232.0, ans=0.1 2023-06-22 20:13:32,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 3.969e+02 5.003e+02 6.919e+02 1.603e+03, threshold=1.001e+03, percent-clipped=6.0 2023-06-22 20:13:49,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1310292.0, ans=0.125 2023-06-22 20:14:32,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1310412.0, ans=0.125 2023-06-22 20:14:40,889 INFO [train.py:996] (1/4) Epoch 8, batch 4950, loss[loss=0.2087, simple_loss=0.3031, pruned_loss=0.05711, over 19906.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3163, pruned_loss=0.08236, over 4288072.42 frames. ], batch size: 703, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:14:41,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1310472.0, ans=0.125 2023-06-22 20:14:52,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1310472.0, ans=0.0 2023-06-22 20:15:58,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1310652.0, ans=0.0 2023-06-22 20:16:25,023 INFO [train.py:996] (1/4) Epoch 8, batch 5000, loss[loss=0.2732, simple_loss=0.3309, pruned_loss=0.1078, over 21700.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3155, pruned_loss=0.07943, over 4281047.86 frames. 
], batch size: 507, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:16:51,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.498e+02 3.608e+02 4.622e+02 7.271e+02 1.664e+03, threshold=9.243e+02, percent-clipped=6.0 2023-06-22 20:17:00,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1310892.0, ans=0.125 2023-06-22 20:17:30,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1310952.0, ans=0.0 2023-06-22 20:17:40,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.70 vs. limit=6.0 2023-06-22 20:17:53,512 INFO [train.py:996] (1/4) Epoch 8, batch 5050, loss[loss=0.2638, simple_loss=0.3825, pruned_loss=0.07257, over 20844.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3163, pruned_loss=0.08208, over 4290577.66 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:19:19,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-22 20:19:21,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1311312.0, ans=0.125 2023-06-22 20:19:28,966 INFO [train.py:996] (1/4) Epoch 8, batch 5100, loss[loss=0.2707, simple_loss=0.3582, pruned_loss=0.09158, over 20067.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3142, pruned_loss=0.08255, over 4287253.54 frames. ], batch size: 703, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:20:02,314 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 3.868e+02 4.783e+02 6.520e+02 1.021e+03, threshold=9.567e+02, percent-clipped=2.0 2023-06-22 20:20:09,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1311492.0, ans=0.125 2023-06-22 20:20:57,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-22 20:21:08,419 INFO [train.py:996] (1/4) Epoch 8, batch 5150, loss[loss=0.2752, simple_loss=0.3228, pruned_loss=0.1138, over 21845.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3121, pruned_loss=0.08362, over 4295272.64 frames. ], batch size: 508, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:21:33,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1311732.0, ans=0.125 2023-06-22 20:21:42,891 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:21:57,061 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:22:12,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=12.0 2023-06-22 20:22:40,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1311912.0, ans=0.125 2023-06-22 20:22:52,484 INFO [train.py:996] (1/4) Epoch 8, batch 5200, loss[loss=0.2358, simple_loss=0.3342, pruned_loss=0.06872, over 21646.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3126, pruned_loss=0.08334, over 4293319.01 frames. 
], batch size: 263, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:23:26,974 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.134e+02 4.423e+02 5.674e+02 8.806e+02 1.736e+03, threshold=1.135e+03, percent-clipped=18.0 2023-06-22 20:23:30,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1312092.0, ans=0.0 2023-06-22 20:23:32,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1312092.0, ans=0.125 2023-06-22 20:24:02,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312152.0, ans=0.1 2023-06-22 20:24:30,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-22 20:24:32,962 INFO [train.py:996] (1/4) Epoch 8, batch 5250, loss[loss=0.2563, simple_loss=0.3446, pruned_loss=0.084, over 21637.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3148, pruned_loss=0.08104, over 4290630.90 frames. ], batch size: 389, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:26:08,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1312512.0, ans=0.0 2023-06-22 20:26:11,381 INFO [train.py:996] (1/4) Epoch 8, batch 5300, loss[loss=0.2087, simple_loss=0.2815, pruned_loss=0.06791, over 21687.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3135, pruned_loss=0.08129, over 4297777.05 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:26:44,652 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.970e+02 3.721e+02 4.525e+02 6.404e+02 1.262e+03, threshold=9.050e+02, percent-clipped=2.0 2023-06-22 20:26:50,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-22 20:26:58,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1312692.0, ans=0.125 2023-06-22 20:26:58,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312692.0, ans=0.125 2023-06-22 20:27:16,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1312752.0, ans=0.07 2023-06-22 20:27:27,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1312812.0, ans=0.2 2023-06-22 20:27:36,517 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 20:27:49,130 INFO [train.py:996] (1/4) Epoch 8, batch 5350, loss[loss=0.2229, simple_loss=0.2927, pruned_loss=0.07656, over 21363.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3127, pruned_loss=0.083, over 4300896.08 frames. 
], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:27:51,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1312872.0, ans=0.125 2023-06-22 20:28:13,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312932.0, ans=0.1 2023-06-22 20:28:29,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1312992.0, ans=0.125 2023-06-22 20:29:21,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1313172.0, ans=0.2 2023-06-22 20:29:22,309 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-22 20:29:22,763 INFO [train.py:996] (1/4) Epoch 8, batch 5400, loss[loss=0.2318, simple_loss=0.2998, pruned_loss=0.0819, over 21583.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.312, pruned_loss=0.08428, over 4303550.41 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:29:33,112 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:29:58,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1313232.0, ans=0.2 2023-06-22 20:30:01,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.001e+02 4.337e+02 6.656e+02 9.891e+02 1.935e+03, threshold=1.331e+03, percent-clipped=29.0 2023-06-22 20:30:18,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1313292.0, ans=0.0 2023-06-22 20:30:20,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-22 20:30:26,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1313352.0, ans=0.0 2023-06-22 20:30:54,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1313412.0, ans=0.0 2023-06-22 20:30:58,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1313412.0, ans=0.1 2023-06-22 20:31:03,456 INFO [train.py:996] (1/4) Epoch 8, batch 5450, loss[loss=0.2266, simple_loss=0.3056, pruned_loss=0.0738, over 21817.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3143, pruned_loss=0.08263, over 4299464.70 frames. ], batch size: 124, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:31:05,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1313472.0, ans=0.0 2023-06-22 20:31:42,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1313532.0, ans=0.0 2023-06-22 20:32:02,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-22 20:32:49,378 INFO [train.py:996] (1/4) Epoch 8, batch 5500, loss[loss=0.2362, simple_loss=0.3232, pruned_loss=0.07459, over 21574.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3187, pruned_loss=0.07955, over 4297173.58 frames. 
], batch size: 230, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:33:07,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1313772.0, ans=0.2 2023-06-22 20:33:30,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1313892.0, ans=0.04949747468305833 2023-06-22 20:33:31,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.855e+02 4.320e+02 6.103e+02 1.036e+03 2.497e+03, threshold=1.221e+03, percent-clipped=15.0 2023-06-22 20:34:16,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1314012.0, ans=0.0 2023-06-22 20:34:40,461 INFO [train.py:996] (1/4) Epoch 8, batch 5550, loss[loss=0.2591, simple_loss=0.387, pruned_loss=0.06555, over 19834.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3192, pruned_loss=0.07763, over 4290289.40 frames. ], batch size: 703, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:34:42,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314072.0, ans=0.1 2023-06-22 20:34:54,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314072.0, ans=0.1 2023-06-22 20:34:57,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1314132.0, ans=0.0 2023-06-22 20:35:09,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1314132.0, ans=0.0 2023-06-22 20:35:20,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314192.0, ans=0.1 2023-06-22 20:35:45,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1314252.0, ans=0.125 2023-06-22 20:36:20,967 INFO [train.py:996] (1/4) Epoch 8, batch 5600, loss[loss=0.2067, simple_loss=0.2821, pruned_loss=0.06565, over 21164.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3155, pruned_loss=0.07453, over 4286323.18 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:36:34,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1314372.0, ans=0.0 2023-06-22 20:36:58,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 4.144e+02 6.431e+02 9.394e+02 1.823e+03, threshold=1.286e+03, percent-clipped=11.0 2023-06-22 20:37:26,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1314552.0, ans=0.125 2023-06-22 20:37:28,243 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:37:52,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1314612.0, ans=0.125 2023-06-22 20:37:55,151 INFO [train.py:996] (1/4) Epoch 8, batch 5650, loss[loss=0.2484, simple_loss=0.3155, pruned_loss=0.09063, over 21859.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3219, pruned_loss=0.07869, over 4284039.25 frames. 
], batch size: 124, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:38:24,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1314732.0, ans=0.1 2023-06-22 20:39:17,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1314912.0, ans=0.125 2023-06-22 20:39:34,675 INFO [train.py:996] (1/4) Epoch 8, batch 5700, loss[loss=0.2636, simple_loss=0.3503, pruned_loss=0.08847, over 21453.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3211, pruned_loss=0.07928, over 4284660.65 frames. ], batch size: 471, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:40:11,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1315092.0, ans=0.0 2023-06-22 20:40:12,310 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.552e+02 6.267e+02 8.726e+02 1.736e+03, threshold=1.253e+03, percent-clipped=4.0 2023-06-22 20:40:13,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.38 vs. limit=15.0 2023-06-22 20:40:23,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1315092.0, ans=0.0 2023-06-22 20:40:23,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1315092.0, ans=0.05 2023-06-22 20:40:56,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315152.0, ans=0.1 2023-06-22 20:41:06,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1315212.0, ans=0.125 2023-06-22 20:41:09,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1315212.0, ans=0.125 2023-06-22 20:41:19,548 INFO [train.py:996] (1/4) Epoch 8, batch 5750, loss[loss=0.1767, simple_loss=0.2581, pruned_loss=0.04766, over 21399.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3151, pruned_loss=0.0752, over 4284607.00 frames. ], batch size: 211, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:41:47,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1315332.0, ans=0.05 2023-06-22 20:42:59,396 INFO [train.py:996] (1/4) Epoch 8, batch 5800, loss[loss=0.2651, simple_loss=0.3587, pruned_loss=0.08575, over 21589.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3153, pruned_loss=0.07439, over 4285731.12 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:43:26,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=15.0 2023-06-22 20:43:42,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.530e+02 3.870e+02 5.356e+02 7.893e+02 2.349e+03, threshold=1.071e+03, percent-clipped=9.0 2023-06-22 20:44:07,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1315752.0, ans=0.125 2023-06-22 20:44:21,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1315812.0, ans=0.125 2023-06-22 20:44:21,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1315812.0, ans=0.125 2023-06-22 20:44:22,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1315812.0, ans=0.5 2023-06-22 20:44:35,119 INFO [train.py:996] (1/4) Epoch 8, batch 5850, loss[loss=0.2493, simple_loss=0.3176, pruned_loss=0.0905, over 20128.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3128, pruned_loss=0.07015, over 4276358.16 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:44:39,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-22 20:44:52,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1315872.0, ans=0.125 2023-06-22 20:45:17,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 20:45:21,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1315992.0, ans=0.0 2023-06-22 20:46:06,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1316172.0, ans=0.125 2023-06-22 20:46:07,391 INFO [train.py:996] (1/4) Epoch 8, batch 5900, loss[loss=0.1804, simple_loss=0.2563, pruned_loss=0.05225, over 21347.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3043, pruned_loss=0.06432, over 4281307.04 frames. ], batch size: 194, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:46:48,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.361e+02 4.544e+02 6.338e+02 1.644e+03, threshold=9.088e+02, percent-clipped=4.0 2023-06-22 20:47:36,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.20 vs. limit=15.0 2023-06-22 20:47:42,265 INFO [train.py:996] (1/4) Epoch 8, batch 5950, loss[loss=0.2243, simple_loss=0.2819, pruned_loss=0.08333, over 21459.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3036, pruned_loss=0.0682, over 4283053.35 frames. ], batch size: 177, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:47:51,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1316472.0, ans=0.125 2023-06-22 20:49:00,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=12.0 2023-06-22 20:49:03,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=15.0 2023-06-22 20:49:04,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1316712.0, ans=0.125 2023-06-22 20:49:19,236 INFO [train.py:996] (1/4) Epoch 8, batch 6000, loss[loss=0.2367, simple_loss=0.2912, pruned_loss=0.0911, over 21317.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3004, pruned_loss=0.07291, over 4279422.42 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:49:19,237 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 20:49:40,896 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2636, simple_loss=0.3606, pruned_loss=0.08334, over 1796401.00 frames. 2023-06-22 20:49:40,897 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 20:49:50,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1316772.0, ans=0.02 2023-06-22 20:49:58,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1316832.0, ans=0.125 2023-06-22 20:50:01,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1316832.0, ans=0.125 2023-06-22 20:50:11,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1316892.0, ans=0.2 2023-06-22 20:50:16,502 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.164e+02 5.189e+02 7.580e+02 1.356e+03, threshold=1.038e+03, percent-clipped=17.0 2023-06-22 20:51:14,226 INFO [train.py:996] (1/4) Epoch 8, batch 6050, loss[loss=0.2053, simple_loss=0.271, pruned_loss=0.06979, over 21598.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2944, pruned_loss=0.07344, over 4278838.24 frames. ], batch size: 415, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:51:27,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1317072.0, ans=0.0 2023-06-22 20:51:54,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-22 20:51:55,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1317192.0, ans=0.1 2023-06-22 20:51:56,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1317192.0, ans=0.125 2023-06-22 20:52:00,252 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:52:16,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-22 20:52:38,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.70 vs. limit=22.5 2023-06-22 20:52:51,312 INFO [train.py:996] (1/4) Epoch 8, batch 6100, loss[loss=0.1923, simple_loss=0.286, pruned_loss=0.04934, over 21742.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2957, pruned_loss=0.0722, over 4267004.92 frames. 
], batch size: 351, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:53:22,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-22 20:53:24,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1317432.0, ans=0.125 2023-06-22 20:53:29,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.450e+02 4.266e+02 5.544e+02 1.374e+03, threshold=8.532e+02, percent-clipped=4.0 2023-06-22 20:53:35,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1317492.0, ans=0.0 2023-06-22 20:54:21,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1317612.0, ans=0.2 2023-06-22 20:54:28,763 INFO [train.py:996] (1/4) Epoch 8, batch 6150, loss[loss=0.2463, simple_loss=0.3196, pruned_loss=0.08646, over 21528.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2987, pruned_loss=0.07522, over 4264018.55 frames. ], batch size: 389, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:54:37,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1317672.0, ans=0.125 2023-06-22 20:54:37,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1317672.0, ans=0.0 2023-06-22 20:54:55,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-22 20:55:14,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1317792.0, ans=0.0 2023-06-22 20:55:20,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1317852.0, ans=0.2 2023-06-22 20:55:28,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1317852.0, ans=0.2 2023-06-22 20:55:51,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1317912.0, ans=0.125 2023-06-22 20:56:06,768 INFO [train.py:996] (1/4) Epoch 8, batch 6200, loss[loss=0.2377, simple_loss=0.3626, pruned_loss=0.05641, over 20761.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3018, pruned_loss=0.0752, over 4261500.05 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:56:37,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1318032.0, ans=0.2 2023-06-22 20:56:44,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.357e+02 5.420e+02 8.092e+02 2.121e+03, threshold=1.084e+03, percent-clipped=22.0 2023-06-22 20:57:07,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1318152.0, ans=0.125 2023-06-22 20:57:33,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-22 20:57:47,416 INFO [train.py:996] (1/4) Epoch 8, batch 6250, loss[loss=0.2127, simple_loss=0.3065, pruned_loss=0.05945, over 21458.00 frames. 
], tot_loss[loss=0.2296, simple_loss=0.3087, pruned_loss=0.07527, over 4262607.55 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:58:10,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318332.0, ans=0.0 2023-06-22 20:58:16,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1318332.0, ans=0.125 2023-06-22 20:58:32,408 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:58:32,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-22 20:59:03,184 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:59:22,291 INFO [train.py:996] (1/4) Epoch 8, batch 6300, loss[loss=0.2408, simple_loss=0.3197, pruned_loss=0.08097, over 21758.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3122, pruned_loss=0.07371, over 4271441.43 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:59:22,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1318572.0, ans=0.0 2023-06-22 20:59:56,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-22 21:00:00,002 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.288e+02 6.355e+02 8.462e+02 1.476e+03, threshold=1.271e+03, percent-clipped=15.0 2023-06-22 21:00:52,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1318812.0, ans=0.125 2023-06-22 21:00:56,504 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-22 21:01:04,437 INFO [train.py:996] (1/4) Epoch 8, batch 6350, loss[loss=0.3026, simple_loss=0.3591, pruned_loss=0.1231, over 21808.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3148, pruned_loss=0.0789, over 4273297.74 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:01:22,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1318932.0, ans=0.0 2023-06-22 21:01:52,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-22 21:02:06,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-22 21:02:43,844 INFO [train.py:996] (1/4) Epoch 8, batch 6400, loss[loss=0.2462, simple_loss=0.3236, pruned_loss=0.08442, over 21712.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.32, pruned_loss=0.08293, over 4277437.48 frames. 
], batch size: 298, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:02:52,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1319172.0, ans=0.1 2023-06-22 21:03:15,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0 2023-06-22 21:03:31,912 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 4.404e+02 5.449e+02 7.418e+02 1.410e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-22 21:03:41,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1319292.0, ans=0.0 2023-06-22 21:03:41,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1319292.0, ans=0.0 2023-06-22 21:03:48,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1319352.0, ans=0.125 2023-06-22 21:04:22,004 INFO [train.py:996] (1/4) Epoch 8, batch 6450, loss[loss=0.2208, simple_loss=0.3016, pruned_loss=0.06994, over 21340.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3225, pruned_loss=0.08227, over 4275027.11 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:04:27,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319472.0, ans=0.1 2023-06-22 21:05:29,251 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:05:55,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1319712.0, ans=0.2 2023-06-22 21:05:59,790 INFO [train.py:996] (1/4) Epoch 8, batch 6500, loss[loss=0.2312, simple_loss=0.3365, pruned_loss=0.06298, over 21239.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3174, pruned_loss=0.08098, over 4271249.95 frames. ], batch size: 549, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:06:21,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1319772.0, ans=0.95 2023-06-22 21:06:48,030 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.971e+02 4.929e+02 6.597e+02 9.364e+02 1.745e+03, threshold=1.319e+03, percent-clipped=16.0 2023-06-22 21:07:09,348 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:07:14,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-22 21:07:15,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1319952.0, ans=0.0 2023-06-22 21:07:44,851 INFO [train.py:996] (1/4) Epoch 8, batch 6550, loss[loss=0.2242, simple_loss=0.2965, pruned_loss=0.07594, over 21677.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.317, pruned_loss=0.07982, over 4267760.13 frames. ], batch size: 263, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:07:51,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-22 21:07:56,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320072.0, ans=0.125 2023-06-22 21:08:00,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1320132.0, ans=0.07 2023-06-22 21:08:02,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=15.0 2023-06-22 21:08:03,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1320132.0, ans=0.0 2023-06-22 21:08:19,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1320192.0, ans=0.1 2023-06-22 21:08:31,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-22 21:09:17,845 INFO [train.py:996] (1/4) Epoch 8, batch 6600, loss[loss=0.2935, simple_loss=0.3555, pruned_loss=0.1157, over 21547.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3114, pruned_loss=0.0803, over 4258439.32 frames. ], batch size: 471, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:09:54,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1320492.0, ans=0.2 2023-06-22 21:09:57,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.721e+02 4.019e+02 5.750e+02 8.785e+02 1.668e+03, threshold=1.150e+03, percent-clipped=5.0 2023-06-22 21:10:31,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-22 21:10:55,809 INFO [train.py:996] (1/4) Epoch 8, batch 6650, loss[loss=0.2047, simple_loss=0.2775, pruned_loss=0.06591, over 21615.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3031, pruned_loss=0.0776, over 4257819.45 frames. ], batch size: 298, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:12:25,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.23 vs. limit=10.0 2023-06-22 21:12:29,170 INFO [train.py:996] (1/4) Epoch 8, batch 6700, loss[loss=0.195, simple_loss=0.2606, pruned_loss=0.06465, over 21568.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2969, pruned_loss=0.07733, over 4261498.00 frames. 
], batch size: 263, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:12:35,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1320972.0, ans=0.0 2023-06-22 21:12:37,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1320972.0, ans=0.0 2023-06-22 21:12:45,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1320972.0, ans=0.0 2023-06-22 21:13:03,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1321092.0, ans=0.125 2023-06-22 21:13:08,244 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.668e+02 3.822e+02 4.568e+02 6.367e+02 1.164e+03, threshold=9.137e+02, percent-clipped=1.0 2023-06-22 21:13:50,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-22 21:14:02,871 INFO [train.py:996] (1/4) Epoch 8, batch 6750, loss[loss=0.249, simple_loss=0.314, pruned_loss=0.092, over 21506.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2943, pruned_loss=0.07763, over 4253402.98 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:14:06,511 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:14:22,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1321332.0, ans=0.0 2023-06-22 21:14:38,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-22 21:15:30,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1321512.0, ans=0.1 2023-06-22 21:15:37,256 INFO [train.py:996] (1/4) Epoch 8, batch 6800, loss[loss=0.2602, simple_loss=0.3299, pruned_loss=0.09528, over 21877.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2964, pruned_loss=0.0797, over 4255995.94 frames. ], batch size: 107, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:16:03,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1321632.0, ans=0.5 2023-06-22 21:16:16,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.656e+02 6.234e+02 8.844e+02 1.935e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-22 21:16:47,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1321812.0, ans=0.5 2023-06-22 21:16:53,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1321812.0, ans=0.015 2023-06-22 21:17:04,209 INFO [train.py:996] (1/4) Epoch 8, batch 6850, loss[loss=0.3083, simple_loss=0.4173, pruned_loss=0.09962, over 19947.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2969, pruned_loss=0.08083, over 4261416.12 frames. ], batch size: 702, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:17:50,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. 
limit=15.0 2023-06-22 21:18:07,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1322052.0, ans=0.0 2023-06-22 21:18:48,721 INFO [train.py:996] (1/4) Epoch 8, batch 6900, loss[loss=0.2149, simple_loss=0.2848, pruned_loss=0.0725, over 21856.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2987, pruned_loss=0.08173, over 4268293.55 frames. ], batch size: 107, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:19:05,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1322172.0, ans=0.04949747468305833 2023-06-22 21:19:06,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1322172.0, ans=0.0 2023-06-22 21:19:34,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 4.516e+02 6.241e+02 9.290e+02 1.863e+03, threshold=1.248e+03, percent-clipped=14.0 2023-06-22 21:19:58,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1322352.0, ans=0.05 2023-06-22 21:20:19,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1322412.0, ans=0.125 2023-06-22 21:20:33,637 INFO [train.py:996] (1/4) Epoch 8, batch 6950, loss[loss=0.2715, simple_loss=0.347, pruned_loss=0.09804, over 21552.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3014, pruned_loss=0.07785, over 4268690.84 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:20:53,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1322532.0, ans=0.125 2023-06-22 21:21:11,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322592.0, ans=0.1 2023-06-22 21:22:09,210 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-22 21:22:12,758 INFO [train.py:996] (1/4) Epoch 8, batch 7000, loss[loss=0.2472, simple_loss=0.3059, pruned_loss=0.09424, over 21339.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3066, pruned_loss=0.08054, over 4254586.62 frames. ], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:22:45,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1322832.0, ans=0.125 2023-06-22 21:22:54,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.125e+02 4.702e+02 7.118e+02 1.401e+03, threshold=9.403e+02, percent-clipped=1.0 2023-06-22 21:23:41,085 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:23:51,500 INFO [train.py:996] (1/4) Epoch 8, batch 7050, loss[loss=0.2026, simple_loss=0.2996, pruned_loss=0.05283, over 21609.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3032, pruned_loss=0.07967, over 4247482.35 frames. 
], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:24:19,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323132.0, ans=0.1 2023-06-22 21:24:22,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1323132.0, ans=0.05 2023-06-22 21:25:00,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-22 21:25:24,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.85 vs. limit=15.0 2023-06-22 21:25:31,405 INFO [train.py:996] (1/4) Epoch 8, batch 7100, loss[loss=0.2763, simple_loss=0.3431, pruned_loss=0.1047, over 21660.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3064, pruned_loss=0.08028, over 4259651.48 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:25:55,411 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:26:12,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.059e+02 5.159e+02 6.413e+02 1.166e+03, threshold=1.032e+03, percent-clipped=5.0 2023-06-22 21:26:19,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1323492.0, ans=0.0 2023-06-22 21:26:23,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1323492.0, ans=0.125 2023-06-22 21:26:27,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1323552.0, ans=0.02 2023-06-22 21:26:28,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323552.0, ans=0.1 2023-06-22 21:27:05,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-22 21:27:15,340 INFO [train.py:996] (1/4) Epoch 8, batch 7150, loss[loss=0.2419, simple_loss=0.3132, pruned_loss=0.08529, over 21760.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3056, pruned_loss=0.07893, over 4259654.26 frames. ], batch size: 332, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:27:20,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1323672.0, ans=0.125 2023-06-22 21:27:27,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1323672.0, ans=0.125 2023-06-22 21:27:51,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1323792.0, ans=0.125 2023-06-22 21:28:54,638 INFO [train.py:996] (1/4) Epoch 8, batch 7200, loss[loss=0.24, simple_loss=0.3594, pruned_loss=0.06028, over 19768.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3082, pruned_loss=0.08144, over 4262879.21 frames. 
], batch size: 703, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:29:07,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1323972.0, ans=0.125 2023-06-22 21:29:17,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1324032.0, ans=0.09899494936611666 2023-06-22 21:29:35,436 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.797e+02 4.643e+02 6.367e+02 8.701e+02 1.653e+03, threshold=1.273e+03, percent-clipped=12.0 2023-06-22 21:30:20,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324212.0, ans=0.1 2023-06-22 21:30:22,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1324212.0, ans=0.125 2023-06-22 21:30:27,776 INFO [train.py:996] (1/4) Epoch 8, batch 7250, loss[loss=0.2326, simple_loss=0.2909, pruned_loss=0.08713, over 21528.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3052, pruned_loss=0.08178, over 4268632.55 frames. ], batch size: 132, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:30:40,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1324272.0, ans=0.2 2023-06-22 21:30:46,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1324332.0, ans=0.125 2023-06-22 21:30:53,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-22 21:32:03,427 INFO [train.py:996] (1/4) Epoch 8, batch 7300, loss[loss=0.2155, simple_loss=0.2738, pruned_loss=0.07864, over 21861.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2988, pruned_loss=0.0802, over 4265677.92 frames. ], batch size: 373, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:32:14,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324572.0, ans=0.1 2023-06-22 21:32:47,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1324692.0, ans=0.0 2023-06-22 21:32:49,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.041e+02 5.063e+02 7.441e+02 1.428e+03, threshold=1.013e+03, percent-clipped=2.0 2023-06-22 21:32:59,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-22 21:33:43,004 INFO [train.py:996] (1/4) Epoch 8, batch 7350, loss[loss=0.2428, simple_loss=0.29, pruned_loss=0.09779, over 21562.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.2968, pruned_loss=0.08102, over 4260777.94 frames. 
], batch size: 442, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:33:53,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1324872.0, ans=0.125 2023-06-22 21:33:53,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324872.0, ans=0.1 2023-06-22 21:34:28,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1324992.0, ans=0.0 2023-06-22 21:35:24,118 INFO [train.py:996] (1/4) Epoch 8, batch 7400, loss[loss=0.2347, simple_loss=0.3113, pruned_loss=0.07901, over 21631.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3031, pruned_loss=0.08322, over 4266695.56 frames. ], batch size: 263, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:36:10,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 4.086e+02 4.943e+02 6.567e+02 1.302e+03, threshold=9.886e+02, percent-clipped=5.0 2023-06-22 21:36:58,176 INFO [train.py:996] (1/4) Epoch 8, batch 7450, loss[loss=0.2041, simple_loss=0.2622, pruned_loss=0.07298, over 21553.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3, pruned_loss=0.08101, over 4263318.25 frames. ], batch size: 263, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:37:39,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-22 21:37:50,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1325592.0, ans=0.2 2023-06-22 21:38:05,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-22 21:38:15,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1325652.0, ans=0.2 2023-06-22 21:38:24,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1325712.0, ans=0.125 2023-06-22 21:38:26,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325712.0, ans=0.1 2023-06-22 21:38:38,919 INFO [train.py:996] (1/4) Epoch 8, batch 7500, loss[loss=0.2756, simple_loss=0.3659, pruned_loss=0.09268, over 21614.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3059, pruned_loss=0.08246, over 4269063.27 frames. 
], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:39:31,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1325892.0, ans=0.0 2023-06-22 21:39:33,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1325892.0, ans=0.125 2023-06-22 21:39:35,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.495e+02 6.774e+02 8.927e+02 1.705e+03, threshold=1.355e+03, percent-clipped=18.0 2023-06-22 21:39:39,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1325892.0, ans=0.125 2023-06-22 21:39:49,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1325952.0, ans=0.2 2023-06-22 21:40:09,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1326012.0, ans=0.2 2023-06-22 21:40:24,014 INFO [train.py:996] (1/4) Epoch 8, batch 7550, loss[loss=0.2249, simple_loss=0.3288, pruned_loss=0.06048, over 21677.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3112, pruned_loss=0.08103, over 4277563.58 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:40:59,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1326132.0, ans=0.125 2023-06-22 21:41:32,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1326252.0, ans=0.0 2023-06-22 21:42:00,790 INFO [train.py:996] (1/4) Epoch 8, batch 7600, loss[loss=0.2072, simple_loss=0.2695, pruned_loss=0.0724, over 21176.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3133, pruned_loss=0.07993, over 4274929.51 frames. ], batch size: 608, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:42:50,590 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.183e+02 4.607e+02 6.581e+02 9.726e+02 1.530e+03, threshold=1.316e+03, percent-clipped=7.0 2023-06-22 21:43:24,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1326612.0, ans=0.0 2023-06-22 21:43:38,633 INFO [train.py:996] (1/4) Epoch 8, batch 7650, loss[loss=0.2336, simple_loss=0.2973, pruned_loss=0.08491, over 21938.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3125, pruned_loss=0.08087, over 4278654.88 frames. 
], batch size: 316, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:44:05,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1326732.0, ans=0.125 2023-06-22 21:44:20,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1326732.0, ans=0.2 2023-06-22 21:44:34,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1326792.0, ans=0.125 2023-06-22 21:44:43,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1326852.0, ans=0.125 2023-06-22 21:44:48,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1326852.0, ans=0.0 2023-06-22 21:45:18,115 INFO [train.py:996] (1/4) Epoch 8, batch 7700, loss[loss=0.278, simple_loss=0.3465, pruned_loss=0.1048, over 21572.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3143, pruned_loss=0.08407, over 4277947.77 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:45:30,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.02 vs. limit=22.5 2023-06-22 21:45:40,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1327032.0, ans=0.125 2023-06-22 21:45:47,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1327032.0, ans=0.0 2023-06-22 21:45:50,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327032.0, ans=0.1 2023-06-22 21:45:55,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1327032.0, ans=0.04949747468305833 2023-06-22 21:46:04,660 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.847e+02 4.781e+02 6.112e+02 1.345e+03, threshold=9.563e+02, percent-clipped=1.0 2023-06-22 21:46:58,552 INFO [train.py:996] (1/4) Epoch 8, batch 7750, loss[loss=0.172, simple_loss=0.2185, pruned_loss=0.06276, over 17271.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3212, pruned_loss=0.08558, over 4276498.05 frames. ], batch size: 62, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:47:22,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1327332.0, ans=0.125 2023-06-22 21:47:45,597 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-22 21:48:42,274 INFO [train.py:996] (1/4) Epoch 8, batch 7800, loss[loss=0.2296, simple_loss=0.2762, pruned_loss=0.09148, over 21194.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3245, pruned_loss=0.0863, over 4278141.00 frames. 
], batch size: 143, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:48:47,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1327572.0, ans=0.125 2023-06-22 21:49:19,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.676e+02 6.316e+02 9.000e+02 2.015e+03, threshold=1.263e+03, percent-clipped=20.0 2023-06-22 21:50:00,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1327812.0, ans=0.125 2023-06-22 21:50:15,001 INFO [train.py:996] (1/4) Epoch 8, batch 7850, loss[loss=0.1978, simple_loss=0.2588, pruned_loss=0.06839, over 21453.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3173, pruned_loss=0.08456, over 4279927.39 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:51:00,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1327992.0, ans=0.2 2023-06-22 21:51:14,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1328052.0, ans=0.1 2023-06-22 21:51:17,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1328052.0, ans=0.0 2023-06-22 21:51:36,245 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:52:00,783 INFO [train.py:996] (1/4) Epoch 8, batch 7900, loss[loss=0.2675, simple_loss=0.3487, pruned_loss=0.09317, over 21732.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3116, pruned_loss=0.08344, over 4280173.42 frames. ], batch size: 351, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:52:44,750 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.830e+02 4.891e+02 7.312e+02 1.897e+03, threshold=9.781e+02, percent-clipped=5.0 2023-06-22 21:53:41,632 INFO [train.py:996] (1/4) Epoch 8, batch 7950, loss[loss=0.2294, simple_loss=0.3095, pruned_loss=0.07466, over 21757.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3159, pruned_loss=0.08279, over 4277295.99 frames. ], batch size: 298, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:55:24,247 INFO [train.py:996] (1/4) Epoch 8, batch 8000, loss[loss=0.33, simple_loss=0.3934, pruned_loss=0.1333, over 21428.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3177, pruned_loss=0.08433, over 4260782.14 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:55:36,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1328772.0, ans=0.125 2023-06-22 21:56:12,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1328892.0, ans=0.0 2023-06-22 21:56:25,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.345e+02 4.453e+02 5.766e+02 9.258e+02 3.143e+03, threshold=1.153e+03, percent-clipped=22.0 2023-06-22 21:57:10,680 INFO [train.py:996] (1/4) Epoch 8, batch 8050, loss[loss=0.2371, simple_loss=0.2865, pruned_loss=0.09384, over 20283.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3212, pruned_loss=0.08519, over 4266221.84 frames. ], batch size: 703, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:57:44,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.96 vs. 
limit=15.0 2023-06-22 21:57:59,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-22 21:58:09,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329192.0, ans=0.1 2023-06-22 21:58:19,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329252.0, ans=0.1 2023-06-22 21:58:28,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-22 21:58:40,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1329312.0, ans=0.0 2023-06-22 21:58:56,256 INFO [train.py:996] (1/4) Epoch 8, batch 8100, loss[loss=0.2163, simple_loss=0.2823, pruned_loss=0.07521, over 21843.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.32, pruned_loss=0.08437, over 4261942.27 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:59:06,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1329372.0, ans=0.2 2023-06-22 21:59:47,644 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.065e+02 4.714e+02 7.153e+02 1.179e+03 2.402e+03, threshold=1.431e+03, percent-clipped=27.0 2023-06-22 22:00:43,230 INFO [train.py:996] (1/4) Epoch 8, batch 8150, loss[loss=0.2542, simple_loss=0.3535, pruned_loss=0.07746, over 21769.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3266, pruned_loss=0.08553, over 4259530.12 frames. ], batch size: 371, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:00:59,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1329732.0, ans=0.125 2023-06-22 22:01:12,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1329732.0, ans=0.125 2023-06-22 22:01:58,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1329912.0, ans=0.0 2023-06-22 22:01:58,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1329912.0, ans=0.125 2023-06-22 22:02:22,165 INFO [train.py:996] (1/4) Epoch 8, batch 8200, loss[loss=0.2121, simple_loss=0.2744, pruned_loss=0.07489, over 21531.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3203, pruned_loss=0.08374, over 4267486.07 frames. 
], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:02:32,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1329972.0, ans=0.125 2023-06-22 22:02:47,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1330032.0, ans=0.0 2023-06-22 22:02:48,778 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:03:02,699 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.983e+02 6.789e+02 1.065e+03 2.564e+03, threshold=1.358e+03, percent-clipped=14.0 2023-06-22 22:03:18,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-22 22:03:24,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1330152.0, ans=0.1 2023-06-22 22:04:02,071 INFO [train.py:996] (1/4) Epoch 8, batch 8250, loss[loss=0.2949, simple_loss=0.3718, pruned_loss=0.109, over 21562.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3187, pruned_loss=0.08393, over 4255755.08 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:04:16,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1330332.0, ans=0.125 2023-06-22 22:04:18,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1330332.0, ans=0.125 2023-06-22 22:04:33,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1330392.0, ans=0.0 2023-06-22 22:04:34,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=22.5 2023-06-22 22:04:38,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1330392.0, ans=0.125 2023-06-22 22:05:39,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-22 22:05:41,300 INFO [train.py:996] (1/4) Epoch 8, batch 8300, loss[loss=0.2077, simple_loss=0.2847, pruned_loss=0.06536, over 21226.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3166, pruned_loss=0.08203, over 4259411.25 frames. 
], batch size: 159, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:05:43,618 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:05:46,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1330572.0, ans=0.125 2023-06-22 22:05:50,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1330572.0, ans=0.1 2023-06-22 22:06:27,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.942e+02 4.371e+02 5.179e+02 7.811e+02 1.980e+03, threshold=1.036e+03, percent-clipped=4.0 2023-06-22 22:06:51,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1330752.0, ans=0.125 2023-06-22 22:07:17,087 INFO [train.py:996] (1/4) Epoch 8, batch 8350, loss[loss=0.2225, simple_loss=0.318, pruned_loss=0.06348, over 21779.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3168, pruned_loss=0.08045, over 4252787.18 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:07:35,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1330932.0, ans=0.0 2023-06-22 22:07:51,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-22 22:08:08,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1330992.0, ans=0.1 2023-06-22 22:08:56,617 INFO [train.py:996] (1/4) Epoch 8, batch 8400, loss[loss=0.1848, simple_loss=0.2814, pruned_loss=0.0441, over 21747.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3144, pruned_loss=0.07764, over 4261247.59 frames. ], batch size: 351, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:09:03,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1331172.0, ans=0.035 2023-06-22 22:09:15,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1331232.0, ans=0.125 2023-06-22 22:09:18,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1331232.0, ans=0.125 2023-06-22 22:09:19,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1331232.0, ans=0.2 2023-06-22 22:09:35,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.911e+02 4.503e+02 6.126e+02 1.860e+03, threshold=9.006e+02, percent-clipped=8.0 2023-06-22 22:09:41,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1331292.0, ans=0.0 2023-06-22 22:09:42,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. 
limit=15.0 2023-06-22 22:10:09,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1331352.0, ans=0.04949747468305833 2023-06-22 22:10:28,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1331412.0, ans=0.1 2023-06-22 22:10:34,287 INFO [train.py:996] (1/4) Epoch 8, batch 8450, loss[loss=0.2167, simple_loss=0.2814, pruned_loss=0.07599, over 21506.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3108, pruned_loss=0.07744, over 4269948.10 frames. ], batch size: 195, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:10:58,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331532.0, ans=0.1 2023-06-22 22:11:55,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1331712.0, ans=0.0 2023-06-22 22:12:12,568 INFO [train.py:996] (1/4) Epoch 8, batch 8500, loss[loss=0.2036, simple_loss=0.2721, pruned_loss=0.06755, over 21819.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3061, pruned_loss=0.07852, over 4266008.54 frames. ], batch size: 118, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:12:56,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-22 22:12:58,446 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.125e+02 4.200e+02 5.679e+02 8.112e+02 1.673e+03, threshold=1.136e+03, percent-clipped=13.0 2023-06-22 22:13:16,411 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:13:50,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1332012.0, ans=0.0 2023-06-22 22:13:54,929 INFO [train.py:996] (1/4) Epoch 8, batch 8550, loss[loss=0.2816, simple_loss=0.3688, pruned_loss=0.09718, over 21609.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3114, pruned_loss=0.08081, over 4270692.53 frames. ], batch size: 389, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:15:13,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1332252.0, ans=0.0 2023-06-22 22:15:13,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1332252.0, ans=0.125 2023-06-22 22:15:35,658 INFO [train.py:996] (1/4) Epoch 8, batch 8600, loss[loss=0.2398, simple_loss=0.3314, pruned_loss=0.07407, over 21746.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3176, pruned_loss=0.08299, over 4271895.81 frames. 
], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:15:40,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1332372.0, ans=0.09899494936611666 2023-06-22 22:16:18,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1332492.0, ans=0.2 2023-06-22 22:16:19,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332492.0, ans=0.1 2023-06-22 22:16:32,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.201e+02 4.143e+02 4.841e+02 5.659e+02 1.807e+03, threshold=9.683e+02, percent-clipped=7.0 2023-06-22 22:17:01,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-22 22:17:02,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1332612.0, ans=0.125 2023-06-22 22:17:15,235 INFO [train.py:996] (1/4) Epoch 8, batch 8650, loss[loss=0.2235, simple_loss=0.2955, pruned_loss=0.07569, over 20039.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3235, pruned_loss=0.08363, over 4267947.15 frames. ], batch size: 704, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:17:22,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332672.0, ans=0.1 2023-06-22 22:17:26,418 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-22 22:17:39,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1332732.0, ans=10.0 2023-06-22 22:17:40,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-22 22:18:29,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1332852.0, ans=10.0 2023-06-22 22:18:39,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.87 vs. limit=5.0 2023-06-22 22:18:40,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.60 vs. limit=15.0 2023-06-22 22:18:53,792 INFO [train.py:996] (1/4) Epoch 8, batch 8700, loss[loss=0.2441, simple_loss=0.3067, pruned_loss=0.09076, over 21446.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3162, pruned_loss=0.08036, over 4268089.34 frames. 
], batch size: 389, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:18:54,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1332972.0, ans=0.2 2023-06-22 22:18:59,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1332972.0, ans=0.125 2023-06-22 22:19:48,717 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.793e+02 4.063e+02 5.741e+02 9.934e+02 1.995e+03, threshold=1.148e+03, percent-clipped=26.0 2023-06-22 22:20:18,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1333212.0, ans=0.125 2023-06-22 22:20:24,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1333212.0, ans=0.0 2023-06-22 22:20:32,233 INFO [train.py:996] (1/4) Epoch 8, batch 8750, loss[loss=0.2491, simple_loss=0.3112, pruned_loss=0.09355, over 21498.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3126, pruned_loss=0.08146, over 4273187.56 frames. ], batch size: 144, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:20:35,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 22:20:37,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1333272.0, ans=0.0 2023-06-22 22:20:55,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1333332.0, ans=0.02 2023-06-22 22:20:55,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333332.0, ans=0.125 2023-06-22 22:21:10,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-22 22:21:12,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.19 vs. limit=15.0 2023-06-22 22:21:28,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333392.0, ans=0.1 2023-06-22 22:21:37,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1333392.0, ans=0.0 2023-06-22 22:21:54,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333512.0, ans=0.125 2023-06-22 22:21:54,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333512.0, ans=0.125 2023-06-22 22:22:16,521 INFO [train.py:996] (1/4) Epoch 8, batch 8800, loss[loss=0.2766, simple_loss=0.3643, pruned_loss=0.09441, over 21317.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3207, pruned_loss=0.08367, over 4276604.45 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:22:28,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1333572.0, ans=0.125 2023-06-22 22:22:31,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. 
limit=22.5 2023-06-22 22:22:52,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1333632.0, ans=0.2 2023-06-22 22:23:01,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-22 22:23:07,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.331e+02 4.564e+02 6.230e+02 9.935e+02 2.348e+03, threshold=1.246e+03, percent-clipped=15.0 2023-06-22 22:23:22,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1333752.0, ans=0.125 2023-06-22 22:23:24,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1333752.0, ans=0.125 2023-06-22 22:23:50,304 INFO [train.py:996] (1/4) Epoch 8, batch 8850, loss[loss=0.2266, simple_loss=0.3308, pruned_loss=0.06123, over 21638.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3274, pruned_loss=0.08563, over 4283334.39 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:23:57,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1333872.0, ans=0.125 2023-06-22 22:24:09,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1333872.0, ans=0.2 2023-06-22 22:25:26,318 INFO [train.py:996] (1/4) Epoch 8, batch 8900, loss[loss=0.2806, simple_loss=0.3831, pruned_loss=0.08903, over 20753.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3211, pruned_loss=0.08475, over 4278670.01 frames. ], batch size: 608, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:26:06,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1334232.0, ans=0.0 2023-06-22 22:26:20,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.551e+02 5.330e+02 7.944e+02 2.391e+03, threshold=1.066e+03, percent-clipped=3.0 2023-06-22 22:26:39,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1334352.0, ans=0.09899494936611666 2023-06-22 22:26:48,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1334412.0, ans=0.0 2023-06-22 22:26:54,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1334412.0, ans=0.0 2023-06-22 22:27:11,653 INFO [train.py:996] (1/4) Epoch 8, batch 8950, loss[loss=0.2523, simple_loss=0.3267, pruned_loss=0.08895, over 21605.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.321, pruned_loss=0.08333, over 4271798.86 frames. 
], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:27:38,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1334532.0, ans=0.125 2023-06-22 22:27:44,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334532.0, ans=0.1 2023-06-22 22:27:44,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334532.0, ans=0.125 2023-06-22 22:28:39,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-22 22:28:50,985 INFO [train.py:996] (1/4) Epoch 8, batch 9000, loss[loss=0.2161, simple_loss=0.2899, pruned_loss=0.07118, over 21575.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3152, pruned_loss=0.08236, over 4279861.62 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:28:50,986 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-22 22:29:12,148 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2658, simple_loss=0.3603, pruned_loss=0.0856, over 1796401.00 frames. 2023-06-22 22:29:12,149 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-22 22:29:25,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1334772.0, ans=0.04949747468305833 2023-06-22 22:29:27,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1334832.0, ans=0.1 2023-06-22 22:29:27,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1334832.0, ans=0.125 2023-06-22 22:30:00,392 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.159e+02 6.404e+02 9.275e+02 1.956e+03, threshold=1.281e+03, percent-clipped=15.0 2023-06-22 22:30:16,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1334952.0, ans=0.1 2023-06-22 22:30:27,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1334952.0, ans=0.0 2023-06-22 22:30:51,429 INFO [train.py:996] (1/4) Epoch 8, batch 9050, loss[loss=0.2235, simple_loss=0.3059, pruned_loss=0.07052, over 21562.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.311, pruned_loss=0.07924, over 4277764.43 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:30:59,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. 
limit=6.0 2023-06-22 22:31:01,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1335072.0, ans=0.0 2023-06-22 22:31:21,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335132.0, ans=0.1 2023-06-22 22:31:31,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1335192.0, ans=0.125 2023-06-22 22:31:35,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1335192.0, ans=0.125 2023-06-22 22:32:06,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1335252.0, ans=0.125 2023-06-22 22:32:23,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1335312.0, ans=0.125 2023-06-22 22:32:34,030 INFO [train.py:996] (1/4) Epoch 8, batch 9100, loss[loss=0.2455, simple_loss=0.3427, pruned_loss=0.07412, over 21619.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.316, pruned_loss=0.08159, over 4275017.51 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:32:34,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1335372.0, ans=0.025 2023-06-22 22:33:15,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335492.0, ans=0.1 2023-06-22 22:33:32,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.374e+02 5.511e+02 8.272e+02 1.713e+03, threshold=1.102e+03, percent-clipped=4.0 2023-06-22 22:34:15,757 INFO [train.py:996] (1/4) Epoch 8, batch 9150, loss[loss=0.2452, simple_loss=0.3299, pruned_loss=0.0802, over 21578.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3187, pruned_loss=0.07904, over 4267902.83 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:35:18,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335792.0, ans=0.1 2023-06-22 22:35:29,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1335852.0, ans=0.035 2023-06-22 22:35:38,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-22 22:36:01,129 INFO [train.py:996] (1/4) Epoch 8, batch 9200, loss[loss=0.2462, simple_loss=0.3176, pruned_loss=0.08737, over 21289.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3196, pruned_loss=0.07875, over 4266480.17 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:36:52,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1336092.0, ans=10.0 2023-06-22 22:36:59,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.939e+02 4.377e+02 5.436e+02 8.538e+02 1.737e+03, threshold=1.087e+03, percent-clipped=12.0 2023-06-22 22:37:15,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.68 vs. 
limit=5.0 2023-06-22 22:37:39,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1336272.0, ans=0.125 2023-06-22 22:37:40,997 INFO [train.py:996] (1/4) Epoch 8, batch 9250, loss[loss=0.2193, simple_loss=0.2839, pruned_loss=0.07733, over 21605.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3207, pruned_loss=0.0816, over 4274196.94 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:37:49,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1336272.0, ans=0.125 2023-06-22 22:38:39,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336452.0, ans=0.1 2023-06-22 22:38:41,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1336452.0, ans=10.0 2023-06-22 22:38:47,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1336452.0, ans=0.0 2023-06-22 22:39:16,215 INFO [train.py:996] (1/4) Epoch 8, batch 9300, loss[loss=0.2262, simple_loss=0.2885, pruned_loss=0.08191, over 21891.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3178, pruned_loss=0.08233, over 4271089.82 frames. ], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:39:18,766 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-22 22:39:25,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1336572.0, ans=0.1 2023-06-22 22:39:42,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1336632.0, ans=0.0 2023-06-22 22:40:09,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-22 22:40:11,049 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.867e+02 5.198e+02 7.448e+02 1.175e+03 2.635e+03, threshold=1.490e+03, percent-clipped=31.0 2023-06-22 22:40:37,415 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:40:43,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1336812.0, ans=0.2 2023-06-22 22:40:51,254 INFO [train.py:996] (1/4) Epoch 8, batch 9350, loss[loss=0.2582, simple_loss=0.3333, pruned_loss=0.09155, over 21409.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3263, pruned_loss=0.08367, over 4276724.45 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:41:15,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-22 22:41:49,099 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:41:49,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-22 22:41:51,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=22.5 2023-06-22 22:42:03,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1337052.0, ans=0.125 2023-06-22 22:42:34,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1337112.0, ans=0.2 2023-06-22 22:42:36,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=15.0 2023-06-22 22:42:36,766 INFO [train.py:996] (1/4) Epoch 8, batch 9400, loss[loss=0.2384, simple_loss=0.3018, pruned_loss=0.08754, over 21273.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3275, pruned_loss=0.08387, over 4276272.84 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:42:58,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1337232.0, ans=0.0 2023-06-22 22:43:24,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-22 22:43:32,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.546e+02 6.111e+02 8.751e+02 2.078e+03, threshold=1.222e+03, percent-clipped=3.0 2023-06-22 22:43:55,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-22 22:44:16,706 INFO [train.py:996] (1/4) Epoch 8, batch 9450, loss[loss=0.2195, simple_loss=0.2797, pruned_loss=0.07969, over 21746.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3207, pruned_loss=0.08331, over 4264310.26 frames. ], batch size: 300, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:44:58,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1337592.0, ans=0.125 2023-06-22 22:45:54,762 INFO [train.py:996] (1/4) Epoch 8, batch 9500, loss[loss=0.1845, simple_loss=0.2763, pruned_loss=0.04637, over 21681.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3145, pruned_loss=0.08075, over 4258365.64 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:45:57,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1337772.0, ans=0.125 2023-06-22 22:46:33,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-22 22:46:50,872 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.640e+02 7.713e+02 1.096e+03 2.487e+03, threshold=1.543e+03, percent-clipped=16.0 2023-06-22 22:47:18,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1338012.0, ans=0.125 2023-06-22 22:47:34,325 INFO [train.py:996] (1/4) Epoch 8, batch 9550, loss[loss=0.287, simple_loss=0.3713, pruned_loss=0.1014, over 21626.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3194, pruned_loss=0.08359, over 4258901.38 frames. 
], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:47:46,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1338072.0, ans=0.0 2023-06-22 22:48:36,162 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-22 22:48:37,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-22 22:49:07,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1338372.0, ans=10.0 2023-06-22 22:49:14,079 INFO [train.py:996] (1/4) Epoch 8, batch 9600, loss[loss=0.1746, simple_loss=0.2851, pruned_loss=0.03205, over 20776.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3203, pruned_loss=0.08488, over 4269439.82 frames. ], batch size: 607, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:49:23,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338372.0, ans=0.1 2023-06-22 22:49:48,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.64 vs. limit=10.0 2023-06-22 22:50:03,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.133e+02 5.747e+02 7.464e+02 1.666e+03, threshold=1.149e+03, percent-clipped=1.0 2023-06-22 22:50:16,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1338552.0, ans=0.1 2023-06-22 22:50:49,781 INFO [train.py:996] (1/4) Epoch 8, batch 9650, loss[loss=0.3066, simple_loss=0.3624, pruned_loss=0.1253, over 21401.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3184, pruned_loss=0.08479, over 4274138.79 frames. ], batch size: 508, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:51:12,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1338732.0, ans=0.0 2023-06-22 22:52:28,572 INFO [train.py:996] (1/4) Epoch 8, batch 9700, loss[loss=0.2019, simple_loss=0.2853, pruned_loss=0.05928, over 21754.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3206, pruned_loss=0.08447, over 4280721.79 frames. ], batch size: 247, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:52:32,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1338972.0, ans=0.125 2023-06-22 22:52:39,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=15.0 2023-06-22 22:52:49,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1339032.0, ans=0.05 2023-06-22 22:53:14,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1339092.0, ans=0.05 2023-06-22 22:53:18,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.568e+02 6.321e+02 8.796e+02 1.656e+03, threshold=1.264e+03, percent-clipped=3.0 2023-06-22 22:53:23,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1339152.0, ans=0.1 2023-06-22 22:53:23,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1339152.0, ans=0.2 2023-06-22 22:53:57,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1339212.0, ans=0.125 2023-06-22 22:53:59,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1339212.0, ans=0.2 2023-06-22 22:54:05,591 INFO [train.py:996] (1/4) Epoch 8, batch 9750, loss[loss=0.2023, simple_loss=0.2731, pruned_loss=0.06572, over 21562.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3133, pruned_loss=0.08335, over 4274447.81 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:54:09,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1339272.0, ans=0.0 2023-06-22 22:54:11,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 22:54:28,556 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-22 22:55:39,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1339512.0, ans=0.125 2023-06-22 22:55:42,098 INFO [train.py:996] (1/4) Epoch 8, batch 9800, loss[loss=0.2308, simple_loss=0.2953, pruned_loss=0.08321, over 21620.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3137, pruned_loss=0.08379, over 4276246.58 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:55:55,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1339572.0, ans=0.0 2023-06-22 22:56:31,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 3.646e+02 4.309e+02 6.187e+02 1.699e+03, threshold=8.618e+02, percent-clipped=3.0 2023-06-22 22:56:32,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-22 22:57:19,961 INFO [train.py:996] (1/4) Epoch 8, batch 9850, loss[loss=0.2612, simple_loss=0.3089, pruned_loss=0.1068, over 21576.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3096, pruned_loss=0.08373, over 4271795.52 frames. 
], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:57:29,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1339872.0, ans=0.125 2023-06-22 22:57:39,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1339932.0, ans=0.125 2023-06-22 22:57:50,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1339992.0, ans=0.2 2023-06-22 22:58:09,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1339992.0, ans=0.0 2023-06-22 22:58:31,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1340112.0, ans=0.015 2023-06-22 22:58:54,094 INFO [train.py:996] (1/4) Epoch 8, batch 9900, loss[loss=0.2133, simple_loss=0.2902, pruned_loss=0.06822, over 21760.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3076, pruned_loss=0.08388, over 4272122.06 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:59:38,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1340292.0, ans=0.5 2023-06-22 22:59:45,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.510e+02 5.793e+02 9.115e+02 1.830e+03, threshold=1.159e+03, percent-clipped=29.0 2023-06-22 23:00:04,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1340352.0, ans=0.125 2023-06-22 23:00:13,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-22 23:00:25,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1340412.0, ans=0.09899494936611666 2023-06-22 23:00:29,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1340412.0, ans=0.0 2023-06-22 23:00:30,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1340412.0, ans=0.125 2023-06-22 23:00:33,432 INFO [train.py:996] (1/4) Epoch 8, batch 9950, loss[loss=0.2312, simple_loss=0.2922, pruned_loss=0.08509, over 21525.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3131, pruned_loss=0.08679, over 4260490.17 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:00:34,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1340472.0, ans=0.1 2023-06-22 23:00:48,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1340532.0, ans=0.025 2023-06-22 23:01:18,731 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:01:40,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1340652.0, ans=0.125 2023-06-22 23:02:13,508 INFO [train.py:996] (1/4) Epoch 8, batch 10000, loss[loss=0.2386, simple_loss=0.3107, pruned_loss=0.08329, over 21611.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3077, pruned_loss=0.08487, over 4259429.63 frames. 
], batch size: 389, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:02:27,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1340772.0, ans=0.0 2023-06-22 23:02:35,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1340832.0, ans=0.0 2023-06-22 23:03:04,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1340892.0, ans=0.125 2023-06-22 23:03:05,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.495e+02 6.092e+02 8.521e+02 2.124e+03, threshold=1.218e+03, percent-clipped=12.0 2023-06-22 23:03:54,457 INFO [train.py:996] (1/4) Epoch 8, batch 10050, loss[loss=0.2214, simple_loss=0.2777, pruned_loss=0.08253, over 20762.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3105, pruned_loss=0.08505, over 4257923.05 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:04:09,741 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-22 23:04:21,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-22 23:04:36,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1341192.0, ans=0.0 2023-06-22 23:05:33,507 INFO [train.py:996] (1/4) Epoch 8, batch 10100, loss[loss=0.269, simple_loss=0.3308, pruned_loss=0.1036, over 21620.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3062, pruned_loss=0.08312, over 4258537.13 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:05:53,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-22 23:06:07,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1341432.0, ans=0.125 2023-06-22 23:06:40,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.964e+02 4.513e+02 5.773e+02 8.039e+02 1.456e+03, threshold=1.155e+03, percent-clipped=7.0 2023-06-22 23:06:43,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1341552.0, ans=0.125 2023-06-22 23:06:56,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1341612.0, ans=0.0 2023-06-22 23:07:08,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341612.0, ans=0.1 2023-06-22 23:07:16,819 INFO [train.py:996] (1/4) Epoch 8, batch 10150, loss[loss=0.2142, simple_loss=0.28, pruned_loss=0.07414, over 17071.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.312, pruned_loss=0.08579, over 4258866.39 frames. ], batch size: 60, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:07:34,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1341732.0, ans=0.0 2023-06-22 23:07:38,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.03 vs. 
limit=12.0 2023-06-22 23:08:55,277 INFO [train.py:996] (1/4) Epoch 8, batch 10200, loss[loss=0.2041, simple_loss=0.2727, pruned_loss=0.0677, over 16182.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3109, pruned_loss=0.08318, over 4245921.81 frames. ], batch size: 63, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:09:49,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1342092.0, ans=0.125 2023-06-22 23:09:55,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-22 23:09:59,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.693e+02 4.053e+02 5.323e+02 7.160e+02 1.292e+03, threshold=1.065e+03, percent-clipped=4.0 2023-06-22 23:10:06,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1342152.0, ans=0.125 2023-06-22 23:10:09,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1342152.0, ans=0.0 2023-06-22 23:10:13,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1342212.0, ans=0.07 2023-06-22 23:10:35,063 INFO [train.py:996] (1/4) Epoch 8, batch 10250, loss[loss=0.2234, simple_loss=0.3033, pruned_loss=0.07175, over 21328.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3063, pruned_loss=0.07775, over 4245473.66 frames. ], batch size: 159, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:10:38,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1342272.0, ans=0.125 2023-06-22 23:11:00,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1342332.0, ans=0.125 2023-06-22 23:11:43,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1342452.0, ans=0.125 2023-06-22 23:12:14,726 INFO [train.py:996] (1/4) Epoch 8, batch 10300, loss[loss=0.2278, simple_loss=0.3093, pruned_loss=0.07309, over 21289.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3093, pruned_loss=0.07751, over 4248607.56 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:13:12,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-22 23:13:14,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1342692.0, ans=0.125 2023-06-22 23:13:20,052 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.156e+02 6.277e+02 8.296e+02 2.131e+03, threshold=1.255e+03, percent-clipped=15.0 2023-06-22 23:13:47,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1342812.0, ans=0.0 2023-06-22 23:14:00,726 INFO [train.py:996] (1/4) Epoch 8, batch 10350, loss[loss=0.1869, simple_loss=0.2362, pruned_loss=0.06884, over 21211.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3114, pruned_loss=0.07809, over 4257780.96 frames. 
], batch size: 143, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:14:24,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1342872.0, ans=0.025 2023-06-22 23:14:31,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1342932.0, ans=0.0 2023-06-22 23:14:33,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-22 23:14:41,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-22 23:14:52,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1342992.0, ans=0.0 2023-06-22 23:15:43,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1343112.0, ans=0.125 2023-06-22 23:15:51,365 INFO [train.py:996] (1/4) Epoch 8, batch 10400, loss[loss=0.2251, simple_loss=0.2852, pruned_loss=0.08252, over 21639.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3065, pruned_loss=0.07683, over 4259916.74 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:15:51,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1343172.0, ans=0.125 2023-06-22 23:16:06,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343232.0, ans=0.1 2023-06-22 23:16:36,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1343292.0, ans=0.0 2023-06-22 23:16:45,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.781e+02 6.358e+02 9.315e+02 2.129e+03, threshold=1.272e+03, percent-clipped=10.0 2023-06-22 23:16:46,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1343352.0, ans=0.0 2023-06-22 23:16:49,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1343352.0, ans=0.125 2023-06-22 23:17:29,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1343412.0, ans=0.1 2023-06-22 23:17:31,417 INFO [train.py:996] (1/4) Epoch 8, batch 10450, loss[loss=0.2232, simple_loss=0.2961, pruned_loss=0.07513, over 21349.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3101, pruned_loss=0.07985, over 4266542.88 frames. ], batch size: 176, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:17:49,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1343532.0, ans=0.0 2023-06-22 23:19:09,679 INFO [train.py:996] (1/4) Epoch 8, batch 10500, loss[loss=0.2169, simple_loss=0.2821, pruned_loss=0.07583, over 21628.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3073, pruned_loss=0.07861, over 4257158.58 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:19:15,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.43 vs. 
limit=22.5 2023-06-22 23:20:03,204 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.164e+02 4.877e+02 7.502e+02 1.116e+03 2.000e+03, threshold=1.500e+03, percent-clipped=17.0 2023-06-22 23:20:05,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-22 23:20:51,450 INFO [train.py:996] (1/4) Epoch 8, batch 10550, loss[loss=0.2328, simple_loss=0.285, pruned_loss=0.09031, over 21279.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3032, pruned_loss=0.07809, over 4257545.50 frames. ], batch size: 144, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:20:52,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1344072.0, ans=0.125 2023-06-22 23:20:52,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1344072.0, ans=0.125 2023-06-22 23:21:08,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1344132.0, ans=0.2 2023-06-22 23:21:33,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1344192.0, ans=0.125 2023-06-22 23:21:36,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1344192.0, ans=0.0 2023-06-22 23:21:45,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1344252.0, ans=0.025 2023-06-22 23:22:37,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1344312.0, ans=0.125 2023-06-22 23:22:43,188 INFO [train.py:996] (1/4) Epoch 8, batch 10600, loss[loss=0.2069, simple_loss=0.2842, pruned_loss=0.06484, over 21399.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3009, pruned_loss=0.07724, over 4240262.31 frames. ], batch size: 211, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:23:42,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1344492.0, ans=0.125 2023-06-22 23:23:49,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.057e+02 5.623e+02 8.035e+02 1.796e+03, threshold=1.125e+03, percent-clipped=5.0 2023-06-22 23:23:49,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1344552.0, ans=0.125 2023-06-22 23:24:22,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1344612.0, ans=0.125 2023-06-22 23:24:35,515 INFO [train.py:996] (1/4) Epoch 8, batch 10650, loss[loss=0.138, simple_loss=0.1923, pruned_loss=0.04189, over 16680.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3038, pruned_loss=0.0763, over 4235035.82 frames. 
], batch size: 62, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:24:50,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1344672.0, ans=0.0 2023-06-22 23:24:51,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1344672.0, ans=0.125 2023-06-22 23:25:03,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344732.0, ans=0.1 2023-06-22 23:25:36,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1344792.0, ans=0.125 2023-06-22 23:26:06,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1344852.0, ans=0.125 2023-06-22 23:26:14,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1344912.0, ans=0.09899494936611666 2023-06-22 23:26:30,733 INFO [train.py:996] (1/4) Epoch 8, batch 10700, loss[loss=0.2984, simple_loss=0.3637, pruned_loss=0.1165, over 21546.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3021, pruned_loss=0.07631, over 4238319.88 frames. ], batch size: 389, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:26:40,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1344972.0, ans=0.2 2023-06-22 23:26:50,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1345032.0, ans=0.2 2023-06-22 23:26:53,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-22 23:26:58,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1345032.0, ans=0.04949747468305833 2023-06-22 23:27:36,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 5.189e+02 6.885e+02 9.178e+02 1.741e+03, threshold=1.377e+03, percent-clipped=11.0 2023-06-22 23:27:37,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-22 23:27:45,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1345152.0, ans=0.125 2023-06-22 23:28:09,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1345212.0, ans=0.0 2023-06-22 23:28:13,112 INFO [train.py:996] (1/4) Epoch 8, batch 10750, loss[loss=0.2513, simple_loss=0.349, pruned_loss=0.07676, over 21751.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3134, pruned_loss=0.08111, over 4245132.85 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:28:49,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-22 23:29:46,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-06-22 23:29:55,297 INFO [train.py:996] (1/4) Epoch 8, batch 10800, loss[loss=0.2837, simple_loss=0.3528, pruned_loss=0.1073, over 21745.00 frames. 
], tot_loss[loss=0.243, simple_loss=0.32, pruned_loss=0.08296, over 4251115.58 frames. ], batch size: 332, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:29:55,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1345572.0, ans=0.125 2023-06-22 23:30:39,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1345632.0, ans=0.1 2023-06-22 23:30:41,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1345692.0, ans=0.2 2023-06-22 23:30:42,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1345692.0, ans=0.0 2023-06-22 23:31:04,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.247e+02 4.824e+02 6.509e+02 9.810e+02 2.428e+03, threshold=1.302e+03, percent-clipped=4.0 2023-06-22 23:31:08,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1345752.0, ans=0.2 2023-06-22 23:31:39,586 INFO [train.py:996] (1/4) Epoch 8, batch 10850, loss[loss=0.2304, simple_loss=0.3307, pruned_loss=0.06503, over 20828.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3197, pruned_loss=0.08254, over 4247886.86 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:31:50,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1345872.0, ans=0.1 2023-06-22 23:32:59,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-22 23:33:12,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1346112.0, ans=0.0 2023-06-22 23:33:13,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1346112.0, ans=0.125 2023-06-22 23:33:19,405 INFO [train.py:996] (1/4) Epoch 8, batch 10900, loss[loss=0.2235, simple_loss=0.3216, pruned_loss=0.06276, over 21813.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3133, pruned_loss=0.08056, over 4258587.74 frames. ], batch size: 371, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:33:53,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1346232.0, ans=0.04949747468305833 2023-06-22 23:33:53,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1346232.0, ans=0.125 2023-06-22 23:34:23,721 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 3.987e+02 5.547e+02 7.924e+02 1.642e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-22 23:34:37,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1346412.0, ans=0.2 2023-06-22 23:35:00,086 INFO [train.py:996] (1/4) Epoch 8, batch 10950, loss[loss=0.2147, simple_loss=0.2858, pruned_loss=0.07174, over 21837.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3084, pruned_loss=0.07843, over 4252865.26 frames. ], batch size: 352, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:35:54,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.72 vs. 
limit=22.5 2023-06-22 23:36:38,561 INFO [train.py:996] (1/4) Epoch 8, batch 11000, loss[loss=0.2216, simple_loss=0.291, pruned_loss=0.07612, over 21860.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3056, pruned_loss=0.07919, over 4263085.81 frames. ], batch size: 332, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:36:42,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1346772.0, ans=0.0 2023-06-22 23:37:40,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-22 23:37:43,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.878e+02 3.827e+02 4.499e+02 6.468e+02 1.217e+03, threshold=8.999e+02, percent-clipped=2.0 2023-06-22 23:37:45,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=15.0 2023-06-22 23:37:51,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1346952.0, ans=0.0 2023-06-22 23:38:15,781 INFO [train.py:996] (1/4) Epoch 8, batch 11050, loss[loss=0.2265, simple_loss=0.2882, pruned_loss=0.08245, over 21743.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3035, pruned_loss=0.07987, over 4263879.38 frames. ], batch size: 371, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:38:17,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1347072.0, ans=0.0 2023-06-22 23:38:26,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-22 23:38:35,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347132.0, ans=0.1 2023-06-22 23:38:51,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1347132.0, ans=0.0 2023-06-22 23:39:15,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1347192.0, ans=0.0 2023-06-22 23:39:48,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1347312.0, ans=0.0 2023-06-22 23:39:54,419 INFO [train.py:996] (1/4) Epoch 8, batch 11100, loss[loss=0.192, simple_loss=0.2647, pruned_loss=0.05968, over 21482.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3029, pruned_loss=0.08011, over 4260991.51 frames. 
], batch size: 230, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:40:11,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1347372.0, ans=0.125 2023-06-22 23:40:22,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1347432.0, ans=0.07 2023-06-22 23:40:24,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1347432.0, ans=0.0 2023-06-22 23:40:37,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1347492.0, ans=0.0 2023-06-22 23:41:00,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.372e+02 5.317e+02 7.818e+02 1.562e+03, threshold=1.063e+03, percent-clipped=13.0 2023-06-22 23:41:10,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-22 23:41:14,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347612.0, ans=0.1 2023-06-22 23:41:34,685 INFO [train.py:996] (1/4) Epoch 8, batch 11150, loss[loss=0.2376, simple_loss=0.2986, pruned_loss=0.08827, over 21640.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3004, pruned_loss=0.07942, over 4266639.95 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:41:56,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1347732.0, ans=0.125 2023-06-22 23:42:26,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1347792.0, ans=0.125 2023-06-22 23:42:37,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1347852.0, ans=0.125 2023-06-22 23:43:15,328 INFO [train.py:996] (1/4) Epoch 8, batch 11200, loss[loss=0.2355, simple_loss=0.2978, pruned_loss=0.08659, over 21635.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2995, pruned_loss=0.07921, over 4254706.90 frames. ], batch size: 298, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:43:37,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1348032.0, ans=0.125 2023-06-22 23:44:19,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.266e+02 5.477e+02 7.611e+02 1.407e+03, threshold=1.095e+03, percent-clipped=4.0 2023-06-22 23:44:37,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-22 23:44:45,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1348212.0, ans=0.125 2023-06-22 23:44:50,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348212.0, ans=0.1 2023-06-22 23:44:53,143 INFO [train.py:996] (1/4) Epoch 8, batch 11250, loss[loss=0.2149, simple_loss=0.3053, pruned_loss=0.06229, over 21789.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2997, pruned_loss=0.07859, over 4265547.78 frames. 
], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:45:25,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-22 23:45:51,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1348392.0, ans=0.5 2023-06-22 23:45:54,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1348452.0, ans=0.125 2023-06-22 23:46:07,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1348452.0, ans=0.0 2023-06-22 23:46:28,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1348512.0, ans=0.0 2023-06-22 23:46:31,412 INFO [train.py:996] (1/4) Epoch 8, batch 11300, loss[loss=0.2754, simple_loss=0.3437, pruned_loss=0.1036, over 20661.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3013, pruned_loss=0.07943, over 4261384.89 frames. ], batch size: 607, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:46:51,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1348632.0, ans=0.2 2023-06-22 23:47:01,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1348632.0, ans=0.125 2023-06-22 23:47:15,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1348692.0, ans=0.025 2023-06-22 23:47:37,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=15.0 2023-06-22 23:47:39,343 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 3.884e+02 4.784e+02 6.961e+02 1.768e+03, threshold=9.568e+02, percent-clipped=7.0 2023-06-22 23:47:48,316 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:47:52,943 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:48:11,959 INFO [train.py:996] (1/4) Epoch 8, batch 11350, loss[loss=0.1787, simple_loss=0.2391, pruned_loss=0.05913, over 20799.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3026, pruned_loss=0.07924, over 4269066.55 frames. ], batch size: 609, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:48:16,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1348872.0, ans=0.015 2023-06-22 23:48:21,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1348872.0, ans=0.2 2023-06-22 23:49:33,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-22 23:49:54,109 INFO [train.py:996] (1/4) Epoch 8, batch 11400, loss[loss=0.2629, simple_loss=0.3392, pruned_loss=0.09329, over 21857.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3091, pruned_loss=0.08209, over 4270291.36 frames. 
], batch size: 317, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:49:58,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.34 vs. limit=10.0 2023-06-22 23:50:17,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1349232.0, ans=0.2 2023-06-22 23:51:07,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.448e+02 6.055e+02 8.360e+02 1.667e+03, threshold=1.211e+03, percent-clipped=10.0 2023-06-22 23:51:07,782 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:51:09,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1349352.0, ans=0.0 2023-06-22 23:51:24,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1349412.0, ans=0.0 2023-06-22 23:51:39,993 INFO [train.py:996] (1/4) Epoch 8, batch 11450, loss[loss=0.2432, simple_loss=0.3118, pruned_loss=0.08734, over 21299.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3096, pruned_loss=0.08077, over 4274611.41 frames. ], batch size: 176, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:51:44,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.33 vs. limit=15.0 2023-06-22 23:52:12,176 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:52:17,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1349592.0, ans=0.1 2023-06-22 23:52:28,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1349592.0, ans=0.125 2023-06-22 23:52:53,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1349652.0, ans=0.2 2023-06-22 23:53:17,207 INFO [train.py:996] (1/4) Epoch 8, batch 11500, loss[loss=0.2581, simple_loss=0.3246, pruned_loss=0.09574, over 21187.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.314, pruned_loss=0.0823, over 4274061.61 frames. ], batch size: 143, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:53:34,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.90 vs. limit=22.5 2023-06-22 23:54:06,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-22 23:54:22,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.405e+02 5.880e+02 8.915e+02 1.909e+03, threshold=1.176e+03, percent-clipped=7.0 2023-06-22 23:54:44,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1350012.0, ans=0.2 2023-06-22 23:54:55,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1350012.0, ans=0.125 2023-06-22 23:55:04,825 INFO [train.py:996] (1/4) Epoch 8, batch 11550, loss[loss=0.2813, simple_loss=0.3681, pruned_loss=0.09724, over 21835.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3199, pruned_loss=0.08235, over 4272647.06 frames. 
], batch size: 316, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:55:13,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1350072.0, ans=0.125 2023-06-22 23:55:31,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-22 23:55:32,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1350132.0, ans=0.1 2023-06-22 23:56:12,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=15.0 2023-06-22 23:56:46,844 INFO [train.py:996] (1/4) Epoch 8, batch 11600, loss[loss=0.2963, simple_loss=0.3948, pruned_loss=0.09893, over 21678.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.337, pruned_loss=0.08565, over 4268409.20 frames. ], batch size: 389, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:57:10,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1350432.0, ans=0.125 2023-06-22 23:57:49,977 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.035e+02 5.071e+02 7.210e+02 9.611e+02 2.245e+03, threshold=1.442e+03, percent-clipped=13.0 2023-06-22 23:58:15,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-22 23:58:27,205 INFO [train.py:996] (1/4) Epoch 8, batch 11650, loss[loss=0.2256, simple_loss=0.2984, pruned_loss=0.07645, over 21761.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3412, pruned_loss=0.0856, over 4274393.18 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:58:42,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1350672.0, ans=0.0 2023-06-22 23:59:05,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1350792.0, ans=0.125 2023-06-22 23:59:52,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1350912.0, ans=0.0 2023-06-23 00:00:05,898 INFO [train.py:996] (1/4) Epoch 8, batch 11700, loss[loss=0.2517, simple_loss=0.295, pruned_loss=0.1042, over 21307.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3336, pruned_loss=0.08548, over 4261120.18 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:00:14,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1350972.0, ans=0.125 2023-06-23 00:01:08,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.484e+02 5.547e+02 7.973e+02 1.731e+03, threshold=1.109e+03, percent-clipped=2.0 2023-06-23 00:01:32,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.19 vs. limit=22.5 2023-06-23 00:01:45,121 INFO [train.py:996] (1/4) Epoch 8, batch 11750, loss[loss=0.2436, simple_loss=0.3108, pruned_loss=0.08815, over 21418.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3231, pruned_loss=0.08451, over 4269165.87 frames. 
], batch size: 131, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:02:25,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1351392.0, ans=0.0 2023-06-23 00:02:52,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1351452.0, ans=0.125 2023-06-23 00:03:05,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1351452.0, ans=0.05 2023-06-23 00:03:31,060 INFO [train.py:996] (1/4) Epoch 8, batch 11800, loss[loss=0.2164, simple_loss=0.3092, pruned_loss=0.06175, over 21585.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3243, pruned_loss=0.08654, over 4275280.94 frames. ], batch size: 230, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:03:47,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1351632.0, ans=0.125 2023-06-23 00:03:55,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-23 00:04:20,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1351692.0, ans=0.125 2023-06-23 00:04:32,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1351752.0, ans=0.2 2023-06-23 00:04:33,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.952e+02 6.755e+02 1.112e+03 2.056e+03, threshold=1.351e+03, percent-clipped=25.0 2023-06-23 00:05:06,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.74 vs. limit=15.0 2023-06-23 00:05:11,136 INFO [train.py:996] (1/4) Epoch 8, batch 11850, loss[loss=0.2112, simple_loss=0.3235, pruned_loss=0.04948, over 20847.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3255, pruned_loss=0.08547, over 4280966.89 frames. ], batch size: 608, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:05:13,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-23 00:05:26,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1351932.0, ans=0.1 2023-06-23 00:05:28,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-23 00:05:40,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.61 vs. limit=6.0 2023-06-23 00:06:09,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-23 00:06:21,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1352052.0, ans=0.0 2023-06-23 00:06:52,055 INFO [train.py:996] (1/4) Epoch 8, batch 11900, loss[loss=0.2288, simple_loss=0.2994, pruned_loss=0.07915, over 21583.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3254, pruned_loss=0.08325, over 4270436.45 frames. 
], batch size: 230, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:07:14,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1352232.0, ans=0.2 2023-06-23 00:07:26,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-23 00:07:55,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1352352.0, ans=0.125 2023-06-23 00:07:57,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-23 00:08:07,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.101e+02 5.216e+02 6.925e+02 1.642e+03, threshold=1.043e+03, percent-clipped=1.0 2023-06-23 00:08:17,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1352412.0, ans=0.2 2023-06-23 00:08:27,614 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:08:30,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1352412.0, ans=0.0 2023-06-23 00:08:34,716 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-23 00:08:35,029 INFO [train.py:996] (1/4) Epoch 8, batch 11950, loss[loss=0.2043, simple_loss=0.2859, pruned_loss=0.0613, over 21418.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3263, pruned_loss=0.08044, over 4265276.46 frames. ], batch size: 131, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:08:38,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352472.0, ans=0.1 2023-06-23 00:10:12,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.53 vs. limit=10.0 2023-06-23 00:10:13,534 INFO [train.py:996] (1/4) Epoch 8, batch 12000, loss[loss=0.2185, simple_loss=0.2757, pruned_loss=0.08062, over 21247.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3189, pruned_loss=0.078, over 4270298.44 frames. ], batch size: 160, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:10:13,535 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 00:10:25,170 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.4141, 3.9001, 3.8883, 2.4131], device='cuda:1') 2023-06-23 00:10:32,698 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2606, simple_loss=0.356, pruned_loss=0.08257, over 1796401.00 frames. 2023-06-23 00:10:32,699 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 00:11:09,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1352832.0, ans=0.0 2023-06-23 00:11:39,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 4.103e+02 5.711e+02 8.012e+02 1.968e+03, threshold=1.142e+03, percent-clipped=13.0 2023-06-23 00:12:11,473 INFO [train.py:996] (1/4) Epoch 8, batch 12050, loss[loss=0.2676, simple_loss=0.3351, pruned_loss=0.1001, over 21814.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3149, pruned_loss=0.07963, over 4271378.33 frames. ], batch size: 124, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:12:44,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.04 vs. limit=10.0 2023-06-23 00:12:54,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1353192.0, ans=0.1 2023-06-23 00:13:06,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1353192.0, ans=0.035 2023-06-23 00:13:07,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-23 00:13:22,087 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-23 00:13:29,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1353252.0, ans=0.125 2023-06-23 00:13:32,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1353312.0, ans=0.2 2023-06-23 00:13:53,233 INFO [train.py:996] (1/4) Epoch 8, batch 12100, loss[loss=0.2667, simple_loss=0.3446, pruned_loss=0.0944, over 21347.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.322, pruned_loss=0.08544, over 4278422.07 frames. ], batch size: 548, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:14:14,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.94 vs. limit=15.0 2023-06-23 00:14:51,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1353492.0, ans=0.0 2023-06-23 00:15:02,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1353552.0, ans=0.125 2023-06-23 00:15:04,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 5.144e+02 7.244e+02 1.095e+03 2.232e+03, threshold=1.449e+03, percent-clipped=22.0 2023-06-23 00:15:22,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1353612.0, ans=0.2 2023-06-23 00:15:45,923 INFO [train.py:996] (1/4) Epoch 8, batch 12150, loss[loss=0.2173, simple_loss=0.2943, pruned_loss=0.07014, over 21202.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3251, pruned_loss=0.08429, over 4274904.89 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:15:46,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353672.0, ans=0.1 2023-06-23 00:15:51,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. 
limit=15.0 2023-06-23 00:16:21,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1353792.0, ans=0.125 2023-06-23 00:16:21,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1353792.0, ans=0.04949747468305833 2023-06-23 00:17:25,434 INFO [train.py:996] (1/4) Epoch 8, batch 12200, loss[loss=0.213, simple_loss=0.2732, pruned_loss=0.07635, over 21630.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.32, pruned_loss=0.08315, over 4270298.12 frames. ], batch size: 231, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:17:35,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1353972.0, ans=0.0 2023-06-23 00:18:16,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1354152.0, ans=0.2 2023-06-23 00:18:27,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.597e+02 6.328e+02 9.392e+02 1.574e+03, threshold=1.266e+03, percent-clipped=2.0 2023-06-23 00:19:03,041 INFO [train.py:996] (1/4) Epoch 8, batch 12250, loss[loss=0.2316, simple_loss=0.301, pruned_loss=0.08108, over 21511.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3119, pruned_loss=0.08004, over 4264676.01 frames. ], batch size: 509, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:19:29,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1354332.0, ans=0.125 2023-06-23 00:19:38,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1354392.0, ans=0.04949747468305833 2023-06-23 00:19:38,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1354392.0, ans=0.0 2023-06-23 00:19:38,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1354392.0, ans=0.04949747468305833 2023-06-23 00:19:42,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1354392.0, ans=10.0 2023-06-23 00:20:32,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1354512.0, ans=0.2 2023-06-23 00:20:39,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-23 00:20:41,602 INFO [train.py:996] (1/4) Epoch 8, batch 12300, loss[loss=0.225, simple_loss=0.3172, pruned_loss=0.06641, over 21060.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3029, pruned_loss=0.07425, over 4255573.26 frames. ], batch size: 607, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:21:02,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-23 00:21:23,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354692.0, ans=0.1 2023-06-23 00:21:29,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.23 vs. 
limit=22.5 2023-06-23 00:21:41,018 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.472e+02 4.122e+02 6.339e+02 8.293e+02 1.636e+03, threshold=1.268e+03, percent-clipped=3.0 2023-06-23 00:22:22,504 INFO [train.py:996] (1/4) Epoch 8, batch 12350, loss[loss=0.2344, simple_loss=0.3126, pruned_loss=0.07812, over 21653.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3074, pruned_loss=0.07386, over 4262079.51 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:22:34,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1354872.0, ans=0.0 2023-06-23 00:22:36,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354872.0, ans=0.1 2023-06-23 00:24:01,294 INFO [train.py:996] (1/4) Epoch 8, batch 12400, loss[loss=0.2826, simple_loss=0.3327, pruned_loss=0.1162, over 21272.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3101, pruned_loss=0.07805, over 4269168.17 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:25:07,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.565e+02 7.015e+02 1.038e+03 2.241e+03, threshold=1.403e+03, percent-clipped=10.0 2023-06-23 00:25:45,044 INFO [train.py:996] (1/4) Epoch 8, batch 12450, loss[loss=0.3066, simple_loss=0.363, pruned_loss=0.1251, over 21317.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3143, pruned_loss=0.08137, over 4276544.38 frames. ], batch size: 159, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:25:54,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355472.0, ans=0.1 2023-06-23 00:27:16,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1355712.0, ans=0.0 2023-06-23 00:27:27,325 INFO [train.py:996] (1/4) Epoch 8, batch 12500, loss[loss=0.2884, simple_loss=0.3909, pruned_loss=0.09296, over 21322.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.323, pruned_loss=0.08464, over 4271779.35 frames. ], batch size: 549, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:28:08,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-23 00:28:12,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1355892.0, ans=0.0 2023-06-23 00:28:23,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1355892.0, ans=0.2 2023-06-23 00:28:44,179 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.947e+02 7.092e+02 9.787e+02 2.648e+03, threshold=1.418e+03, percent-clipped=11.0 2023-06-23 00:29:12,290 INFO [train.py:996] (1/4) Epoch 8, batch 12550, loss[loss=0.2536, simple_loss=0.342, pruned_loss=0.08263, over 21645.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3265, pruned_loss=0.08668, over 4275014.17 frames. 
], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:29:23,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1356072.0, ans=22.5 2023-06-23 00:30:17,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1356192.0, ans=0.125 2023-06-23 00:30:29,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-23 00:30:43,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1356312.0, ans=0.0 2023-06-23 00:30:58,487 INFO [train.py:996] (1/4) Epoch 8, batch 12600, loss[loss=0.2011, simple_loss=0.2897, pruned_loss=0.05623, over 21722.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.325, pruned_loss=0.08437, over 4273144.88 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:31:08,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1356372.0, ans=0.125 2023-06-23 00:31:26,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1356432.0, ans=0.125 2023-06-23 00:31:31,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1356432.0, ans=0.0 2023-06-23 00:32:06,804 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.323e+02 5.935e+02 8.611e+02 2.067e+03, threshold=1.187e+03, percent-clipped=5.0 2023-06-23 00:32:18,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1356612.0, ans=0.125 2023-06-23 00:32:36,780 INFO [train.py:996] (1/4) Epoch 8, batch 12650, loss[loss=0.234, simple_loss=0.3032, pruned_loss=0.08237, over 21838.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3181, pruned_loss=0.08094, over 4280048.49 frames. ], batch size: 282, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:33:06,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1356732.0, ans=0.125 2023-06-23 00:33:56,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1356912.0, ans=0.125 2023-06-23 00:34:13,538 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-23 00:34:21,061 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-23 00:34:21,587 INFO [train.py:996] (1/4) Epoch 8, batch 12700, loss[loss=0.2658, simple_loss=0.3361, pruned_loss=0.09779, over 21723.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3183, pruned_loss=0.08325, over 4283636.93 frames. 
], batch size: 351, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:34:32,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1356972.0, ans=0.125 2023-06-23 00:34:45,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1357032.0, ans=0.0 2023-06-23 00:35:15,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1357092.0, ans=0.0 2023-06-23 00:35:18,347 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:35:18,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1357152.0, ans=0.125 2023-06-23 00:35:21,427 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:35:25,459 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.987e+02 4.487e+02 5.893e+02 8.124e+02 1.594e+03, threshold=1.179e+03, percent-clipped=3.0 2023-06-23 00:35:31,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-23 00:35:59,907 INFO [train.py:996] (1/4) Epoch 8, batch 12750, loss[loss=0.1986, simple_loss=0.2827, pruned_loss=0.05728, over 21476.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3203, pruned_loss=0.08349, over 4279995.27 frames. ], batch size: 212, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:36:53,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1357392.0, ans=0.125 2023-06-23 00:37:42,799 INFO [train.py:996] (1/4) Epoch 8, batch 12800, loss[loss=0.2565, simple_loss=0.3202, pruned_loss=0.09643, over 21405.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3206, pruned_loss=0.08483, over 4282828.19 frames. ], batch size: 548, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:38:06,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-23 00:38:42,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.784e+02 6.128e+02 7.998e+02 1.838e+03, threshold=1.226e+03, percent-clipped=10.0 2023-06-23 00:38:42,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1357752.0, ans=0.0 2023-06-23 00:39:18,539 INFO [train.py:996] (1/4) Epoch 8, batch 12850, loss[loss=0.2254, simple_loss=0.3246, pruned_loss=0.06316, over 21684.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3214, pruned_loss=0.08575, over 4282143.47 frames. 
], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:40:28,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1358052.0, ans=0.0 2023-06-23 00:40:31,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1358052.0, ans=0.125 2023-06-23 00:40:59,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1358112.0, ans=0.05 2023-06-23 00:41:02,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1358172.0, ans=0.125 2023-06-23 00:41:03,896 INFO [train.py:996] (1/4) Epoch 8, batch 12900, loss[loss=0.1838, simple_loss=0.256, pruned_loss=0.05583, over 21778.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.32, pruned_loss=0.08238, over 4265682.33 frames. ], batch size: 124, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:41:50,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-23 00:42:11,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1358352.0, ans=0.125 2023-06-23 00:42:12,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.070e+02 5.539e+02 8.932e+02 2.008e+03, threshold=1.108e+03, percent-clipped=7.0 2023-06-23 00:42:22,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-23 00:42:32,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1358412.0, ans=0.2 2023-06-23 00:42:43,902 INFO [train.py:996] (1/4) Epoch 8, batch 12950, loss[loss=0.2781, simple_loss=0.3551, pruned_loss=0.1006, over 21371.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.32, pruned_loss=0.08065, over 4262548.35 frames. ], batch size: 549, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:42:44,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1358472.0, ans=0.125 2023-06-23 00:42:45,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-23 00:43:00,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1358532.0, ans=0.1 2023-06-23 00:43:04,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1358532.0, ans=0.125 2023-06-23 00:43:20,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1358532.0, ans=0.0 2023-06-23 00:43:47,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-23 00:43:51,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1358652.0, ans=0.0 2023-06-23 00:44:24,126 INFO [train.py:996] (1/4) Epoch 8, batch 13000, loss[loss=0.1796, simple_loss=0.2591, pruned_loss=0.05011, over 21818.00 frames. 
], tot_loss[loss=0.2415, simple_loss=0.3208, pruned_loss=0.08109, over 4260501.86 frames. ], batch size: 118, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:44:38,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1358832.0, ans=0.015 2023-06-23 00:44:38,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1358832.0, ans=0.0 2023-06-23 00:45:30,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-23 00:45:30,875 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 4.812e+02 8.002e+02 1.036e+03 2.306e+03, threshold=1.600e+03, percent-clipped=23.0 2023-06-23 00:46:00,840 INFO [train.py:996] (1/4) Epoch 8, batch 13050, loss[loss=0.2423, simple_loss=0.3039, pruned_loss=0.09036, over 21813.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3134, pruned_loss=0.07785, over 4267229.28 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:46:23,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1359132.0, ans=0.0 2023-06-23 00:46:33,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1359132.0, ans=0.2 2023-06-23 00:47:02,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1359252.0, ans=0.125 2023-06-23 00:47:06,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359252.0, ans=0.125 2023-06-23 00:47:39,018 INFO [train.py:996] (1/4) Epoch 8, batch 13100, loss[loss=0.225, simple_loss=0.3072, pruned_loss=0.07145, over 21831.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3146, pruned_loss=0.0788, over 4279200.53 frames. ], batch size: 282, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:48:27,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359492.0, ans=0.1 2023-06-23 00:48:53,091 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 4.152e+02 4.803e+02 6.425e+02 1.389e+03, threshold=9.605e+02, percent-clipped=0.0 2023-06-23 00:48:55,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1359552.0, ans=0.125 2023-06-23 00:49:23,656 INFO [train.py:996] (1/4) Epoch 8, batch 13150, loss[loss=0.2374, simple_loss=0.3111, pruned_loss=0.08187, over 21521.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3175, pruned_loss=0.08159, over 4284957.92 frames. ], batch size: 389, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:49:43,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1359732.0, ans=0.125 2023-06-23 00:51:00,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359912.0, ans=0.1 2023-06-23 00:51:08,224 INFO [train.py:996] (1/4) Epoch 8, batch 13200, loss[loss=0.2652, simple_loss=0.3349, pruned_loss=0.0978, over 21832.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3151, pruned_loss=0.08124, over 4283551.62 frames. 
], batch size: 124, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:51:12,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=22.5 2023-06-23 00:51:37,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1360032.0, ans=0.125 2023-06-23 00:52:03,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1360092.0, ans=0.1 2023-06-23 00:52:13,788 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.899e+02 4.724e+02 6.289e+02 8.620e+02 1.453e+03, threshold=1.258e+03, percent-clipped=16.0 2023-06-23 00:52:45,212 INFO [train.py:996] (1/4) Epoch 8, batch 13250, loss[loss=0.2384, simple_loss=0.3256, pruned_loss=0.07557, over 21868.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3167, pruned_loss=0.0832, over 4278748.28 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:53:06,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1360332.0, ans=0.0 2023-06-23 00:53:36,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1360392.0, ans=0.125 2023-06-23 00:53:47,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.58 vs. limit=10.0 2023-06-23 00:54:02,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1360452.0, ans=0.125 2023-06-23 00:54:15,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 00:54:31,052 INFO [train.py:996] (1/4) Epoch 8, batch 13300, loss[loss=0.263, simple_loss=0.3401, pruned_loss=0.09292, over 21730.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3205, pruned_loss=0.08212, over 4274155.06 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:54:32,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-23 00:54:33,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1360572.0, ans=0.125 2023-06-23 00:55:11,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1360692.0, ans=0.2 2023-06-23 00:55:41,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.712e+02 5.675e+02 7.796e+02 1.493e+03, threshold=1.135e+03, percent-clipped=5.0 2023-06-23 00:56:11,988 INFO [train.py:996] (1/4) Epoch 8, batch 13350, loss[loss=0.2877, simple_loss=0.3565, pruned_loss=0.1095, over 21546.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3255, pruned_loss=0.08578, over 4274642.10 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:56:22,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1360872.0, ans=0.125 2023-06-23 00:56:24,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.34 vs. 
limit=22.5 2023-06-23 00:56:46,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1360932.0, ans=0.0 2023-06-23 00:57:05,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1360992.0, ans=0.0 2023-06-23 00:57:57,203 INFO [train.py:996] (1/4) Epoch 8, batch 13400, loss[loss=0.3386, simple_loss=0.3875, pruned_loss=0.1449, over 21521.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3287, pruned_loss=0.08968, over 4276391.02 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:59:03,366 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:59:05,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.523e+02 5.899e+02 7.481e+02 1.405e+03, threshold=1.180e+03, percent-clipped=3.0 2023-06-23 00:59:07,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1361352.0, ans=0.125 2023-06-23 00:59:33,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1361412.0, ans=0.0 2023-06-23 00:59:36,216 INFO [train.py:996] (1/4) Epoch 8, batch 13450, loss[loss=0.27, simple_loss=0.3351, pruned_loss=0.1025, over 21633.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3299, pruned_loss=0.09189, over 4272208.24 frames. ], batch size: 441, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:00:41,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1361652.0, ans=0.0 2023-06-23 01:00:49,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1361652.0, ans=0.1 2023-06-23 01:01:13,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1361712.0, ans=0.0 2023-06-23 01:01:15,858 INFO [train.py:996] (1/4) Epoch 8, batch 13500, loss[loss=0.2195, simple_loss=0.2916, pruned_loss=0.07371, over 20687.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3203, pruned_loss=0.08829, over 4261990.86 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:01:45,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-23 01:02:11,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1361892.0, ans=0.125 2023-06-23 01:02:34,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1361952.0, ans=0.125 2023-06-23 01:02:35,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.391e+02 6.972e+02 1.115e+03 2.286e+03, threshold=1.394e+03, percent-clipped=24.0 2023-06-23 01:02:57,276 INFO [train.py:996] (1/4) Epoch 8, batch 13550, loss[loss=0.2446, simple_loss=0.3736, pruned_loss=0.05784, over 19809.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3237, pruned_loss=0.08715, over 4260191.62 frames. 
], batch size: 703, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:03:38,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1362192.0, ans=0.5 2023-06-23 01:04:17,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-23 01:04:31,421 INFO [train.py:996] (1/4) Epoch 8, batch 13600, loss[loss=0.2113, simple_loss=0.2941, pruned_loss=0.06419, over 21798.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3241, pruned_loss=0.08711, over 4260842.84 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:05:47,093 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.391e+02 6.199e+02 8.558e+02 2.268e+03, threshold=1.240e+03, percent-clipped=7.0 2023-06-23 01:05:50,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1362552.0, ans=0.125 2023-06-23 01:05:55,574 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:06:09,030 INFO [train.py:996] (1/4) Epoch 8, batch 13650, loss[loss=0.2105, simple_loss=0.281, pruned_loss=0.07005, over 21354.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3172, pruned_loss=0.0833, over 4267261.71 frames. ], batch size: 131, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:06:09,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-23 01:06:18,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1362672.0, ans=0.0 2023-06-23 01:06:49,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1362792.0, ans=0.0 2023-06-23 01:07:24,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362852.0, ans=0.1 2023-06-23 01:07:43,707 INFO [train.py:996] (1/4) Epoch 8, batch 13700, loss[loss=0.2132, simple_loss=0.2847, pruned_loss=0.07083, over 21689.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3134, pruned_loss=0.08324, over 4266865.67 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:07:53,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362972.0, ans=0.1 2023-06-23 01:08:40,999 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-23 01:08:57,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1363152.0, ans=0.09899494936611666 2023-06-23 01:09:00,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.296e+02 7.505e+02 1.141e+03 2.334e+03, threshold=1.501e+03, percent-clipped=22.0 2023-06-23 01:09:32,564 INFO [train.py:996] (1/4) Epoch 8, batch 13750, loss[loss=0.2571, simple_loss=0.3361, pruned_loss=0.08907, over 21570.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.31, pruned_loss=0.08236, over 4265262.91 frames. 
], batch size: 441, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:11:19,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363572.0, ans=0.1 2023-06-23 01:11:20,769 INFO [train.py:996] (1/4) Epoch 8, batch 13800, loss[loss=0.2433, simple_loss=0.3319, pruned_loss=0.07737, over 21489.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3154, pruned_loss=0.08093, over 4274733.65 frames. ], batch size: 211, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:11:27,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363572.0, ans=0.1 2023-06-23 01:11:36,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1363632.0, ans=15.0 2023-06-23 01:11:43,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1363632.0, ans=0.125 2023-06-23 01:12:37,259 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.245e+02 4.845e+02 7.419e+02 1.036e+03 2.562e+03, threshold=1.484e+03, percent-clipped=7.0 2023-06-23 01:12:38,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-23 01:12:46,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1363812.0, ans=0.125 2023-06-23 01:12:47,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1363812.0, ans=0.125 2023-06-23 01:12:54,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1363812.0, ans=0.2 2023-06-23 01:13:00,675 INFO [train.py:996] (1/4) Epoch 8, batch 13850, loss[loss=0.2711, simple_loss=0.3543, pruned_loss=0.09396, over 21767.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3238, pruned_loss=0.08312, over 4273739.35 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:13:06,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-23 01:13:23,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1363932.0, ans=0.035 2023-06-23 01:13:39,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1363992.0, ans=0.125 2023-06-23 01:14:39,127 INFO [train.py:996] (1/4) Epoch 8, batch 13900, loss[loss=0.2307, simple_loss=0.3015, pruned_loss=0.07992, over 21374.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.329, pruned_loss=0.08764, over 4273631.20 frames. ], batch size: 211, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:14:42,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1364172.0, ans=0.125 2023-06-23 01:14:58,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1364232.0, ans=0.125 2023-06-23 01:15:02,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=15.0 2023-06-23 01:15:29,245 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.04 vs. limit=22.5 2023-06-23 01:15:41,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1364352.0, ans=0.125 2023-06-23 01:15:55,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.266e+02 5.471e+02 7.768e+02 2.129e+03, threshold=1.094e+03, percent-clipped=1.0 2023-06-23 01:16:17,089 INFO [train.py:996] (1/4) Epoch 8, batch 13950, loss[loss=0.2529, simple_loss=0.325, pruned_loss=0.09042, over 21282.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3288, pruned_loss=0.08901, over 4281464.80 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:16:17,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1364472.0, ans=0.125 2023-06-23 01:16:40,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1364532.0, ans=0.2 2023-06-23 01:16:43,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1364532.0, ans=0.0 2023-06-23 01:17:09,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1364592.0, ans=0.0 2023-06-23 01:17:28,244 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:17:37,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1364712.0, ans=0.125 2023-06-23 01:17:53,967 INFO [train.py:996] (1/4) Epoch 8, batch 14000, loss[loss=0.2156, simple_loss=0.3, pruned_loss=0.06556, over 21327.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3252, pruned_loss=0.08648, over 4279313.40 frames. ], batch size: 144, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:18:12,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=15.0 2023-06-23 01:18:23,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.85 vs. limit=6.0 2023-06-23 01:18:25,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1364832.0, ans=0.125 2023-06-23 01:19:08,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1364952.0, ans=0.125 2023-06-23 01:19:08,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 4.329e+02 5.834e+02 8.040e+02 1.947e+03, threshold=1.167e+03, percent-clipped=14.0 2023-06-23 01:19:30,103 INFO [train.py:996] (1/4) Epoch 8, batch 14050, loss[loss=0.2312, simple_loss=0.2916, pruned_loss=0.08545, over 21841.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3192, pruned_loss=0.08213, over 4286857.79 frames. 
], batch size: 98, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:19:38,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1365072.0, ans=0.2 2023-06-23 01:19:59,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1365132.0, ans=0.125 2023-06-23 01:21:12,118 INFO [train.py:996] (1/4) Epoch 8, batch 14100, loss[loss=0.2204, simple_loss=0.2808, pruned_loss=0.07995, over 21844.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3137, pruned_loss=0.0823, over 4283119.18 frames. ], batch size: 98, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:22:24,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 4.771e+02 6.447e+02 8.696e+02 1.773e+03, threshold=1.289e+03, percent-clipped=8.0 2023-06-23 01:22:43,455 INFO [train.py:996] (1/4) Epoch 8, batch 14150, loss[loss=0.238, simple_loss=0.3134, pruned_loss=0.08126, over 21812.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3155, pruned_loss=0.08166, over 4286979.23 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:23:06,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-23 01:23:11,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-23 01:23:26,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1365792.0, ans=0.125 2023-06-23 01:23:41,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1365792.0, ans=0.1 2023-06-23 01:24:15,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1365912.0, ans=0.0 2023-06-23 01:24:19,902 INFO [train.py:996] (1/4) Epoch 8, batch 14200, loss[loss=0.2518, simple_loss=0.3098, pruned_loss=0.09688, over 21772.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3153, pruned_loss=0.08166, over 4288284.59 frames. ], batch size: 118, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:25:31,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1366152.0, ans=0.125 2023-06-23 01:25:32,013 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.024e+02 4.323e+02 5.337e+02 8.028e+02 2.442e+03, threshold=1.067e+03, percent-clipped=5.0 2023-06-23 01:25:48,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366212.0, ans=0.1 2023-06-23 01:25:57,622 INFO [train.py:996] (1/4) Epoch 8, batch 14250, loss[loss=0.2002, simple_loss=0.2636, pruned_loss=0.06844, over 21407.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3103, pruned_loss=0.08147, over 4285809.87 frames. ], batch size: 195, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:26:35,982 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=10.0 2023-06-23 01:27:16,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1366452.0, ans=0.125 2023-06-23 01:27:22,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1366512.0, ans=0.125 2023-06-23 01:27:35,839 INFO [train.py:996] (1/4) Epoch 8, batch 14300, loss[loss=0.2898, simple_loss=0.3897, pruned_loss=0.09496, over 21778.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3119, pruned_loss=0.0802, over 4274693.46 frames. ], batch size: 332, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:28:06,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1366632.0, ans=0.125 2023-06-23 01:28:08,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366632.0, ans=0.1 2023-06-23 01:28:54,524 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.415e+02 6.422e+02 1.030e+03 2.040e+03, threshold=1.284e+03, percent-clipped=23.0 2023-06-23 01:29:07,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1366812.0, ans=0.125 2023-06-23 01:29:13,131 INFO [train.py:996] (1/4) Epoch 8, batch 14350, loss[loss=0.274, simple_loss=0.3474, pruned_loss=0.1003, over 21090.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3166, pruned_loss=0.08023, over 4268669.99 frames. ], batch size: 608, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:29:28,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366872.0, ans=0.1 2023-06-23 01:29:39,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1366932.0, ans=0.0 2023-06-23 01:30:47,777 INFO [train.py:996] (1/4) Epoch 8, batch 14400, loss[loss=0.243, simple_loss=0.3123, pruned_loss=0.08682, over 20050.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3158, pruned_loss=0.08176, over 4274518.27 frames. ], batch size: 704, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:31:35,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1367292.0, ans=0.125 2023-06-23 01:31:56,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.154e+02 4.970e+02 6.969e+02 1.897e+03, threshold=9.939e+02, percent-clipped=6.0 2023-06-23 01:32:08,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1367412.0, ans=0.125 2023-06-23 01:32:16,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1367412.0, ans=0.0 2023-06-23 01:32:16,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1367412.0, ans=0.125 2023-06-23 01:32:18,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1367472.0, ans=0.125 2023-06-23 01:32:19,344 INFO [train.py:996] (1/4) Epoch 8, batch 14450, loss[loss=0.1876, simple_loss=0.2556, pruned_loss=0.05979, over 21514.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3103, pruned_loss=0.08159, over 4270555.84 frames. 
], batch size: 195, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:33:00,771 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:34:04,047 INFO [train.py:996] (1/4) Epoch 8, batch 14500, loss[loss=0.2067, simple_loss=0.2686, pruned_loss=0.07241, over 21656.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3059, pruned_loss=0.08163, over 4260867.15 frames. ], batch size: 282, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:35:21,293 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.784e+02 6.137e+02 8.722e+02 1.642e+03, threshold=1.227e+03, percent-clipped=18.0 2023-06-23 01:35:45,111 INFO [train.py:996] (1/4) Epoch 8, batch 14550, loss[loss=0.2362, simple_loss=0.3014, pruned_loss=0.08551, over 20116.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3088, pruned_loss=0.0828, over 4263317.32 frames. ], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:36:15,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1368132.0, ans=0.125 2023-06-23 01:37:09,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1368312.0, ans=0.0 2023-06-23 01:37:15,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1368312.0, ans=0.1 2023-06-23 01:37:23,764 INFO [train.py:996] (1/4) Epoch 8, batch 14600, loss[loss=0.2597, simple_loss=0.3379, pruned_loss=0.09077, over 21261.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3163, pruned_loss=0.08637, over 4271383.02 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:37:41,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1368432.0, ans=0.2 2023-06-23 01:38:01,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1368432.0, ans=0.0 2023-06-23 01:38:06,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1368492.0, ans=0.1 2023-06-23 01:38:14,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1368492.0, ans=0.1 2023-06-23 01:38:34,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1368552.0, ans=0.125 2023-06-23 01:38:38,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.393e+02 5.466e+02 7.760e+02 1.223e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-23 01:39:02,828 INFO [train.py:996] (1/4) Epoch 8, batch 14650, loss[loss=0.2306, simple_loss=0.3152, pruned_loss=0.07298, over 21803.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3202, pruned_loss=0.08613, over 4275115.24 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:39:16,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1368672.0, ans=0.0 2023-06-23 01:40:41,949 INFO [train.py:996] (1/4) Epoch 8, batch 14700, loss[loss=0.2019, simple_loss=0.2736, pruned_loss=0.06506, over 21421.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3159, pruned_loss=0.08059, over 4274120.48 frames. 
], batch size: 131, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:40:49,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-06-23 01:41:15,436 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2023-06-23 01:41:16,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1369032.0, ans=0.5 2023-06-23 01:41:19,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1369092.0, ans=0.125 2023-06-23 01:41:57,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1369152.0, ans=0.0 2023-06-23 01:42:00,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 5.470e+02 7.461e+02 1.083e+03 1.858e+03, threshold=1.492e+03, percent-clipped=24.0 2023-06-23 01:42:04,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1369212.0, ans=0.2 2023-06-23 01:42:06,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369212.0, ans=0.1 2023-06-23 01:42:17,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1369272.0, ans=0.0 2023-06-23 01:42:18,595 INFO [train.py:996] (1/4) Epoch 8, batch 14750, loss[loss=0.2694, simple_loss=0.3556, pruned_loss=0.09164, over 21580.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3189, pruned_loss=0.0821, over 4278761.51 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:42:30,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1369272.0, ans=0.0 2023-06-23 01:42:31,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-23 01:43:34,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1369452.0, ans=0.0 2023-06-23 01:44:00,047 INFO [train.py:996] (1/4) Epoch 8, batch 14800, loss[loss=0.2695, simple_loss=0.3303, pruned_loss=0.1043, over 21370.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3273, pruned_loss=0.0864, over 4274571.97 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:44:54,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369692.0, ans=0.1 2023-06-23 01:45:18,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.734e+02 8.027e+02 1.112e+03 2.200e+03, threshold=1.605e+03, percent-clipped=5.0 2023-06-23 01:45:40,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1369872.0, ans=10.0 2023-06-23 01:45:41,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-23 01:45:41,507 INFO [train.py:996] (1/4) Epoch 8, batch 14850, loss[loss=0.2816, simple_loss=0.3604, pruned_loss=0.1014, over 19947.00 frames. 
], tot_loss[loss=0.2469, simple_loss=0.3212, pruned_loss=0.08634, over 4262859.31 frames. ], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:46:00,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1369872.0, ans=0.0 2023-06-23 01:46:01,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1369932.0, ans=0.1 2023-06-23 01:46:04,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1369932.0, ans=0.125 2023-06-23 01:46:31,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1369992.0, ans=0.05 2023-06-23 01:46:37,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1369992.0, ans=10.0 2023-06-23 01:46:50,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1370052.0, ans=0.125 2023-06-23 01:46:57,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1370052.0, ans=0.125 2023-06-23 01:47:23,714 INFO [train.py:996] (1/4) Epoch 8, batch 14900, loss[loss=0.3352, simple_loss=0.3962, pruned_loss=0.1371, over 21372.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3235, pruned_loss=0.08819, over 4266443.13 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:47:24,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-23 01:47:35,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1370172.0, ans=0.125 2023-06-23 01:48:21,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1370292.0, ans=0.125 2023-06-23 01:48:41,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1370352.0, ans=0.2 2023-06-23 01:48:41,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.642e+02 5.851e+02 8.319e+02 1.860e+03, threshold=1.170e+03, percent-clipped=1.0 2023-06-23 01:48:48,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-23 01:49:04,630 INFO [train.py:996] (1/4) Epoch 8, batch 14950, loss[loss=0.2369, simple_loss=0.319, pruned_loss=0.07734, over 21719.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3245, pruned_loss=0.08803, over 4260165.57 frames. 
], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:49:08,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1370472.0, ans=0.125 2023-06-23 01:49:25,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1370532.0, ans=0.0 2023-06-23 01:49:26,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1370532.0, ans=0.125 2023-06-23 01:49:27,217 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-23 01:49:49,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-23 01:49:51,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1370592.0, ans=0.125 2023-06-23 01:50:08,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1370652.0, ans=0.125 2023-06-23 01:50:10,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-23 01:50:16,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1370652.0, ans=0.0 2023-06-23 01:50:26,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1370712.0, ans=0.1 2023-06-23 01:50:40,423 INFO [train.py:996] (1/4) Epoch 8, batch 15000, loss[loss=0.2432, simple_loss=0.305, pruned_loss=0.09072, over 21334.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3275, pruned_loss=0.08993, over 4266457.91 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:50:40,424 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 01:50:53,392 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8600, 3.3067, 3.5398, 3.3039], device='cuda:1') 2023-06-23 01:51:00,726 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2539, simple_loss=0.3505, pruned_loss=0.07863, over 1796401.00 frames. 2023-06-23 01:51:00,726 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 01:51:14,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1370772.0, ans=0.05 2023-06-23 01:51:41,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370892.0, ans=0.1 2023-06-23 01:51:47,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1370892.0, ans=0.125 2023-06-23 01:51:51,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-23 01:51:57,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. 
limit=10.0 2023-06-23 01:52:17,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1370952.0, ans=0.125 2023-06-23 01:52:21,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.233e+02 4.412e+02 5.546e+02 7.158e+02 1.443e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-23 01:52:43,845 INFO [train.py:996] (1/4) Epoch 8, batch 15050, loss[loss=0.2634, simple_loss=0.3659, pruned_loss=0.08046, over 21680.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.33, pruned_loss=0.09119, over 4265152.16 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:53:47,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1371252.0, ans=0.2 2023-06-23 01:54:10,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371312.0, ans=0.1 2023-06-23 01:54:17,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-23 01:54:18,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371312.0, ans=0.125 2023-06-23 01:54:23,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1371372.0, ans=0.0 2023-06-23 01:54:29,750 INFO [train.py:996] (1/4) Epoch 8, batch 15100, loss[loss=0.2557, simple_loss=0.3297, pruned_loss=0.09089, over 21701.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3335, pruned_loss=0.09112, over 4261100.31 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:54:42,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 01:54:51,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1371432.0, ans=0.2 2023-06-23 01:55:18,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1371492.0, ans=0.0 2023-06-23 01:55:50,281 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.664e+02 5.163e+02 6.983e+02 1.038e+03 2.377e+03, threshold=1.397e+03, percent-clipped=16.0 2023-06-23 01:55:56,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1371612.0, ans=0.0 2023-06-23 01:55:59,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371612.0, ans=0.1 2023-06-23 01:56:13,547 INFO [train.py:996] (1/4) Epoch 8, batch 15150, loss[loss=0.256, simple_loss=0.3061, pruned_loss=0.103, over 21477.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3299, pruned_loss=0.09114, over 4257810.18 frames. 
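
The "Computing validation loss" block above (Epoch 8, batch 15000) is a periodic evaluation over the fixed dev set, followed by a report of the CUDA memory high-water mark ("Maximum memory allocated so far is 24453MB"). A rough sketch of that pattern is below; run_validation, dev_loader and compute_loss are placeholder names, and only the torch calls are real APIs.

    import torch

    def run_validation(model, dev_loader, compute_loss, device):
        """Hypothetical mid-training validation pass: frame-weighted average
        loss over the dev set, then the peak CUDA memory seen so far."""
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_loader:
                loss, num_frames = compute_loss(model, batch)   # placeholder
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={tot_loss / tot_frames:.4f}, "
              f"over {tot_frames:.2f} frames.")
        print(f"Maximum memory allocated so far is {max_mb}MB")
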
], batch size: 441, lr: 3.73e-03, grad_scale: 4.0 2023-06-23 01:57:04,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1371792.0, ans=0.0 2023-06-23 01:57:05,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1371852.0, ans=0.125 2023-06-23 01:57:31,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1371912.0, ans=0.0 2023-06-23 01:57:34,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371912.0, ans=0.1 2023-06-23 01:57:48,581 INFO [train.py:996] (1/4) Epoch 8, batch 15200, loss[loss=0.2511, simple_loss=0.3401, pruned_loss=0.08105, over 20685.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3197, pruned_loss=0.08682, over 4260118.49 frames. ], batch size: 607, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:58:39,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372092.0, ans=0.1 2023-06-23 01:59:00,846 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:59:10,130 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 4.454e+02 6.348e+02 1.099e+03 2.249e+03, threshold=1.270e+03, percent-clipped=13.0 2023-06-23 01:59:29,130 INFO [train.py:996] (1/4) Epoch 8, batch 15250, loss[loss=0.2533, simple_loss=0.3235, pruned_loss=0.09157, over 20806.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3166, pruned_loss=0.08538, over 4258398.88 frames. ], batch size: 611, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:59:34,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1372272.0, ans=0.125 2023-06-23 02:00:31,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1372452.0, ans=0.125 2023-06-23 02:00:31,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1372452.0, ans=0.2 2023-06-23 02:01:09,087 INFO [train.py:996] (1/4) Epoch 8, batch 15300, loss[loss=0.3026, simple_loss=0.3566, pruned_loss=0.1243, over 21397.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3191, pruned_loss=0.08789, over 4258902.45 frames. 
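
The loss triples printed with each batch are consistent with the total being a weighted sum of the simple and pruned transducer losses, using the simple_loss_scale=0.5 from the training configuration and the pruned term at full weight. This is an inference from the logged numbers, not a quote of the training code, but it checks out exactly:

    # Epoch 8, batch 15300 (logged just above): loss=0.3026,
    # simple_loss=0.3566, pruned_loss=0.1243, with simple_loss_scale=0.5.
    simple_loss_scale = 0.5
    simple_loss, pruned_loss = 0.3566, 0.1243
    print(round(simple_loss_scale * simple_loss + pruned_loss, 4))   # 0.3026

    # The running tot_loss obeys the same relation:
    # 0.5 * 0.3191 + 0.08789 = 0.2474, the logged tot_loss for that batch.
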
], batch size: 471, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:01:36,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1372632.0, ans=0.2 2023-06-23 02:01:47,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372692.0, ans=0.1 2023-06-23 02:02:01,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1372692.0, ans=0.125 2023-06-23 02:02:34,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 4.847e+02 6.325e+02 8.051e+02 1.474e+03, threshold=1.265e+03, percent-clipped=2.0 2023-06-23 02:02:42,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1372812.0, ans=0.0 2023-06-23 02:02:48,591 INFO [train.py:996] (1/4) Epoch 8, batch 15350, loss[loss=0.2603, simple_loss=0.3459, pruned_loss=0.08738, over 21709.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3253, pruned_loss=0.08953, over 4262190.28 frames. ], batch size: 351, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:03:07,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1372932.0, ans=0.125 2023-06-23 02:03:41,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-06-23 02:04:27,070 INFO [train.py:996] (1/4) Epoch 8, batch 15400, loss[loss=0.2258, simple_loss=0.2975, pruned_loss=0.07708, over 21845.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3263, pruned_loss=0.08821, over 4267153.88 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:05:07,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1373292.0, ans=0.125 2023-06-23 02:05:20,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1373352.0, ans=0.0 2023-06-23 02:05:35,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1373352.0, ans=0.125 2023-06-23 02:05:42,575 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.574e+02 6.918e+02 9.277e+02 1.952e+03, threshold=1.384e+03, percent-clipped=9.0 2023-06-23 02:06:06,508 INFO [train.py:996] (1/4) Epoch 8, batch 15450, loss[loss=0.2183, simple_loss=0.2939, pruned_loss=0.07133, over 21495.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3228, pruned_loss=0.08694, over 4261567.19 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:06:19,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.79 vs. 
limit=22.5 2023-06-23 02:06:42,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1373592.0, ans=0.125 2023-06-23 02:07:09,214 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:07:12,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1373652.0, ans=0.125 2023-06-23 02:07:47,638 INFO [train.py:996] (1/4) Epoch 8, batch 15500, loss[loss=0.2832, simple_loss=0.3567, pruned_loss=0.1049, over 21526.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3237, pruned_loss=0.08685, over 4270501.49 frames. ], batch size: 414, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:07:49,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1373772.0, ans=0.125 2023-06-23 02:07:59,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1373772.0, ans=0.0 2023-06-23 02:08:34,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1373892.0, ans=0.125 2023-06-23 02:08:40,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1373892.0, ans=10.0 2023-06-23 02:08:45,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-23 02:08:54,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1373952.0, ans=10.0 2023-06-23 02:09:00,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1373952.0, ans=0.0 2023-06-23 02:09:14,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.796e+02 4.547e+02 5.557e+02 7.236e+02 1.680e+03, threshold=1.111e+03, percent-clipped=1.0 2023-06-23 02:09:28,953 INFO [train.py:996] (1/4) Epoch 8, batch 15550, loss[loss=0.2227, simple_loss=0.311, pruned_loss=0.06715, over 21733.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3212, pruned_loss=0.08442, over 4263816.75 frames. ], batch size: 332, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:10:10,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1374192.0, ans=0.0 2023-06-23 02:10:15,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1374192.0, ans=0.125 2023-06-23 02:10:18,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1374192.0, ans=0.125 2023-06-23 02:10:35,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-23 02:10:49,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1374252.0, ans=0.125 2023-06-23 02:10:59,412 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. 
limit=10.0 2023-06-23 02:11:07,706 INFO [train.py:996] (1/4) Epoch 8, batch 15600, loss[loss=0.2882, simple_loss=0.3225, pruned_loss=0.1269, over 21342.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3124, pruned_loss=0.08259, over 4259118.85 frames. ], batch size: 508, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:11:12,894 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:11:51,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1374492.0, ans=22.5 2023-06-23 02:12:15,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1374552.0, ans=0.0 2023-06-23 02:12:32,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.200e+02 4.456e+02 5.982e+02 8.275e+02 1.817e+03, threshold=1.196e+03, percent-clipped=9.0 2023-06-23 02:12:46,668 INFO [train.py:996] (1/4) Epoch 8, batch 15650, loss[loss=0.2088, simple_loss=0.2743, pruned_loss=0.07164, over 21541.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3107, pruned_loss=0.08183, over 4265779.97 frames. ], batch size: 231, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:13:35,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1374792.0, ans=0.1 2023-06-23 02:14:18,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1374912.0, ans=0.125 2023-06-23 02:14:31,075 INFO [train.py:996] (1/4) Epoch 8, batch 15700, loss[loss=0.214, simple_loss=0.2794, pruned_loss=0.07431, over 21242.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3073, pruned_loss=0.08047, over 4263805.76 frames. ], batch size: 144, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:14:38,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1374972.0, ans=0.0 2023-06-23 02:15:37,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1375152.0, ans=0.2 2023-06-23 02:15:42,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375152.0, ans=0.1 2023-06-23 02:15:48,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1375212.0, ans=0.125 2023-06-23 02:15:50,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.000e+02 4.439e+02 5.550e+02 6.958e+02 1.356e+03, threshold=1.110e+03, percent-clipped=1.0 2023-06-23 02:15:56,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1375212.0, ans=0.0 2023-06-23 02:16:04,719 INFO [train.py:996] (1/4) Epoch 8, batch 15750, loss[loss=0.1974, simple_loss=0.2683, pruned_loss=0.06322, over 21637.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3038, pruned_loss=0.081, over 4266576.64 frames. ], batch size: 264, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:17:31,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1375512.0, ans=0.125 2023-06-23 02:17:49,038 INFO [train.py:996] (1/4) Epoch 8, batch 15800, loss[loss=0.2413, simple_loss=0.3174, pruned_loss=0.0826, over 16534.00 frames. 
], tot_loss[loss=0.2307, simple_loss=0.3006, pruned_loss=0.08043, over 4256207.60 frames. ], batch size: 60, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:18:20,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1375632.0, ans=0.125 2023-06-23 02:19:00,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1375752.0, ans=0.125 2023-06-23 02:19:04,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.739e+02 6.417e+02 1.005e+03 2.218e+03, threshold=1.283e+03, percent-clipped=19.0 2023-06-23 02:19:14,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375812.0, ans=0.1 2023-06-23 02:19:24,019 INFO [train.py:996] (1/4) Epoch 8, batch 15850, loss[loss=0.2542, simple_loss=0.3445, pruned_loss=0.08189, over 16173.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3047, pruned_loss=0.08245, over 4255051.35 frames. ], batch size: 60, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:19:46,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1375932.0, ans=0.2 2023-06-23 02:20:59,237 INFO [train.py:996] (1/4) Epoch 8, batch 15900, loss[loss=0.2175, simple_loss=0.293, pruned_loss=0.07105, over 21763.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.305, pruned_loss=0.08359, over 4260037.37 frames. ], batch size: 124, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:21:36,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1376232.0, ans=0.0 2023-06-23 02:22:05,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-23 02:22:24,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 4.482e+02 6.671e+02 9.137e+02 1.402e+03, threshold=1.334e+03, percent-clipped=2.0 2023-06-23 02:22:38,533 INFO [train.py:996] (1/4) Epoch 8, batch 15950, loss[loss=0.2676, simple_loss=0.3452, pruned_loss=0.09496, over 21749.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3059, pruned_loss=0.08177, over 4251804.26 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:22:42,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1376472.0, ans=0.125 2023-06-23 02:23:26,381 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:23:51,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1376652.0, ans=0.125 2023-06-23 02:24:08,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-23 02:24:13,834 INFO [train.py:996] (1/4) Epoch 8, batch 16000, loss[loss=0.216, simple_loss=0.3139, pruned_loss=0.05905, over 21667.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3058, pruned_loss=0.07905, over 4263806.59 frames. 
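
The per-batch learning rate decays very slowly here (3.73e-03 above, 3.72e-03 from batch 15800 onward), which matches a schedule that shrinks with both the global batch index and the epoch on the scale of the configured lr_batches=7500 and lr_epochs=1.5. The Eden-style rule sketched below is an assumption about its form, and the batch index and epoch fed to it are rough guesses, but it lands in the right range:

    def eden_style_lr(base_lr, batch_idx, epoch,
                      lr_batches=7500.0, lr_epochs=1.5):
        """Hypothetical Eden-style decay in both batch index and epoch."""
        batch_factor = ((batch_idx ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor

    # base_lr=0.045 from the config; a global batch index around 2e5 during
    # epoch 8 gives a value in the same ballpark as the logged lr.
    print(f"{eden_style_lr(0.045, 205_000, 7.8):.2e}")   # ~3.7e-03
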
], batch size: 441, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:24:53,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1376832.0, ans=0.125 2023-06-23 02:25:28,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1376952.0, ans=0.2 2023-06-23 02:25:33,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1376952.0, ans=0.0 2023-06-23 02:25:39,168 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.869e+02 4.296e+02 5.678e+02 9.703e+02 1.741e+03, threshold=1.136e+03, percent-clipped=11.0 2023-06-23 02:25:41,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377012.0, ans=0.1 2023-06-23 02:25:53,876 INFO [train.py:996] (1/4) Epoch 8, batch 16050, loss[loss=0.2418, simple_loss=0.3347, pruned_loss=0.07444, over 21647.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3061, pruned_loss=0.07697, over 4265480.65 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:25:54,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1377072.0, ans=0.125 2023-06-23 02:25:58,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1377072.0, ans=0.2 2023-06-23 02:25:58,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1377072.0, ans=0.2 2023-06-23 02:27:27,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1377312.0, ans=0.0 2023-06-23 02:27:32,659 INFO [train.py:996] (1/4) Epoch 8, batch 16100, loss[loss=0.2459, simple_loss=0.3223, pruned_loss=0.08479, over 21412.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3095, pruned_loss=0.07785, over 4272126.45 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:28:58,835 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.010e+02 6.146e+02 8.242e+02 2.299e+03, threshold=1.229e+03, percent-clipped=9.0 2023-06-23 02:29:12,571 INFO [train.py:996] (1/4) Epoch 8, batch 16150, loss[loss=0.2316, simple_loss=0.3145, pruned_loss=0.07438, over 21539.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3104, pruned_loss=0.08119, over 4268071.54 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:29:17,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.24 vs. limit=15.0 2023-06-23 02:29:19,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1377672.0, ans=0.125 2023-06-23 02:29:21,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=22.5 2023-06-23 02:29:23,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1377672.0, ans=0.0 2023-06-23 02:29:28,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=6.0 2023-06-23 02:30:05,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1377792.0, ans=0.125 2023-06-23 02:30:50,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377912.0, ans=0.1 2023-06-23 02:30:53,201 INFO [train.py:996] (1/4) Epoch 8, batch 16200, loss[loss=0.2422, simple_loss=0.3346, pruned_loss=0.07492, over 20109.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3148, pruned_loss=0.08251, over 4274828.93 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:30:58,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377972.0, ans=0.1 2023-06-23 02:31:29,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1378092.0, ans=0.0 2023-06-23 02:32:14,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.981e+02 6.694e+02 1.065e+03 1.723e+03, threshold=1.339e+03, percent-clipped=15.0 2023-06-23 02:32:25,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1378212.0, ans=0.5 2023-06-23 02:32:25,152 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:32:27,733 INFO [train.py:996] (1/4) Epoch 8, batch 16250, loss[loss=0.1729, simple_loss=0.2414, pruned_loss=0.05219, over 21260.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3143, pruned_loss=0.08183, over 4277918.82 frames. ], batch size: 159, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:32:41,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1378272.0, ans=0.5 2023-06-23 02:33:18,452 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:34:06,554 INFO [train.py:996] (1/4) Epoch 8, batch 16300, loss[loss=0.1984, simple_loss=0.2785, pruned_loss=0.05915, over 21701.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3089, pruned_loss=0.07757, over 4275452.94 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:35:04,408 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:35:09,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1378692.0, ans=0.0 2023-06-23 02:35:35,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.087e+02 4.179e+02 5.648e+02 7.276e+02 1.488e+03, threshold=1.130e+03, percent-clipped=3.0 2023-06-23 02:35:52,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1378872.0, ans=0.2 2023-06-23 02:35:53,669 INFO [train.py:996] (1/4) Epoch 8, batch 16350, loss[loss=0.2689, simple_loss=0.3394, pruned_loss=0.09918, over 21480.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3081, pruned_loss=0.0772, over 4281276.47 frames. 
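
The grad_scale reported with each batch summary moves only in powers of two (16.0 up to 32.0 at batch 16000, back to 16.0 by batch 16250, and 8.0/4.0 earlier in the epoch), which is the signature of dynamic loss scaling for fp16 training (use_fp16=True in the config): the scale is halved when scaled gradients overflow and grown again after a run of clean steps. A generic sketch with PyTorch's GradScaler, not the project's actual training loop:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_factor=2.0,
                                       backoff_factor=0.5, growth_interval=2000)

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():           # fp16 forward pass
            loss = compute_loss(model, batch)     # placeholder loss function
        scaler.scale(loss).backward()             # backward on the scaled loss
        scaler.step(optimizer)                    # skipped if grads overflowed
        scaler.update()                           # halve or grow the scale
        return scaler.get_scale()                 # the grad_scale seen in the log
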
], batch size: 131, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:36:45,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1378992.0, ans=0.2 2023-06-23 02:37:10,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-23 02:37:33,168 INFO [train.py:996] (1/4) Epoch 8, batch 16400, loss[loss=0.2497, simple_loss=0.3392, pruned_loss=0.08014, over 21028.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3149, pruned_loss=0.08044, over 4276426.08 frames. ], batch size: 608, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:37:36,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379172.0, ans=0.1 2023-06-23 02:37:37,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=22.5 2023-06-23 02:37:48,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1379232.0, ans=0.2 2023-06-23 02:37:57,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1379232.0, ans=0.125 2023-06-23 02:38:09,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1379232.0, ans=0.125 2023-06-23 02:38:19,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1379292.0, ans=0.125 2023-06-23 02:38:55,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.176e+02 5.676e+02 8.447e+02 1.118e+03 2.154e+03, threshold=1.689e+03, percent-clipped=24.0 2023-06-23 02:39:11,000 INFO [train.py:996] (1/4) Epoch 8, batch 16450, loss[loss=0.2374, simple_loss=0.2995, pruned_loss=0.08766, over 20000.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3159, pruned_loss=0.08258, over 4278702.28 frames. ], batch size: 702, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:39:32,902 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-23 02:39:46,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379532.0, ans=0.1 2023-06-23 02:39:50,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1379592.0, ans=0.125 2023-06-23 02:39:52,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1379592.0, ans=0.0 2023-06-23 02:39:55,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1379592.0, ans=0.125 2023-06-23 02:40:09,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-23 02:40:50,710 INFO [train.py:996] (1/4) Epoch 8, batch 16500, loss[loss=0.1995, simple_loss=0.2784, pruned_loss=0.0603, over 21826.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3151, pruned_loss=0.08288, over 4282880.78 frames. 
], batch size: 282, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:41:09,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1379772.0, ans=0.0 2023-06-23 02:42:20,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 5.237e+02 7.941e+02 1.285e+03 2.739e+03, threshold=1.588e+03, percent-clipped=14.0 2023-06-23 02:42:36,170 INFO [train.py:996] (1/4) Epoch 8, batch 16550, loss[loss=0.3291, simple_loss=0.3929, pruned_loss=0.1327, over 21375.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3133, pruned_loss=0.08091, over 4285845.35 frames. ], batch size: 507, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:43:10,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1380132.0, ans=0.0 2023-06-23 02:43:25,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1380192.0, ans=0.125 2023-06-23 02:43:44,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-23 02:44:03,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1380312.0, ans=0.0 2023-06-23 02:44:21,933 INFO [train.py:996] (1/4) Epoch 8, batch 16600, loss[loss=0.2967, simple_loss=0.3983, pruned_loss=0.09757, over 21650.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3214, pruned_loss=0.08434, over 4281115.67 frames. ], batch size: 389, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:45:52,272 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.266e+02 4.965e+02 7.305e+02 1.134e+03 2.257e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 02:45:59,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1380612.0, ans=0.0 2023-06-23 02:46:03,742 INFO [train.py:996] (1/4) Epoch 8, batch 16650, loss[loss=0.2834, simple_loss=0.3617, pruned_loss=0.1025, over 21949.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3291, pruned_loss=0.08659, over 4283615.61 frames. ], batch size: 372, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:47:03,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-23 02:47:05,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1380792.0, ans=0.0 2023-06-23 02:47:21,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1380852.0, ans=0.2 2023-06-23 02:47:29,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380912.0, ans=0.125 2023-06-23 02:47:43,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380912.0, ans=0.1 2023-06-23 02:47:45,735 INFO [train.py:996] (1/4) Epoch 8, batch 16700, loss[loss=0.2295, simple_loss=0.3073, pruned_loss=0.07589, over 21166.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3299, pruned_loss=0.08763, over 4279302.82 frames. 
], batch size: 608, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:48:14,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1381032.0, ans=0.0 2023-06-23 02:48:47,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1381092.0, ans=0.125 2023-06-23 02:48:49,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1381092.0, ans=0.125 2023-06-23 02:49:21,412 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.736e+02 6.147e+02 8.645e+02 1.656e+03, threshold=1.229e+03, percent-clipped=2.0 2023-06-23 02:49:35,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1381212.0, ans=0.125 2023-06-23 02:49:38,555 INFO [train.py:996] (1/4) Epoch 8, batch 16750, loss[loss=0.2705, simple_loss=0.3467, pruned_loss=0.09716, over 21583.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3331, pruned_loss=0.08971, over 4271464.29 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:50:10,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-23 02:50:32,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381392.0, ans=0.1 2023-06-23 02:50:49,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1381452.0, ans=0.125 2023-06-23 02:51:25,396 INFO [train.py:996] (1/4) Epoch 8, batch 16800, loss[loss=0.2669, simple_loss=0.3616, pruned_loss=0.0861, over 21307.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3356, pruned_loss=0.08919, over 4269852.76 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:51:46,767 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:52:05,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1381692.0, ans=0.025 2023-06-23 02:52:10,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1381692.0, ans=0.2 2023-06-23 02:52:47,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 4.585e+02 6.307e+02 8.952e+02 1.873e+03, threshold=1.261e+03, percent-clipped=4.0 2023-06-23 02:53:03,729 INFO [train.py:996] (1/4) Epoch 8, batch 16850, loss[loss=0.2121, simple_loss=0.286, pruned_loss=0.06909, over 21906.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3309, pruned_loss=0.08955, over 4281521.62 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:53:41,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1381992.0, ans=0.015 2023-06-23 02:53:45,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-23 02:54:46,767 INFO [train.py:996] (1/4) Epoch 8, batch 16900, loss[loss=0.2883, simple_loss=0.3649, pruned_loss=0.1059, over 20737.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.326, pruned_loss=0.08749, over 4282790.64 frames. 
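
In the optim.py lines above, the "grad-norm quartiles" are five order statistics of recent gradient norms (reading them as min/25%/50%/75%/max), and in every record the reported threshold equals Clipping_scale (2.0) times the middle value, e.g. 2.0 * 6.307e+02 = 1.261e+03 just above; percent-clipped then reads as the share of recent steps whose norm exceeded that threshold. A small sketch of that bookkeeping, with the window size and exact quantile estimator as assumptions:

    from collections import deque
    import torch

    window = deque(maxlen=200)        # recent gradient norms (size assumed)
    clipping_scale = 2.0

    def clip_step(parameters):
        """Track recent grad norms and clip to clipping_scale * median."""
        norm = torch.nn.utils.clip_grad_norm_(parameters, float("inf"))
        window.append(float(norm))
        t = sorted(window)
        q = [t[int(p * (len(t) - 1))] for p in (0.0, 0.25, 0.5, 0.75, 1.0)]
        threshold = clipping_scale * q[2]                    # 2.0 * median
        clipped = 100.0 * sum(n > threshold for n in window) / len(window)
        torch.nn.utils.clip_grad_norm_(parameters, threshold)
        print("grad-norm quartiles "
              + " ".join(f"{v:.3e}" for v in q)
              + f", threshold={threshold:.3e}, percent-clipped={clipped:.1f}")
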
], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:54:48,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1382172.0, ans=0.125 2023-06-23 02:55:04,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1382232.0, ans=0.0 2023-06-23 02:56:03,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1382412.0, ans=0.0 2023-06-23 02:56:07,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.361e+02 4.825e+02 6.497e+02 9.276e+02 2.744e+03, threshold=1.299e+03, percent-clipped=9.0 2023-06-23 02:56:26,519 INFO [train.py:996] (1/4) Epoch 8, batch 16950, loss[loss=0.2332, simple_loss=0.2996, pruned_loss=0.08341, over 21854.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.321, pruned_loss=0.08588, over 4285124.17 frames. ], batch size: 98, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:56:32,705 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=22.5 2023-06-23 02:56:33,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1382472.0, ans=0.125 2023-06-23 02:56:53,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1382532.0, ans=0.2 2023-06-23 02:57:15,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1382592.0, ans=0.1 2023-06-23 02:57:15,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1382592.0, ans=0.125 2023-06-23 02:57:38,627 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-23 02:57:39,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-23 02:57:46,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1382712.0, ans=0.125 2023-06-23 02:57:48,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1382712.0, ans=0.125 2023-06-23 02:57:48,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-23 02:58:05,790 INFO [train.py:996] (1/4) Epoch 8, batch 17000, loss[loss=0.231, simple_loss=0.302, pruned_loss=0.08002, over 21688.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3185, pruned_loss=0.08642, over 4289733.92 frames. 
], batch size: 263, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:59:36,307 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.116e+02 8.558e+02 1.129e+03 2.527e+03, threshold=1.712e+03, percent-clipped=16.0 2023-06-23 02:59:41,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1383012.0, ans=0.125 2023-06-23 02:59:46,406 INFO [train.py:996] (1/4) Epoch 8, batch 17050, loss[loss=0.2782, simple_loss=0.3587, pruned_loss=0.09883, over 21761.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3247, pruned_loss=0.08853, over 4292475.12 frames. ], batch size: 298, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:00:56,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1383252.0, ans=0.125 2023-06-23 03:01:21,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1383312.0, ans=0.125 2023-06-23 03:01:25,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1383372.0, ans=0.2 2023-06-23 03:01:27,133 INFO [train.py:996] (1/4) Epoch 8, batch 17100, loss[loss=0.2425, simple_loss=0.3131, pruned_loss=0.08596, over 19964.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3237, pruned_loss=0.08951, over 4290537.65 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:01:56,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1383432.0, ans=0.125 2023-06-23 03:01:56,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.21 vs. limit=6.0 2023-06-23 03:02:06,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1383492.0, ans=0.125 2023-06-23 03:02:52,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.295e+02 4.559e+02 5.897e+02 8.703e+02 1.483e+03, threshold=1.179e+03, percent-clipped=0.0 2023-06-23 03:03:01,682 INFO [train.py:996] (1/4) Epoch 8, batch 17150, loss[loss=0.2109, simple_loss=0.2793, pruned_loss=0.07124, over 21444.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3189, pruned_loss=0.08828, over 4292466.15 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:03:05,233 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:03:10,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1383672.0, ans=0.1 2023-06-23 03:03:13,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383672.0, ans=0.1 2023-06-23 03:03:17,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=15.0 2023-06-23 03:03:34,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1383732.0, ans=0.0 2023-06-23 03:03:52,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1383792.0, ans=0.2 2023-06-23 03:03:57,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1383792.0, ans=0.0 2023-06-23 03:04:40,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-23 03:04:42,753 INFO [train.py:996] (1/4) Epoch 8, batch 17200, loss[loss=0.263, simple_loss=0.3406, pruned_loss=0.09267, over 21815.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3187, pruned_loss=0.08829, over 4297857.90 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:05:58,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1384152.0, ans=0.2 2023-06-23 03:06:02,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384152.0, ans=0.1 2023-06-23 03:06:12,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1384212.0, ans=0.2 2023-06-23 03:06:13,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.493e+02 5.793e+02 8.418e+02 1.650e+03, threshold=1.159e+03, percent-clipped=7.0 2023-06-23 03:06:19,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-23 03:06:22,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1384272.0, ans=0.125 2023-06-23 03:06:22,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1384272.0, ans=0.125 2023-06-23 03:06:23,193 INFO [train.py:996] (1/4) Epoch 8, batch 17250, loss[loss=0.2289, simple_loss=0.3104, pruned_loss=0.07365, over 21669.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3212, pruned_loss=0.0896, over 4294834.30 frames. ], batch size: 351, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:06:23,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1384272.0, ans=0.5 2023-06-23 03:06:43,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1384332.0, ans=0.125 2023-06-23 03:07:52,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1384512.0, ans=0.125 2023-06-23 03:08:09,828 INFO [train.py:996] (1/4) Epoch 8, batch 17300, loss[loss=0.2746, simple_loss=0.3529, pruned_loss=0.09814, over 21572.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3292, pruned_loss=0.09265, over 4291278.09 frames. 
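
The Whitening lines above compare a per-module statistic of the activations against a limit (e.g. "metric=4.15 vs. limit=10.0"). The exact statistic is defined in scaling.py and is not reproduced here; the sketch below is only a hypothetical proxy with a similar flavor, measuring how far the channel covariance is from a scaled identity via the largest-to-mean eigenvalue ratio:

    import torch

    def whiteness_metric(x, num_groups=1):
        """Hypothetical proxy: per channel group, the ratio of the largest
        eigenvalue of the feature covariance to the mean eigenvalue
        (1.0 would mean perfectly 'white' features)."""
        feats = x.reshape(-1, x.shape[-1])            # (frames, channels)
        ratios = []
        for g in feats.chunk(num_groups, dim=-1):
            g = g - g.mean(dim=0, keepdim=True)
            cov = (g.T @ g) / g.shape[0]
            eigs = torch.linalg.eigvalsh(cov)
            ratios.append((eigs.max() / eigs.mean().clamp(min=1e-20)).item())
        return max(ratios)

    x = torch.randn(10, 100, 256)                     # (batch, time, channels)
    print(f"metric={whiteness_metric(x):.2f} vs. limit=15.0")
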
], batch size: 414, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:08:16,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1384572.0, ans=0.0 2023-06-23 03:08:18,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1384572.0, ans=0.125 2023-06-23 03:08:19,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1384572.0, ans=0.0 2023-06-23 03:08:53,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384632.0, ans=0.1 2023-06-23 03:09:10,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1384692.0, ans=0.2 2023-06-23 03:09:17,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-06-23 03:09:28,840 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:09:41,183 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.842e+02 6.354e+02 8.974e+02 2.324e+03, threshold=1.271e+03, percent-clipped=7.0 2023-06-23 03:09:56,688 INFO [train.py:996] (1/4) Epoch 8, batch 17350, loss[loss=0.2295, simple_loss=0.3266, pruned_loss=0.06625, over 21701.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3296, pruned_loss=0.09256, over 4289926.18 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:10:18,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1384932.0, ans=0.0 2023-06-23 03:10:25,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1384932.0, ans=0.04949747468305833 2023-06-23 03:10:28,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1384932.0, ans=0.125 2023-06-23 03:10:39,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1384992.0, ans=0.125 2023-06-23 03:11:16,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1385112.0, ans=0.125 2023-06-23 03:11:33,405 INFO [train.py:996] (1/4) Epoch 8, batch 17400, loss[loss=0.1842, simple_loss=0.2495, pruned_loss=0.05947, over 21812.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.326, pruned_loss=0.08838, over 4283135.49 frames. ], batch size: 118, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:11:57,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-23 03:13:07,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.616e+02 6.487e+02 8.880e+02 2.609e+03, threshold=1.297e+03, percent-clipped=10.0 2023-06-23 03:13:18,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1385472.0, ans=0.0 2023-06-23 03:13:19,800 INFO [train.py:996] (1/4) Epoch 8, batch 17450, loss[loss=0.2194, simple_loss=0.3171, pruned_loss=0.06083, over 21178.00 frames. 
], tot_loss[loss=0.2465, simple_loss=0.3221, pruned_loss=0.08541, over 4268804.83 frames. ], batch size: 548, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:13:52,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1385592.0, ans=0.125 2023-06-23 03:14:22,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385652.0, ans=0.1 2023-06-23 03:14:36,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1385712.0, ans=0.125 2023-06-23 03:14:54,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1385712.0, ans=0.125 2023-06-23 03:15:00,672 INFO [train.py:996] (1/4) Epoch 8, batch 17500, loss[loss=0.2124, simple_loss=0.2864, pruned_loss=0.06918, over 21818.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3205, pruned_loss=0.08418, over 4270283.58 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:15:22,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-23 03:15:41,093 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-23 03:15:57,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1385952.0, ans=0.0 2023-06-23 03:15:57,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1385952.0, ans=0.0 2023-06-23 03:16:11,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385952.0, ans=0.1 2023-06-23 03:16:31,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.152e+02 4.278e+02 5.524e+02 8.928e+02 1.678e+03, threshold=1.105e+03, percent-clipped=3.0 2023-06-23 03:16:37,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1386012.0, ans=0.125 2023-06-23 03:16:40,144 INFO [train.py:996] (1/4) Epoch 8, batch 17550, loss[loss=0.2277, simple_loss=0.3179, pruned_loss=0.06878, over 21457.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3202, pruned_loss=0.08248, over 4277228.99 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:16:54,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1386132.0, ans=0.125 2023-06-23 03:16:55,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1386132.0, ans=0.0 2023-06-23 03:17:19,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1386192.0, ans=0.0 2023-06-23 03:17:26,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.12 vs. 
limit=12.0 2023-06-23 03:17:30,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1386192.0, ans=0.125 2023-06-23 03:17:46,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1386252.0, ans=0.125 2023-06-23 03:17:54,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1386312.0, ans=0.125 2023-06-23 03:18:13,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1386312.0, ans=0.125 2023-06-23 03:18:18,848 INFO [train.py:996] (1/4) Epoch 8, batch 17600, loss[loss=0.2504, simple_loss=0.3272, pruned_loss=0.08685, over 21568.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3225, pruned_loss=0.08325, over 4278700.75 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:18:41,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-23 03:19:17,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1386552.0, ans=0.125 2023-06-23 03:19:48,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 4.735e+02 6.263e+02 8.398e+02 1.704e+03, threshold=1.253e+03, percent-clipped=10.0 2023-06-23 03:19:48,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1386612.0, ans=0.125 2023-06-23 03:19:55,704 INFO [train.py:996] (1/4) Epoch 8, batch 17650, loss[loss=0.1891, simple_loss=0.2578, pruned_loss=0.06019, over 21787.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3203, pruned_loss=0.08313, over 4269452.65 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:20:02,811 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:20:27,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-06-23 03:21:16,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1386852.0, ans=0.0 2023-06-23 03:21:36,250 INFO [train.py:996] (1/4) Epoch 8, batch 17700, loss[loss=0.2298, simple_loss=0.3435, pruned_loss=0.05803, over 19916.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3148, pruned_loss=0.08054, over 4272438.79 frames. 
], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:21:57,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1387032.0, ans=0.125 2023-06-23 03:22:09,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1387032.0, ans=0.125 2023-06-23 03:22:27,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1387092.0, ans=0.125 2023-06-23 03:23:11,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 4.293e+02 5.499e+02 1.006e+03 2.228e+03, threshold=1.100e+03, percent-clipped=12.0 2023-06-23 03:23:17,904 INFO [train.py:996] (1/4) Epoch 8, batch 17750, loss[loss=0.261, simple_loss=0.3383, pruned_loss=0.09185, over 21231.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3201, pruned_loss=0.08283, over 4272797.14 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:23:36,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1387272.0, ans=0.0 2023-06-23 03:24:02,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1387392.0, ans=0.025 2023-06-23 03:24:58,734 INFO [train.py:996] (1/4) Epoch 8, batch 17800, loss[loss=0.2239, simple_loss=0.2998, pruned_loss=0.07403, over 21438.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3201, pruned_loss=0.08217, over 4275542.98 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:25:32,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1387632.0, ans=0.125 2023-06-23 03:26:17,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1387812.0, ans=22.5 2023-06-23 03:26:32,501 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.467e+02 6.036e+02 8.319e+02 2.220e+03, threshold=1.207e+03, percent-clipped=14.0 2023-06-23 03:26:33,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1387812.0, ans=0.125 2023-06-23 03:26:37,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.12 vs. limit=22.5 2023-06-23 03:26:39,355 INFO [train.py:996] (1/4) Epoch 8, batch 17850, loss[loss=0.2711, simple_loss=0.3343, pruned_loss=0.1039, over 21718.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.321, pruned_loss=0.08302, over 4276813.15 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:26:49,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1387872.0, ans=0.0 2023-06-23 03:27:17,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1387932.0, ans=0.125 2023-06-23 03:27:28,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1387992.0, ans=0.0 2023-06-23 03:27:30,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. 
limit=22.5 2023-06-23 03:27:35,668 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-23 03:27:38,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1388052.0, ans=0.125 2023-06-23 03:27:43,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1388052.0, ans=0.0 2023-06-23 03:28:16,050 INFO [train.py:996] (1/4) Epoch 8, batch 17900, loss[loss=0.2351, simple_loss=0.3205, pruned_loss=0.07478, over 21401.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3254, pruned_loss=0.0843, over 4271830.43 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:28:16,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388172.0, ans=0.1 2023-06-23 03:28:17,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-23 03:28:26,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1388172.0, ans=0.0 2023-06-23 03:30:00,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 4.650e+02 5.974e+02 7.368e+02 1.876e+03, threshold=1.195e+03, percent-clipped=6.0 2023-06-23 03:30:11,646 INFO [train.py:996] (1/4) Epoch 8, batch 17950, loss[loss=0.2111, simple_loss=0.3333, pruned_loss=0.04448, over 19714.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3246, pruned_loss=0.08097, over 4269537.89 frames. ], batch size: 703, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:31:44,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388712.0, ans=0.1 2023-06-23 03:31:50,012 INFO [train.py:996] (1/4) Epoch 8, batch 18000, loss[loss=0.1991, simple_loss=0.2582, pruned_loss=0.07002, over 21493.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3172, pruned_loss=0.07944, over 4261498.97 frames. ], batch size: 195, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:31:50,012 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 03:32:06,869 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2644, simple_loss=0.3593, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-23 03:32:06,870 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 03:32:27,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.50 vs. limit=10.0 2023-06-23 03:32:28,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1388832.0, ans=0.0 2023-06-23 03:32:46,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1388892.0, ans=0.2 2023-06-23 03:32:51,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.51 vs. 
limit=22.5 2023-06-23 03:32:59,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1388952.0, ans=0.2 2023-06-23 03:33:43,875 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.172e+02 4.301e+02 6.081e+02 8.972e+02 1.795e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-23 03:33:46,964 INFO [train.py:996] (1/4) Epoch 8, batch 18050, loss[loss=0.2204, simple_loss=0.2871, pruned_loss=0.07687, over 21437.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3115, pruned_loss=0.07847, over 4270243.33 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:33:57,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1389072.0, ans=0.125 2023-06-23 03:35:04,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1389252.0, ans=0.2 2023-06-23 03:35:27,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1389372.0, ans=0.125 2023-06-23 03:35:28,055 INFO [train.py:996] (1/4) Epoch 8, batch 18100, loss[loss=0.2362, simple_loss=0.308, pruned_loss=0.08222, over 21912.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3166, pruned_loss=0.08136, over 4272964.99 frames. ], batch size: 98, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:36:02,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1389432.0, ans=0.125 2023-06-23 03:36:59,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1389612.0, ans=0.125 2023-06-23 03:37:04,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.338e+02 4.664e+02 6.579e+02 9.782e+02 2.052e+03, threshold=1.316e+03, percent-clipped=11.0 2023-06-23 03:37:06,649 INFO [train.py:996] (1/4) Epoch 8, batch 18150, loss[loss=0.2105, simple_loss=0.2988, pruned_loss=0.06108, over 21618.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3187, pruned_loss=0.08183, over 4271936.46 frames. ], batch size: 263, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:38:40,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1389912.0, ans=0.125 2023-06-23 03:38:43,072 INFO [train.py:996] (1/4) Epoch 8, batch 18200, loss[loss=0.2029, simple_loss=0.2766, pruned_loss=0.06454, over 21739.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3134, pruned_loss=0.08158, over 4260192.62 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:39:05,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390032.0, ans=0.125 2023-06-23 03:39:05,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390032.0, ans=0.1 2023-06-23 03:39:17,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. 
limit=15.0 2023-06-23 03:39:57,980 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:40:17,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.155e+02 4.842e+02 6.713e+02 9.646e+02 2.158e+03, threshold=1.343e+03, percent-clipped=10.0 2023-06-23 03:40:19,042 INFO [train.py:996] (1/4) Epoch 8, batch 18250, loss[loss=0.2308, simple_loss=0.2971, pruned_loss=0.08223, over 21665.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3066, pruned_loss=0.07887, over 4264063.60 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:40:32,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1390272.0, ans=0.5 2023-06-23 03:40:47,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-23 03:40:51,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1390392.0, ans=0.0 2023-06-23 03:40:51,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1390392.0, ans=0.2 2023-06-23 03:41:56,239 INFO [train.py:996] (1/4) Epoch 8, batch 18300, loss[loss=0.2637, simple_loss=0.3608, pruned_loss=0.08332, over 21784.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3054, pruned_loss=0.07953, over 4270529.27 frames. ], batch size: 282, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:42:18,157 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:42:24,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390632.0, ans=0.1 2023-06-23 03:42:32,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1390692.0, ans=0.125 2023-06-23 03:43:15,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1390812.0, ans=0.125 2023-06-23 03:43:32,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.212e+02 4.888e+02 7.173e+02 1.170e+03 2.600e+03, threshold=1.435e+03, percent-clipped=18.0 2023-06-23 03:43:34,047 INFO [train.py:996] (1/4) Epoch 8, batch 18350, loss[loss=0.1976, simple_loss=0.2727, pruned_loss=0.06123, over 21505.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3087, pruned_loss=0.07904, over 4273717.77 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:43:53,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1390932.0, ans=0.125 2023-06-23 03:44:15,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1390992.0, ans=0.1 2023-06-23 03:44:15,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1390992.0, ans=0.125 2023-06-23 03:44:46,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. 
limit=22.5 2023-06-23 03:45:09,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1391112.0, ans=0.125 2023-06-23 03:45:12,227 INFO [train.py:996] (1/4) Epoch 8, batch 18400, loss[loss=0.2112, simple_loss=0.2839, pruned_loss=0.06932, over 21126.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3058, pruned_loss=0.07788, over 4260517.05 frames. ], batch size: 159, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:45:14,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391172.0, ans=0.125 2023-06-23 03:46:22,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1391352.0, ans=0.125 2023-06-23 03:46:46,183 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.306e+02 6.034e+02 8.770e+02 2.014e+03, threshold=1.207e+03, percent-clipped=5.0 2023-06-23 03:46:48,008 INFO [train.py:996] (1/4) Epoch 8, batch 18450, loss[loss=0.2003, simple_loss=0.2875, pruned_loss=0.05661, over 21662.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3028, pruned_loss=0.07442, over 4264135.64 frames. ], batch size: 247, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:47:00,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1391472.0, ans=0.0 2023-06-23 03:47:02,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1391532.0, ans=0.125 2023-06-23 03:47:32,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1391592.0, ans=0.125 2023-06-23 03:47:55,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391652.0, ans=0.125 2023-06-23 03:47:57,865 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-23 03:48:11,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1391712.0, ans=0.1 2023-06-23 03:48:13,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1391712.0, ans=0.125 2023-06-23 03:48:25,060 INFO [train.py:996] (1/4) Epoch 8, batch 18500, loss[loss=0.1834, simple_loss=0.2626, pruned_loss=0.05214, over 21213.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2966, pruned_loss=0.07247, over 4255040.67 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:48:37,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1391772.0, ans=0.0 2023-06-23 03:48:49,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.51 vs. 
limit=15.0 2023-06-23 03:48:51,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1391832.0, ans=0.0 2023-06-23 03:48:51,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1391832.0, ans=0.125 2023-06-23 03:49:34,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1391952.0, ans=0.025 2023-06-23 03:50:02,678 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.977e+02 4.135e+02 5.661e+02 7.712e+02 1.457e+03, threshold=1.132e+03, percent-clipped=3.0 2023-06-23 03:50:04,119 INFO [train.py:996] (1/4) Epoch 8, batch 18550, loss[loss=0.2023, simple_loss=0.2803, pruned_loss=0.06218, over 21767.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2953, pruned_loss=0.07198, over 4248948.24 frames. ], batch size: 351, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:50:27,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1392132.0, ans=0.0 2023-06-23 03:50:59,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1392192.0, ans=0.125 2023-06-23 03:51:36,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392312.0, ans=0.125 2023-06-23 03:51:43,202 INFO [train.py:996] (1/4) Epoch 8, batch 18600, loss[loss=0.2137, simple_loss=0.2856, pruned_loss=0.07088, over 21583.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2931, pruned_loss=0.07213, over 4252800.58 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:52:19,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1392492.0, ans=0.2 2023-06-23 03:52:20,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1392492.0, ans=0.125 2023-06-23 03:52:30,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1392492.0, ans=0.2 2023-06-23 03:52:40,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1392552.0, ans=0.2 2023-06-23 03:52:54,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392552.0, ans=0.125 2023-06-23 03:53:17,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.135e+02 5.174e+02 7.950e+02 1.061e+03 1.906e+03, threshold=1.590e+03, percent-clipped=19.0 2023-06-23 03:53:19,664 INFO [train.py:996] (1/4) Epoch 8, batch 18650, loss[loss=0.2139, simple_loss=0.2745, pruned_loss=0.07664, over 21810.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2919, pruned_loss=0.07258, over 4256461.74 frames. 
], batch size: 102, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:53:20,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1392672.0, ans=0.0 2023-06-23 03:53:24,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1392672.0, ans=0.125 2023-06-23 03:53:32,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1392672.0, ans=0.125 2023-06-23 03:53:38,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1392732.0, ans=0.0 2023-06-23 03:53:41,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1392732.0, ans=0.125 2023-06-23 03:54:56,259 INFO [train.py:996] (1/4) Epoch 8, batch 18700, loss[loss=0.1711, simple_loss=0.2371, pruned_loss=0.05251, over 20711.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2899, pruned_loss=0.07419, over 4247218.03 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:55:13,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1393032.0, ans=0.025 2023-06-23 03:55:15,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-23 03:55:27,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1393092.0, ans=0.2 2023-06-23 03:55:45,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1393092.0, ans=0.0 2023-06-23 03:55:55,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1393152.0, ans=0.125 2023-06-23 03:56:11,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1393152.0, ans=0.125 2023-06-23 03:56:28,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-23 03:56:30,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-23 03:56:32,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.347e+02 4.104e+02 5.076e+02 6.613e+02 1.727e+03, threshold=1.015e+03, percent-clipped=1.0 2023-06-23 03:56:32,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1393272.0, ans=0.125 2023-06-23 03:56:33,829 INFO [train.py:996] (1/4) Epoch 8, batch 18750, loss[loss=0.2896, simple_loss=0.3558, pruned_loss=0.1117, over 21608.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2933, pruned_loss=0.07733, over 4246991.98 frames. 
], batch size: 263, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:56:37,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1393272.0, ans=0.125 2023-06-23 03:56:40,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1393272.0, ans=0.125 2023-06-23 03:56:47,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1393332.0, ans=0.015 2023-06-23 03:56:48,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1393332.0, ans=0.125 2023-06-23 03:57:23,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1393392.0, ans=0.125 2023-06-23 03:57:45,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1393452.0, ans=0.0 2023-06-23 03:58:11,933 INFO [train.py:996] (1/4) Epoch 8, batch 18800, loss[loss=0.25, simple_loss=0.3393, pruned_loss=0.08031, over 21648.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3019, pruned_loss=0.07894, over 4246978.46 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 03:58:23,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.94 vs. limit=22.5 2023-06-23 03:58:32,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1393632.0, ans=0.2 2023-06-23 03:58:34,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1393632.0, ans=0.125 2023-06-23 03:58:40,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1393632.0, ans=0.125 2023-06-23 03:59:13,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1393752.0, ans=0.125 2023-06-23 03:59:48,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.636e+02 4.595e+02 6.296e+02 8.883e+02 2.093e+03, threshold=1.259e+03, percent-clipped=21.0 2023-06-23 03:59:50,079 INFO [train.py:996] (1/4) Epoch 8, batch 18850, loss[loss=0.2392, simple_loss=0.3006, pruned_loss=0.08889, over 21872.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2979, pruned_loss=0.0748, over 4253785.51 frames. ], batch size: 98, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:00:04,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1393932.0, ans=0.125 2023-06-23 04:00:46,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.75 vs. limit=15.0 2023-06-23 04:00:57,659 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-23 04:01:21,365 INFO [train.py:996] (1/4) Epoch 8, batch 18900, loss[loss=0.2179, simple_loss=0.2828, pruned_loss=0.0765, over 21687.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2945, pruned_loss=0.07548, over 4259974.18 frames. 
], batch size: 282, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:01:29,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394172.0, ans=0.0 2023-06-23 04:01:49,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-23 04:02:58,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.232e+02 4.494e+02 5.345e+02 6.718e+02 1.434e+03, threshold=1.069e+03, percent-clipped=2.0 2023-06-23 04:03:00,411 INFO [train.py:996] (1/4) Epoch 8, batch 18950, loss[loss=0.2414, simple_loss=0.3001, pruned_loss=0.09141, over 20039.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2955, pruned_loss=0.07742, over 4265482.80 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:03:04,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1394472.0, ans=0.2 2023-06-23 04:03:04,840 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. limit=5.0 2023-06-23 04:03:23,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1394532.0, ans=0.0 2023-06-23 04:04:39,991 INFO [train.py:996] (1/4) Epoch 8, batch 19000, loss[loss=0.2101, simple_loss=0.2665, pruned_loss=0.07684, over 20790.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3035, pruned_loss=0.07837, over 4271559.86 frames. ], batch size: 609, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:04:47,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-23 04:05:05,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1394832.0, ans=0.125 2023-06-23 04:05:07,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1394832.0, ans=0.125 2023-06-23 04:05:18,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1394892.0, ans=0.2 2023-06-23 04:06:08,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 04:06:11,616 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.109e+02 8.049e+02 1.091e+03 2.389e+03, threshold=1.610e+03, percent-clipped=25.0 2023-06-23 04:06:13,349 INFO [train.py:996] (1/4) Epoch 8, batch 19050, loss[loss=0.2876, simple_loss=0.3507, pruned_loss=0.1123, over 21400.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3105, pruned_loss=0.08317, over 4272680.63 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:06:56,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.65 vs. 
limit=22.5 2023-06-23 04:07:23,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395252.0, ans=0.1 2023-06-23 04:07:34,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1395252.0, ans=0.125 2023-06-23 04:07:53,266 INFO [train.py:996] (1/4) Epoch 8, batch 19100, loss[loss=0.2385, simple_loss=0.2895, pruned_loss=0.09378, over 21168.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3104, pruned_loss=0.08523, over 4269732.82 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:07:58,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.31 vs. limit=22.5 2023-06-23 04:08:08,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1395432.0, ans=0.0 2023-06-23 04:08:09,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1395432.0, ans=0.5 2023-06-23 04:08:13,748 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-23 04:09:07,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1395552.0, ans=0.125 2023-06-23 04:09:33,317 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 4.590e+02 5.879e+02 8.375e+02 2.097e+03, threshold=1.176e+03, percent-clipped=3.0 2023-06-23 04:09:34,844 INFO [train.py:996] (1/4) Epoch 8, batch 19150, loss[loss=0.2537, simple_loss=0.3385, pruned_loss=0.08448, over 21392.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3107, pruned_loss=0.08528, over 4269543.54 frames. ], batch size: 211, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:09:38,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1395672.0, ans=0.1 2023-06-23 04:10:42,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1395852.0, ans=0.0 2023-06-23 04:11:19,413 INFO [train.py:996] (1/4) Epoch 8, batch 19200, loss[loss=0.2251, simple_loss=0.33, pruned_loss=0.06017, over 21421.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3213, pruned_loss=0.08564, over 4270949.85 frames. ], batch size: 194, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:11:49,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1396032.0, ans=0.2 2023-06-23 04:12:44,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1396212.0, ans=0.125 2023-06-23 04:12:50,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 4.861e+02 7.058e+02 9.743e+02 2.046e+03, threshold=1.412e+03, percent-clipped=16.0 2023-06-23 04:12:50,673 INFO [train.py:996] (1/4) Epoch 8, batch 19250, loss[loss=0.1872, simple_loss=0.2834, pruned_loss=0.04551, over 21330.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3201, pruned_loss=0.07991, over 4275647.32 frames. 
], batch size: 176, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:13:48,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1396392.0, ans=0.125 2023-06-23 04:14:12,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1396512.0, ans=0.0 2023-06-23 04:14:29,806 INFO [train.py:996] (1/4) Epoch 8, batch 19300, loss[loss=0.2482, simple_loss=0.3129, pruned_loss=0.09176, over 21535.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3207, pruned_loss=0.08035, over 4278294.03 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:15:32,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1396692.0, ans=0.125 2023-06-23 04:16:01,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1396812.0, ans=0.2 2023-06-23 04:16:14,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.848e+02 5.029e+02 6.832e+02 8.768e+02 1.869e+03, threshold=1.366e+03, percent-clipped=8.0 2023-06-23 04:16:14,292 INFO [train.py:996] (1/4) Epoch 8, batch 19350, loss[loss=0.2805, simple_loss=0.357, pruned_loss=0.102, over 21541.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3137, pruned_loss=0.07557, over 4279662.12 frames. ], batch size: 509, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:16:42,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396932.0, ans=0.1 2023-06-23 04:16:52,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-23 04:17:36,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1397112.0, ans=0.1 2023-06-23 04:17:54,537 INFO [train.py:996] (1/4) Epoch 8, batch 19400, loss[loss=0.2253, simple_loss=0.29, pruned_loss=0.08026, over 21287.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3107, pruned_loss=0.07473, over 4275435.61 frames. 
], batch size: 159, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:18:24,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1397232.0, ans=0.0 2023-06-23 04:18:31,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1397232.0, ans=0.125 2023-06-23 04:18:45,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1397292.0, ans=0.5 2023-06-23 04:18:47,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1397292.0, ans=0.125 2023-06-23 04:19:03,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1397352.0, ans=0.125 2023-06-23 04:19:11,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1397412.0, ans=0.2 2023-06-23 04:19:20,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1397412.0, ans=0.0 2023-06-23 04:19:38,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.184e+02 4.448e+02 5.769e+02 7.543e+02 1.139e+03, threshold=1.154e+03, percent-clipped=0.0 2023-06-23 04:19:38,407 INFO [train.py:996] (1/4) Epoch 8, batch 19450, loss[loss=0.242, simple_loss=0.2941, pruned_loss=0.09498, over 21582.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3086, pruned_loss=0.0776, over 4281325.82 frames. ], batch size: 441, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:20:20,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1397592.0, ans=0.0 2023-06-23 04:20:31,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1397592.0, ans=0.125 2023-06-23 04:21:16,468 INFO [train.py:996] (1/4) Epoch 8, batch 19500, loss[loss=0.1907, simple_loss=0.247, pruned_loss=0.06716, over 21139.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3046, pruned_loss=0.0779, over 4282590.26 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:21:33,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=12.0 2023-06-23 04:21:36,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1397832.0, ans=0.025 2023-06-23 04:22:00,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1397892.0, ans=0.125 2023-06-23 04:22:54,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.009e+02 6.847e+02 1.109e+03 2.464e+03, threshold=1.369e+03, percent-clipped=22.0 2023-06-23 04:22:54,730 INFO [train.py:996] (1/4) Epoch 8, batch 19550, loss[loss=0.1938, simple_loss=0.2916, pruned_loss=0.04801, over 21744.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2996, pruned_loss=0.07652, over 4267588.69 frames. 
], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:23:14,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1398132.0, ans=0.0 2023-06-23 04:23:49,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1398252.0, ans=0.0 2023-06-23 04:24:02,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1398252.0, ans=0.125 2023-06-23 04:24:10,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1398312.0, ans=0.1 2023-06-23 04:24:27,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1398312.0, ans=0.0 2023-06-23 04:24:30,137 INFO [train.py:996] (1/4) Epoch 8, batch 19600, loss[loss=0.2302, simple_loss=0.3042, pruned_loss=0.07808, over 21782.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3015, pruned_loss=0.07764, over 4271942.63 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:24:32,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1398372.0, ans=0.0 2023-06-23 04:24:58,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1398432.0, ans=0.125 2023-06-23 04:25:26,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1398552.0, ans=0.0 2023-06-23 04:25:58,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1398612.0, ans=0.125 2023-06-23 04:26:01,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-23 04:26:08,767 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 4.617e+02 5.615e+02 7.940e+02 2.383e+03, threshold=1.123e+03, percent-clipped=6.0 2023-06-23 04:26:08,801 INFO [train.py:996] (1/4) Epoch 8, batch 19650, loss[loss=0.2454, simple_loss=0.3559, pruned_loss=0.06746, over 19795.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3065, pruned_loss=0.0814, over 4277321.04 frames. ], batch size: 702, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:26:15,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1398672.0, ans=0.2 2023-06-23 04:27:55,201 INFO [train.py:996] (1/4) Epoch 8, batch 19700, loss[loss=0.2151, simple_loss=0.3062, pruned_loss=0.06199, over 21671.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3098, pruned_loss=0.08183, over 4277046.09 frames. 
], batch size: 298, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:28:25,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1399032.0, ans=0.0 2023-06-23 04:28:48,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1399092.0, ans=0.125 2023-06-23 04:29:11,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1399152.0, ans=0.125 2023-06-23 04:29:24,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-23 04:29:29,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-23 04:29:30,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1399212.0, ans=0.2 2023-06-23 04:29:32,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-23 04:29:34,758 INFO [train.py:996] (1/4) Epoch 8, batch 19750, loss[loss=0.2706, simple_loss=0.3761, pruned_loss=0.08249, over 21695.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3215, pruned_loss=0.08411, over 4279091.71 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:29:36,363 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 5.158e+02 7.198e+02 1.115e+03 3.431e+03, threshold=1.440e+03, percent-clipped=24.0 2023-06-23 04:29:39,332 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-23 04:29:53,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-23 04:30:01,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-23 04:30:39,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1399452.0, ans=0.0 2023-06-23 04:30:58,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1399512.0, ans=15.0 2023-06-23 04:31:08,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1399512.0, ans=0.2 2023-06-23 04:31:12,841 INFO [train.py:996] (1/4) Epoch 8, batch 19800, loss[loss=0.1993, simple_loss=0.271, pruned_loss=0.06381, over 21453.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.323, pruned_loss=0.08525, over 4280163.71 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:31:47,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-23 04:32:18,249 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-06-23 04:32:51,982 INFO [train.py:996] (1/4) Epoch 8, batch 19850, loss[loss=0.2061, simple_loss=0.3032, pruned_loss=0.05446, over 21169.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3147, pruned_loss=0.07996, over 4278465.86 frames. ], batch size: 548, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:32:53,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.926e+02 6.065e+02 9.192e+02 2.099e+03, threshold=1.213e+03, percent-clipped=4.0 2023-06-23 04:33:23,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1399932.0, ans=0.125 2023-06-23 04:33:25,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1399932.0, ans=0.0 2023-06-23 04:33:31,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1399992.0, ans=0.1 2023-06-23 04:33:48,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1399992.0, ans=0.125 2023-06-23 04:33:52,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1400052.0, ans=0.0 2023-06-23 04:34:07,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1400112.0, ans=0.2 2023-06-23 04:34:28,513 INFO [train.py:996] (1/4) Epoch 8, batch 19900, loss[loss=0.2118, simple_loss=0.2771, pruned_loss=0.0733, over 21552.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3121, pruned_loss=0.07666, over 4276108.61 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:34:40,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1400172.0, ans=0.125 2023-06-23 04:35:12,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400232.0, ans=0.1 2023-06-23 04:36:08,921 INFO [train.py:996] (1/4) Epoch 8, batch 19950, loss[loss=0.2573, simple_loss=0.3105, pruned_loss=0.102, over 21871.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3067, pruned_loss=0.07659, over 4275768.09 frames. 
], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:36:10,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.068e+02 6.013e+02 8.874e+02 2.224e+03, threshold=1.203e+03, percent-clipped=12.0 2023-06-23 04:36:32,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1400532.0, ans=0.125 2023-06-23 04:36:32,740 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:36:50,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1400592.0, ans=0.125 2023-06-23 04:37:09,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1400652.0, ans=0.125 2023-06-23 04:37:21,025 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:37:40,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1400712.0, ans=0.09899494936611666 2023-06-23 04:37:42,773 INFO [train.py:996] (1/4) Epoch 8, batch 20000, loss[loss=0.2149, simple_loss=0.2986, pruned_loss=0.06556, over 21115.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.308, pruned_loss=0.07695, over 4279756.21 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:38:17,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1400832.0, ans=0.125 2023-06-23 04:38:58,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1401012.0, ans=0.125 2023-06-23 04:38:58,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401012.0, ans=0.1 2023-06-23 04:39:11,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1401012.0, ans=0.125 2023-06-23 04:39:15,702 INFO [train.py:996] (1/4) Epoch 8, batch 20050, loss[loss=0.2549, simple_loss=0.3228, pruned_loss=0.09348, over 21953.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3101, pruned_loss=0.07999, over 4285322.01 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:39:18,836 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.577e+02 6.319e+02 8.281e+02 1.487e+03, threshold=1.264e+03, percent-clipped=6.0 2023-06-23 04:39:38,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401072.0, ans=0.1 2023-06-23 04:39:39,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1401132.0, ans=0.1 2023-06-23 04:39:40,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. 
limit=15.0 2023-06-23 04:40:00,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1401192.0, ans=0.0 2023-06-23 04:40:02,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1401192.0, ans=0.125 2023-06-23 04:40:29,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-23 04:40:48,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1401312.0, ans=0.125 2023-06-23 04:40:49,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-23 04:40:54,579 INFO [train.py:996] (1/4) Epoch 8, batch 20100, loss[loss=0.2228, simple_loss=0.3058, pruned_loss=0.06988, over 21811.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3121, pruned_loss=0.08252, over 4293673.71 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:41:38,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1401492.0, ans=0.2 2023-06-23 04:41:49,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-23 04:41:53,342 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:42:16,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1401612.0, ans=0.125 2023-06-23 04:42:47,132 INFO [train.py:996] (1/4) Epoch 8, batch 20150, loss[loss=0.2856, simple_loss=0.3552, pruned_loss=0.108, over 21725.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3218, pruned_loss=0.08553, over 4289877.12 frames. ], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:42:50,184 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 4.538e+02 5.704e+02 8.156e+02 2.453e+03, threshold=1.141e+03, percent-clipped=8.0 2023-06-23 04:43:01,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1401672.0, ans=0.0 2023-06-23 04:43:07,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401732.0, ans=0.1 2023-06-23 04:44:16,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-23 04:44:24,630 INFO [train.py:996] (1/4) Epoch 8, batch 20200, loss[loss=0.2204, simple_loss=0.3111, pruned_loss=0.06489, over 21639.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3274, pruned_loss=0.08844, over 4287498.67 frames. 
], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:44:46,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1402032.0, ans=0.0 2023-06-23 04:45:47,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1402212.0, ans=0.125 2023-06-23 04:45:50,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1402212.0, ans=0.1 2023-06-23 04:45:52,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1402212.0, ans=0.0 2023-06-23 04:45:58,127 INFO [train.py:996] (1/4) Epoch 8, batch 20250, loss[loss=0.2424, simple_loss=0.3087, pruned_loss=0.08809, over 21291.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.327, pruned_loss=0.08619, over 4285295.88 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:46:01,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.077e+02 7.177e+02 9.506e+02 2.179e+03, threshold=1.435e+03, percent-clipped=12.0 2023-06-23 04:47:19,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1402512.0, ans=0.09899494936611666 2023-06-23 04:47:25,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-23 04:47:26,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1402512.0, ans=0.125 2023-06-23 04:47:37,021 INFO [train.py:996] (1/4) Epoch 8, batch 20300, loss[loss=0.2113, simple_loss=0.2849, pruned_loss=0.06881, over 21340.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3251, pruned_loss=0.08395, over 4281105.88 frames. ], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:47:40,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1402572.0, ans=0.125 2023-06-23 04:48:23,329 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.20 vs. limit=12.0 2023-06-23 04:48:50,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1402752.0, ans=15.0 2023-06-23 04:49:10,175 INFO [train.py:996] (1/4) Epoch 8, batch 20350, loss[loss=0.2789, simple_loss=0.3474, pruned_loss=0.1052, over 21873.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3247, pruned_loss=0.08408, over 4265128.00 frames. 
], batch size: 118, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:49:12,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1402872.0, ans=0.125 2023-06-23 04:49:13,375 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 4.961e+02 7.570e+02 1.006e+03 1.715e+03, threshold=1.514e+03, percent-clipped=7.0 2023-06-23 04:49:55,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1402992.0, ans=0.1 2023-06-23 04:50:38,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1403112.0, ans=0.2 2023-06-23 04:50:43,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1403172.0, ans=0.125 2023-06-23 04:50:44,177 INFO [train.py:996] (1/4) Epoch 8, batch 20400, loss[loss=0.2506, simple_loss=0.3276, pruned_loss=0.08684, over 21651.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3275, pruned_loss=0.08712, over 4266652.12 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:50:47,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1403172.0, ans=0.125 2023-06-23 04:51:19,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1403292.0, ans=0.125 2023-06-23 04:51:41,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-23 04:51:58,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1403352.0, ans=0.0 2023-06-23 04:52:08,905 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-23 04:52:17,050 INFO [train.py:996] (1/4) Epoch 8, batch 20450, loss[loss=0.2445, simple_loss=0.3077, pruned_loss=0.0906, over 21501.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3285, pruned_loss=0.0889, over 4248636.76 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:52:20,042 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.597e+02 5.065e+02 6.621e+02 9.433e+02 1.870e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-23 04:52:20,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1403472.0, ans=0.125 2023-06-23 04:52:31,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1403532.0, ans=0.0 2023-06-23 04:52:31,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1403532.0, ans=0.07 2023-06-23 04:52:40,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1403532.0, ans=0.2 2023-06-23 04:52:42,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. 
limit=15.0 2023-06-23 04:53:54,330 INFO [train.py:996] (1/4) Epoch 8, batch 20500, loss[loss=0.2221, simple_loss=0.2918, pruned_loss=0.07621, over 21462.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3235, pruned_loss=0.08893, over 4250885.72 frames. ], batch size: 211, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:54:43,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1403892.0, ans=0.0 2023-06-23 04:55:04,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1403952.0, ans=0.125 2023-06-23 04:55:17,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1404012.0, ans=0.125 2023-06-23 04:55:28,231 INFO [train.py:996] (1/4) Epoch 8, batch 20550, loss[loss=0.2414, simple_loss=0.3207, pruned_loss=0.0811, over 21855.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3166, pruned_loss=0.08738, over 4245431.92 frames. ], batch size: 372, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:55:31,311 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.304e+02 5.833e+02 8.675e+02 1.439e+03, threshold=1.167e+03, percent-clipped=3.0 2023-06-23 04:55:39,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1404072.0, ans=0.125 2023-06-23 04:55:47,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1404132.0, ans=0.125 2023-06-23 04:56:07,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1404192.0, ans=0.125 2023-06-23 04:56:52,467 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:57:07,618 INFO [train.py:996] (1/4) Epoch 8, batch 20600, loss[loss=0.2468, simple_loss=0.3231, pruned_loss=0.08525, over 21724.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3196, pruned_loss=0.08573, over 4247697.96 frames. ], batch size: 389, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:57:36,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404432.0, ans=0.1 2023-06-23 04:58:05,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1404492.0, ans=0.2 2023-06-23 04:58:13,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1404552.0, ans=0.0 2023-06-23 04:58:21,418 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:58:46,053 INFO [train.py:996] (1/4) Epoch 8, batch 20650, loss[loss=0.2142, simple_loss=0.2696, pruned_loss=0.07939, over 21541.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3162, pruned_loss=0.08618, over 4261575.93 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:58:49,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.528e+02 4.605e+02 7.657e+02 1.188e+03 2.326e+03, threshold=1.531e+03, percent-clipped=25.0 2023-06-23 04:59:24,001 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-06-23 05:00:02,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404852.0, ans=0.1 2023-06-23 05:00:10,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1404912.0, ans=0.125 2023-06-23 05:00:17,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1404912.0, ans=0.0 2023-06-23 05:00:26,477 INFO [train.py:996] (1/4) Epoch 8, batch 20700, loss[loss=0.1846, simple_loss=0.2682, pruned_loss=0.0505, over 21609.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3069, pruned_loss=0.08236, over 4258178.34 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:00:50,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-23 05:02:07,892 INFO [train.py:996] (1/4) Epoch 8, batch 20750, loss[loss=0.2723, simple_loss=0.3659, pruned_loss=0.08931, over 21734.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3104, pruned_loss=0.0816, over 4264762.40 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:02:11,643 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.120e+02 5.727e+02 9.009e+02 2.135e+03, threshold=1.145e+03, percent-clipped=5.0 2023-06-23 05:02:33,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-23 05:02:38,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1405332.0, ans=15.0 2023-06-23 05:02:59,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1405392.0, ans=0.0 2023-06-23 05:03:12,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1405392.0, ans=0.125 2023-06-23 05:03:20,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1405452.0, ans=0.125 2023-06-23 05:03:22,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1405452.0, ans=0.0 2023-06-23 05:03:47,517 INFO [train.py:996] (1/4) Epoch 8, batch 20800, loss[loss=0.1959, simple_loss=0.2664, pruned_loss=0.06273, over 21598.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3122, pruned_loss=0.0826, over 4266609.24 frames. ], batch size: 298, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:03:58,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-06-23 05:04:09,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1405572.0, ans=0.1 2023-06-23 05:05:04,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1405812.0, ans=0.125 2023-06-23 05:05:19,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1405872.0, ans=0.0 2023-06-23 05:05:19,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405872.0, ans=0.1 2023-06-23 05:05:20,450 INFO [train.py:996] (1/4) Epoch 8, batch 20850, loss[loss=0.1826, simple_loss=0.2527, pruned_loss=0.05623, over 21016.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3036, pruned_loss=0.07967, over 4243977.12 frames. ], batch size: 608, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:05:28,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.968e+02 4.772e+02 9.225e+02 1.220e+03 2.670e+03, threshold=1.845e+03, percent-clipped=33.0 2023-06-23 05:06:03,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1405932.0, ans=0.125 2023-06-23 05:06:24,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1405992.0, ans=0.0 2023-06-23 05:06:57,363 INFO [train.py:996] (1/4) Epoch 8, batch 20900, loss[loss=0.2351, simple_loss=0.3151, pruned_loss=0.07757, over 21818.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3034, pruned_loss=0.08081, over 4250590.18 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:07:00,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406172.0, ans=0.1 2023-06-23 05:07:02,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1406172.0, ans=0.0 2023-06-23 05:07:33,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406232.0, ans=0.1 2023-06-23 05:07:55,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1406292.0, ans=0.2 2023-06-23 05:08:04,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1406352.0, ans=0.125 2023-06-23 05:08:33,364 INFO [train.py:996] (1/4) Epoch 8, batch 20950, loss[loss=0.1996, simple_loss=0.2741, pruned_loss=0.06258, over 21840.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2994, pruned_loss=0.07727, over 4252809.33 frames. ], batch size: 118, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:08:36,659 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.522e+02 5.963e+02 9.256e+02 1.585e+03, threshold=1.193e+03, percent-clipped=0.0 2023-06-23 05:08:47,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-06-23 05:09:05,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1406532.0, ans=0.125 2023-06-23 05:09:57,967 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-23 05:10:11,185 INFO [train.py:996] (1/4) Epoch 8, batch 21000, loss[loss=0.2715, simple_loss=0.3389, pruned_loss=0.1021, over 21833.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3005, pruned_loss=0.07798, over 4256089.61 frames. ], batch size: 112, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:10:11,186 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 05:10:27,248 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2634, simple_loss=0.3611, pruned_loss=0.08288, over 1796401.00 frames. 2023-06-23 05:10:27,249 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 05:11:22,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1406892.0, ans=0.2 2023-06-23 05:11:26,729 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:11:31,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1406952.0, ans=0.0 2023-06-23 05:12:04,114 INFO [train.py:996] (1/4) Epoch 8, batch 21050, loss[loss=0.2498, simple_loss=0.3157, pruned_loss=0.09194, over 15895.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2996, pruned_loss=0.07873, over 4256463.59 frames. ], batch size: 63, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:12:04,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1407072.0, ans=0.0 2023-06-23 05:12:07,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.662e+02 4.992e+02 6.776e+02 1.028e+03 2.055e+03, threshold=1.355e+03, percent-clipped=16.0 2023-06-23 05:12:52,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407132.0, ans=0.1 2023-06-23 05:13:00,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1407192.0, ans=0.125 2023-06-23 05:13:09,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-23 05:13:25,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1407312.0, ans=0.125 2023-06-23 05:13:26,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407312.0, ans=0.1 2023-06-23 05:13:28,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1407312.0, ans=0.125 2023-06-23 05:13:31,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1407312.0, ans=0.0 2023-06-23 05:13:42,103 INFO [train.py:996] (1/4) Epoch 8, batch 21100, loss[loss=0.2102, simple_loss=0.2771, pruned_loss=0.07167, over 21714.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2965, pruned_loss=0.07824, over 4266409.34 frames. 
], batch size: 316, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:14:37,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407492.0, ans=0.1 2023-06-23 05:15:01,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1407612.0, ans=0.125 2023-06-23 05:15:15,056 INFO [train.py:996] (1/4) Epoch 8, batch 21150, loss[loss=0.2554, simple_loss=0.3687, pruned_loss=0.07099, over 19793.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2936, pruned_loss=0.07926, over 4273122.82 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:15:18,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.999e+02 4.608e+02 5.910e+02 9.200e+02 1.578e+03, threshold=1.182e+03, percent-clipped=4.0 2023-06-23 05:16:27,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407852.0, ans=0.1 2023-06-23 05:16:28,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-23 05:16:39,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-23 05:16:53,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1407972.0, ans=0.2 2023-06-23 05:16:54,301 INFO [train.py:996] (1/4) Epoch 8, batch 21200, loss[loss=0.1997, simple_loss=0.265, pruned_loss=0.06725, over 21259.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.29, pruned_loss=0.07836, over 4253614.51 frames. ], batch size: 144, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:16:57,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1407972.0, ans=0.125 2023-06-23 05:17:10,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1407972.0, ans=0.2 2023-06-23 05:17:45,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1408092.0, ans=0.125 2023-06-23 05:17:48,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1408092.0, ans=0.125 2023-06-23 05:18:10,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1408152.0, ans=0.0 2023-06-23 05:18:32,004 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-23 05:18:32,275 INFO [train.py:996] (1/4) Epoch 8, batch 21250, loss[loss=0.2375, simple_loss=0.3061, pruned_loss=0.08446, over 21530.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2883, pruned_loss=0.0781, over 4257511.77 frames. 
], batch size: 195, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:18:41,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.434e+02 5.437e+02 7.242e+02 2.137e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-23 05:20:05,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1408512.0, ans=0.0 2023-06-23 05:20:10,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1408572.0, ans=0.125 2023-06-23 05:20:11,408 INFO [train.py:996] (1/4) Epoch 8, batch 21300, loss[loss=0.2506, simple_loss=0.3246, pruned_loss=0.08833, over 21518.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.296, pruned_loss=0.08092, over 4260805.84 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:20:28,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-23 05:20:41,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1408632.0, ans=0.2 2023-06-23 05:21:09,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.82 vs. limit=22.5 2023-06-23 05:21:20,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1408752.0, ans=0.2 2023-06-23 05:21:38,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1408812.0, ans=0.125 2023-06-23 05:21:54,430 INFO [train.py:996] (1/4) Epoch 8, batch 21350, loss[loss=0.2208, simple_loss=0.3102, pruned_loss=0.06569, over 21386.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.2997, pruned_loss=0.08088, over 4263270.40 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:22:10,268 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.040e+02 5.053e+02 6.684e+02 9.217e+02 2.330e+03, threshold=1.337e+03, percent-clipped=18.0 2023-06-23 05:22:53,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1408992.0, ans=0.125 2023-06-23 05:23:05,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1409052.0, ans=15.0 2023-06-23 05:23:34,657 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-23 05:23:38,548 INFO [train.py:996] (1/4) Epoch 8, batch 21400, loss[loss=0.2902, simple_loss=0.3558, pruned_loss=0.1123, over 21419.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3043, pruned_loss=0.08131, over 4268940.09 frames. 
], batch size: 471, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:24:38,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1409352.0, ans=0.125 2023-06-23 05:24:51,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1409412.0, ans=0.125 2023-06-23 05:25:10,503 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:25:11,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-23 05:25:22,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-23 05:25:22,975 INFO [train.py:996] (1/4) Epoch 8, batch 21450, loss[loss=0.253, simple_loss=0.3148, pruned_loss=0.09562, over 21922.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3069, pruned_loss=0.08226, over 4276284.02 frames. ], batch size: 316, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:25:28,997 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.984e+02 4.393e+02 5.335e+02 6.741e+02 1.398e+03, threshold=1.067e+03, percent-clipped=1.0 2023-06-23 05:25:48,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1409532.0, ans=0.125 2023-06-23 05:25:49,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-23 05:26:01,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1409592.0, ans=0.2 2023-06-23 05:26:02,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1409592.0, ans=0.0 2023-06-23 05:26:10,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1409592.0, ans=0.125 2023-06-23 05:26:28,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1409652.0, ans=0.125 2023-06-23 05:27:01,238 INFO [train.py:996] (1/4) Epoch 8, batch 21500, loss[loss=0.2599, simple_loss=0.3209, pruned_loss=0.09949, over 21682.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3056, pruned_loss=0.08358, over 4276348.43 frames. ], batch size: 230, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:27:13,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409772.0, ans=0.1 2023-06-23 05:27:19,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1409772.0, ans=0.125 2023-06-23 05:27:37,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409892.0, ans=0.1 2023-06-23 05:27:55,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. 
limit=15.0 2023-06-23 05:28:27,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1410012.0, ans=0.0 2023-06-23 05:28:39,691 INFO [train.py:996] (1/4) Epoch 8, batch 21550, loss[loss=0.1945, simple_loss=0.2554, pruned_loss=0.06683, over 21328.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2987, pruned_loss=0.08056, over 4280565.18 frames. ], batch size: 144, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:28:46,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 4.565e+02 6.143e+02 8.904e+02 1.889e+03, threshold=1.229e+03, percent-clipped=13.0 2023-06-23 05:29:11,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1410132.0, ans=0.0 2023-06-23 05:29:49,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1410252.0, ans=0.0 2023-06-23 05:29:55,396 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-23 05:30:26,687 INFO [train.py:996] (1/4) Epoch 8, batch 21600, loss[loss=0.23, simple_loss=0.2967, pruned_loss=0.08167, over 21390.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2946, pruned_loss=0.07888, over 4280972.60 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:31:05,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1410492.0, ans=0.2 2023-06-23 05:31:12,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410492.0, ans=0.1 2023-06-23 05:31:41,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1410612.0, ans=0.125 2023-06-23 05:32:05,255 INFO [train.py:996] (1/4) Epoch 8, batch 21650, loss[loss=0.2238, simple_loss=0.3193, pruned_loss=0.06417, over 21585.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2991, pruned_loss=0.07677, over 4281780.70 frames. ], batch size: 230, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:32:06,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-23 05:32:10,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.286e+02 5.401e+02 7.635e+02 1.107e+03 2.032e+03, threshold=1.527e+03, percent-clipped=20.0 2023-06-23 05:32:36,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1410792.0, ans=0.04949747468305833 2023-06-23 05:33:03,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410852.0, ans=0.1 2023-06-23 05:33:36,444 INFO [train.py:996] (1/4) Epoch 8, batch 21700, loss[loss=0.1998, simple_loss=0.2719, pruned_loss=0.06383, over 21727.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2991, pruned_loss=0.07503, over 4287305.02 frames. 
], batch size: 112, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:34:24,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1411092.0, ans=0.07 2023-06-23 05:34:40,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1411152.0, ans=0.0 2023-06-23 05:35:02,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1411212.0, ans=0.2 2023-06-23 05:35:11,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1411212.0, ans=0.125 2023-06-23 05:35:15,293 INFO [train.py:996] (1/4) Epoch 8, batch 21750, loss[loss=0.21, simple_loss=0.2703, pruned_loss=0.07491, over 21517.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2951, pruned_loss=0.07497, over 4279381.08 frames. ], batch size: 442, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:35:24,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1411272.0, ans=0.2 2023-06-23 05:35:27,365 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.568e+02 6.230e+02 8.144e+02 2.277e+03, threshold=1.246e+03, percent-clipped=1.0 2023-06-23 05:35:49,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1411332.0, ans=10.0 2023-06-23 05:35:54,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1411392.0, ans=0.125 2023-06-23 05:36:17,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1411452.0, ans=0.125 2023-06-23 05:36:38,926 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:36:48,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1411512.0, ans=15.0 2023-06-23 05:37:01,115 INFO [train.py:996] (1/4) Epoch 8, batch 21800, loss[loss=0.231, simple_loss=0.3132, pruned_loss=0.07436, over 21610.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2944, pruned_loss=0.07696, over 4274047.61 frames. ], batch size: 247, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:37:42,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1411692.0, ans=0.2 2023-06-23 05:37:50,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1411752.0, ans=0.0 2023-06-23 05:37:50,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1411752.0, ans=0.0 2023-06-23 05:37:53,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1411752.0, ans=0.125 2023-06-23 05:37:53,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1411752.0, ans=0.0 2023-06-23 05:38:39,327 INFO [train.py:996] (1/4) Epoch 8, batch 21850, loss[loss=0.2535, simple_loss=0.3071, pruned_loss=0.09992, over 21300.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2995, pruned_loss=0.07728, over 4259334.97 frames. 
], batch size: 143, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:38:47,509 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.593e+02 6.628e+02 8.915e+02 2.617e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-23 05:38:52,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1411872.0, ans=0.125 2023-06-23 05:39:19,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1411992.0, ans=22.5 2023-06-23 05:39:20,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1411992.0, ans=0.0 2023-06-23 05:39:31,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1412052.0, ans=0.2 2023-06-23 05:39:48,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1412052.0, ans=0.125 2023-06-23 05:40:13,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1412112.0, ans=0.125 2023-06-23 05:40:20,487 INFO [train.py:996] (1/4) Epoch 8, batch 21900, loss[loss=0.2419, simple_loss=0.3357, pruned_loss=0.07401, over 19858.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3007, pruned_loss=0.07844, over 4267266.44 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:40:21,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1412172.0, ans=0.125 2023-06-23 05:40:27,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1412172.0, ans=0.95 2023-06-23 05:40:31,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1412172.0, ans=0.125 2023-06-23 05:40:38,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1412232.0, ans=0.125 2023-06-23 05:40:46,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1412232.0, ans=0.035 2023-06-23 05:40:51,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1412292.0, ans=0.2 2023-06-23 05:40:59,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1412292.0, ans=0.125 2023-06-23 05:41:27,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1412352.0, ans=22.5 2023-06-23 05:41:44,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412412.0, ans=0.1 2023-06-23 05:42:00,053 INFO [train.py:996] (1/4) Epoch 8, batch 21950, loss[loss=0.2334, simple_loss=0.3017, pruned_loss=0.08259, over 21398.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2958, pruned_loss=0.07808, over 4267619.49 frames. 
], batch size: 473, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:42:02,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1412472.0, ans=0.09899494936611666 2023-06-23 05:42:07,950 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.901e+02 4.723e+02 6.314e+02 7.880e+02 1.650e+03, threshold=1.263e+03, percent-clipped=2.0 2023-06-23 05:42:43,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1412592.0, ans=0.125 2023-06-23 05:43:21,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1412652.0, ans=0.95 2023-06-23 05:43:40,037 INFO [train.py:996] (1/4) Epoch 8, batch 22000, loss[loss=0.2093, simple_loss=0.2667, pruned_loss=0.07594, over 21196.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2896, pruned_loss=0.0747, over 4251984.45 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:43:45,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-23 05:43:47,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1412772.0, ans=0.0 2023-06-23 05:43:48,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-23 05:43:54,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1412832.0, ans=0.125 2023-06-23 05:44:31,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1412892.0, ans=0.0 2023-06-23 05:45:21,173 INFO [train.py:996] (1/4) Epoch 8, batch 22050, loss[loss=0.2497, simple_loss=0.3244, pruned_loss=0.0875, over 21477.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2948, pruned_loss=0.07641, over 4252212.37 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:45:33,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.843e+02 7.365e+02 1.302e+03 3.775e+03, threshold=1.473e+03, percent-clipped=26.0 2023-06-23 05:45:41,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1413132.0, ans=0.125 2023-06-23 05:45:45,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-23 05:45:50,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-23 05:46:25,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.07 vs. limit=12.0 2023-06-23 05:47:02,726 INFO [train.py:996] (1/4) Epoch 8, batch 22100, loss[loss=0.341, simple_loss=0.3998, pruned_loss=0.1411, over 21774.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3052, pruned_loss=0.08111, over 4247578.69 frames. 
], batch size: 441, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:47:36,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1413492.0, ans=0.125 2023-06-23 05:47:44,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1413492.0, ans=0.125 2023-06-23 05:48:15,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1413552.0, ans=0.2 2023-06-23 05:48:41,589 INFO [train.py:996] (1/4) Epoch 8, batch 22150, loss[loss=0.2474, simple_loss=0.3159, pruned_loss=0.08947, over 21253.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3091, pruned_loss=0.08297, over 4258492.11 frames. ], batch size: 159, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:48:50,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1413672.0, ans=0.025 2023-06-23 05:48:52,635 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.829e+02 6.848e+02 1.021e+03 2.130e+03, threshold=1.370e+03, percent-clipped=6.0 2023-06-23 05:49:07,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1413732.0, ans=0.2 2023-06-23 05:49:38,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1413852.0, ans=0.125 2023-06-23 05:50:21,042 INFO [train.py:996] (1/4) Epoch 8, batch 22200, loss[loss=0.2383, simple_loss=0.3358, pruned_loss=0.07043, over 21807.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3116, pruned_loss=0.08428, over 4267015.15 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:50:37,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1414032.0, ans=0.0 2023-06-23 05:50:41,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1414032.0, ans=0.07 2023-06-23 05:51:13,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414092.0, ans=0.1 2023-06-23 05:51:28,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1414152.0, ans=0.125 2023-06-23 05:51:35,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1414152.0, ans=0.125 2023-06-23 05:51:42,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1414152.0, ans=0.125 2023-06-23 05:51:42,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1414152.0, ans=0.05 2023-06-23 05:51:48,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-23 05:52:01,001 INFO [train.py:996] (1/4) Epoch 8, batch 22250, loss[loss=0.23, simple_loss=0.2884, pruned_loss=0.08575, over 21254.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3185, pruned_loss=0.08634, over 4274173.36 frames. 
], batch size: 608, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:52:12,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.033e+02 6.376e+02 9.699e+02 1.847e+03, threshold=1.275e+03, percent-clipped=11.0 2023-06-23 05:52:58,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1414392.0, ans=22.5 2023-06-23 05:53:26,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1414512.0, ans=0.125 2023-06-23 05:53:35,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414512.0, ans=0.125 2023-06-23 05:53:40,287 INFO [train.py:996] (1/4) Epoch 8, batch 22300, loss[loss=0.2495, simple_loss=0.3172, pruned_loss=0.09085, over 21741.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3207, pruned_loss=0.08836, over 4278983.56 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:53:54,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1414632.0, ans=0.07 2023-06-23 05:54:22,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 05:54:34,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1414692.0, ans=0.125 2023-06-23 05:54:34,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1414692.0, ans=0.04949747468305833 2023-06-23 05:54:36,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414752.0, ans=0.1 2023-06-23 05:54:53,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1414752.0, ans=0.125 2023-06-23 05:54:53,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1414752.0, ans=0.2 2023-06-23 05:55:08,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.48 vs. limit=22.5 2023-06-23 05:55:13,978 INFO [train.py:996] (1/4) Epoch 8, batch 22350, loss[loss=0.2074, simple_loss=0.2729, pruned_loss=0.07094, over 21954.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.319, pruned_loss=0.08918, over 4288548.99 frames. 
], batch size: 316, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:55:25,673 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 4.765e+02 6.117e+02 7.891e+02 1.509e+03, threshold=1.223e+03, percent-clipped=2.0 2023-06-23 05:55:27,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1414872.0, ans=0.2 2023-06-23 05:56:13,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1414992.0, ans=0.2 2023-06-23 05:56:19,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1415052.0, ans=0.025 2023-06-23 05:56:27,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1415052.0, ans=0.125 2023-06-23 05:56:31,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1415112.0, ans=0.125 2023-06-23 05:56:38,610 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-23 05:56:48,479 INFO [train.py:996] (1/4) Epoch 8, batch 22400, loss[loss=0.1994, simple_loss=0.2696, pruned_loss=0.06462, over 21784.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3155, pruned_loss=0.08577, over 4278683.25 frames. ], batch size: 124, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:56:53,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1415172.0, ans=0.125 2023-06-23 05:57:45,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-23 05:58:12,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-23 05:58:26,832 INFO [train.py:996] (1/4) Epoch 8, batch 22450, loss[loss=0.2353, simple_loss=0.2827, pruned_loss=0.09399, over 21332.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3098, pruned_loss=0.08455, over 4274203.84 frames. ], batch size: 177, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:58:30,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1415472.0, ans=0.125 2023-06-23 05:58:35,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. 
limit=15.0 2023-06-23 05:58:37,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 3.949e+02 5.140e+02 7.263e+02 1.360e+03, threshold=1.028e+03, percent-clipped=2.0 2023-06-23 05:59:39,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1415652.0, ans=0.125 2023-06-23 05:59:49,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1415712.0, ans=0.125 2023-06-23 05:59:53,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1415712.0, ans=0.125 2023-06-23 06:00:05,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1415772.0, ans=0.0 2023-06-23 06:00:06,813 INFO [train.py:996] (1/4) Epoch 8, batch 22500, loss[loss=0.2576, simple_loss=0.3496, pruned_loss=0.08287, over 21569.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3043, pruned_loss=0.08359, over 4269115.39 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:00:49,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5 2023-06-23 06:01:13,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1415952.0, ans=0.125 2023-06-23 06:01:47,310 INFO [train.py:996] (1/4) Epoch 8, batch 22550, loss[loss=0.2657, simple_loss=0.3329, pruned_loss=0.0993, over 21782.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3084, pruned_loss=0.0838, over 4272534.87 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:02:04,071 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 5.264e+02 6.977e+02 1.047e+03 2.151e+03, threshold=1.395e+03, percent-clipped=25.0 2023-06-23 06:02:46,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1416192.0, ans=0.07 2023-06-23 06:02:53,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1416252.0, ans=0.125 2023-06-23 06:03:29,274 INFO [train.py:996] (1/4) Epoch 8, batch 22600, loss[loss=0.2987, simple_loss=0.3843, pruned_loss=0.1066, over 21633.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3123, pruned_loss=0.08458, over 4274656.52 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:03:41,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-23 06:04:03,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-23 06:05:05,357 INFO [train.py:996] (1/4) Epoch 8, batch 22650, loss[loss=0.2185, simple_loss=0.2825, pruned_loss=0.07726, over 21415.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3079, pruned_loss=0.08404, over 4277106.75 frames. 
], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:05:21,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.131e+02 9.012e+02 1.354e+03 2.560e+03, threshold=1.802e+03, percent-clipped=24.0 2023-06-23 06:05:47,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-23 06:06:01,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1416852.0, ans=0.125 2023-06-23 06:06:10,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1416852.0, ans=0.1 2023-06-23 06:06:29,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.62 vs. limit=6.0 2023-06-23 06:06:37,812 INFO [train.py:996] (1/4) Epoch 8, batch 22700, loss[loss=0.245, simple_loss=0.2952, pruned_loss=0.09746, over 21261.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3007, pruned_loss=0.08261, over 4284155.14 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:08:16,131 INFO [train.py:996] (1/4) Epoch 8, batch 22750, loss[loss=0.1967, simple_loss=0.248, pruned_loss=0.07273, over 20733.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3031, pruned_loss=0.08466, over 4269221.69 frames. ], batch size: 609, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:08:31,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.804e+02 6.420e+02 9.928e+02 2.099e+03, threshold=1.284e+03, percent-clipped=4.0 2023-06-23 06:09:48,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1417512.0, ans=0.125 2023-06-23 06:09:54,223 INFO [train.py:996] (1/4) Epoch 8, batch 22800, loss[loss=0.2418, simple_loss=0.3146, pruned_loss=0.08445, over 21827.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3066, pruned_loss=0.08653, over 4278008.48 frames. ], batch size: 124, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:10:40,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1417692.0, ans=0.0 2023-06-23 06:11:32,455 INFO [train.py:996] (1/4) Epoch 8, batch 22850, loss[loss=0.2149, simple_loss=0.2772, pruned_loss=0.07635, over 21704.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3043, pruned_loss=0.08597, over 4269606.24 frames. ], batch size: 333, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:11:45,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1417872.0, ans=0.125 2023-06-23 06:11:49,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.341e+02 7.317e+02 9.622e+02 1.873e+03, threshold=1.463e+03, percent-clipped=13.0 2023-06-23 06:12:04,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417932.0, ans=0.1 2023-06-23 06:12:07,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417932.0, ans=0.1 2023-06-23 06:12:17,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-23 06:12:40,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1418052.0, ans=0.125 2023-06-23 06:12:45,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1418112.0, ans=0.95 2023-06-23 06:13:06,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1418172.0, ans=0.125 2023-06-23 06:13:07,139 INFO [train.py:996] (1/4) Epoch 8, batch 22900, loss[loss=0.2519, simple_loss=0.3704, pruned_loss=0.06673, over 21169.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.307, pruned_loss=0.08539, over 4257246.76 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:13:27,343 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:13:48,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1418292.0, ans=0.125 2023-06-23 06:14:56,820 INFO [train.py:996] (1/4) Epoch 8, batch 22950, loss[loss=0.2395, simple_loss=0.3408, pruned_loss=0.06906, over 21568.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3209, pruned_loss=0.08341, over 4266826.06 frames. ], batch size: 230, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:15:10,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.029e+02 4.953e+02 7.269e+02 1.039e+03 2.026e+03, threshold=1.454e+03, percent-clipped=12.0 2023-06-23 06:15:16,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1418532.0, ans=0.125 2023-06-23 06:15:32,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1418592.0, ans=0.125 2023-06-23 06:16:03,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1418652.0, ans=0.2 2023-06-23 06:16:30,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1418712.0, ans=0.0 2023-06-23 06:16:36,804 INFO [train.py:996] (1/4) Epoch 8, batch 23000, loss[loss=0.2671, simple_loss=0.3301, pruned_loss=0.1021, over 21620.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3211, pruned_loss=0.08147, over 4267179.86 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:17:12,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1418892.0, ans=0.125 2023-06-23 06:17:14,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1418892.0, ans=0.0 2023-06-23 06:17:16,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1418892.0, ans=0.2 2023-06-23 06:17:47,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. 
limit=22.5 2023-06-23 06:17:50,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1419012.0, ans=0.125 2023-06-23 06:17:56,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419012.0, ans=0.1 2023-06-23 06:18:00,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1419012.0, ans=0.125 2023-06-23 06:18:07,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.88 vs. limit=6.0 2023-06-23 06:18:12,373 INFO [train.py:996] (1/4) Epoch 8, batch 23050, loss[loss=0.288, simple_loss=0.3519, pruned_loss=0.112, over 21479.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3213, pruned_loss=0.08378, over 4273576.83 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:18:25,326 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.192e+02 4.592e+02 5.368e+02 6.927e+02 1.540e+03, threshold=1.074e+03, percent-clipped=1.0 2023-06-23 06:18:25,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1419072.0, ans=0.125 2023-06-23 06:18:38,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1419132.0, ans=0.0 2023-06-23 06:18:41,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1419132.0, ans=0.0 2023-06-23 06:19:05,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1419252.0, ans=0.0 2023-06-23 06:19:11,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-23 06:19:16,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1419252.0, ans=0.0 2023-06-23 06:19:47,039 INFO [train.py:996] (1/4) Epoch 8, batch 23100, loss[loss=0.2025, simple_loss=0.2592, pruned_loss=0.0729, over 20685.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3171, pruned_loss=0.08462, over 4273547.94 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:19:59,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1419372.0, ans=0.125 2023-06-23 06:20:32,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1419492.0, ans=0.0 2023-06-23 06:20:39,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-23 06:20:41,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1419552.0, ans=0.125 2023-06-23 06:21:21,810 INFO [train.py:996] (1/4) Epoch 8, batch 23150, loss[loss=0.2433, simple_loss=0.3058, pruned_loss=0.09043, over 21520.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3112, pruned_loss=0.08406, over 4274843.15 frames. 
], batch size: 131, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:21:34,635 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.444e+02 4.721e+02 6.329e+02 9.421e+02 1.968e+03, threshold=1.266e+03, percent-clipped=20.0 2023-06-23 06:21:49,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419732.0, ans=0.1 2023-06-23 06:22:31,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419852.0, ans=0.1 2023-06-23 06:22:38,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-23 06:22:42,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419912.0, ans=0.1 2023-06-23 06:22:59,195 INFO [train.py:996] (1/4) Epoch 8, batch 23200, loss[loss=0.2314, simple_loss=0.297, pruned_loss=0.08291, over 21366.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.31, pruned_loss=0.08523, over 4278138.12 frames. ], batch size: 176, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:23:12,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1419972.0, ans=0.125 2023-06-23 06:23:22,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1420032.0, ans=0.07 2023-06-23 06:23:30,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-23 06:23:58,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420152.0, ans=0.1 2023-06-23 06:24:06,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1420152.0, ans=0.125 2023-06-23 06:24:15,782 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-23 06:24:16,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1420152.0, ans=0.125 2023-06-23 06:24:36,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1420272.0, ans=0.125 2023-06-23 06:24:37,825 INFO [train.py:996] (1/4) Epoch 8, batch 23250, loss[loss=0.2784, simple_loss=0.3347, pruned_loss=0.1111, over 21477.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3099, pruned_loss=0.08606, over 4284010.00 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:24:43,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1420272.0, ans=0.0 2023-06-23 06:24:49,271 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:24:50,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.546e+02 4.969e+02 6.559e+02 1.052e+03 2.390e+03, threshold=1.312e+03, percent-clipped=18.0 2023-06-23 06:24:56,601 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.90 vs. 
limit=12.0 2023-06-23 06:25:00,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420332.0, ans=0.1 2023-06-23 06:25:17,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-23 06:25:28,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1420392.0, ans=0.1 2023-06-23 06:25:40,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1420452.0, ans=0.05 2023-06-23 06:26:18,076 INFO [train.py:996] (1/4) Epoch 8, batch 23300, loss[loss=0.2477, simple_loss=0.3458, pruned_loss=0.07475, over 21428.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3155, pruned_loss=0.08642, over 4288646.96 frames. ], batch size: 194, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:27:03,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1420692.0, ans=0.125 2023-06-23 06:27:31,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1420752.0, ans=22.5 2023-06-23 06:27:37,831 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-23 06:27:49,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1420812.0, ans=0.09899494936611666 2023-06-23 06:27:58,462 INFO [train.py:996] (1/4) Epoch 8, batch 23350, loss[loss=0.1804, simple_loss=0.2529, pruned_loss=0.05393, over 21165.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3196, pruned_loss=0.08554, over 4284144.64 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:28:10,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1420872.0, ans=0.0 2023-06-23 06:28:18,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.315e+02 4.912e+02 6.155e+02 8.820e+02 1.771e+03, threshold=1.231e+03, percent-clipped=5.0 2023-06-23 06:28:20,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1420932.0, ans=0.0 2023-06-23 06:29:19,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1421112.0, ans=0.2 2023-06-23 06:29:37,123 INFO [train.py:996] (1/4) Epoch 8, batch 23400, loss[loss=0.2556, simple_loss=0.3218, pruned_loss=0.09473, over 21780.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3128, pruned_loss=0.08087, over 4267035.57 frames. ], batch size: 441, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:30:06,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1421232.0, ans=0.1 2023-06-23 06:30:07,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1421232.0, ans=0.0 2023-06-23 06:31:20,230 INFO [train.py:996] (1/4) Epoch 8, batch 23450, loss[loss=0.2645, simple_loss=0.3297, pruned_loss=0.09966, over 21747.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3141, pruned_loss=0.08415, over 4266945.70 frames. 
], batch size: 332, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:38,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.296e+02 5.237e+02 7.563e+02 1.579e+03, threshold=1.047e+03, percent-clipped=8.0 2023-06-23 06:32:10,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1421592.0, ans=0.0 2023-06-23 06:32:11,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=22.5 2023-06-23 06:32:32,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.00 vs. limit=10.0 2023-06-23 06:32:40,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1421712.0, ans=0.125 2023-06-23 06:32:58,364 INFO [train.py:996] (1/4) Epoch 8, batch 23500, loss[loss=0.22, simple_loss=0.2873, pruned_loss=0.07635, over 21421.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3144, pruned_loss=0.08607, over 4273107.43 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:33:45,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-23 06:34:35,806 INFO [train.py:996] (1/4) Epoch 8, batch 23550, loss[loss=0.231, simple_loss=0.2896, pruned_loss=0.08623, over 21671.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3104, pruned_loss=0.08588, over 4271700.83 frames. ], batch size: 264, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:34:54,235 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.322e+02 4.998e+02 7.038e+02 9.548e+02 2.153e+03, threshold=1.408e+03, percent-clipped=14.0 2023-06-23 06:35:30,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-23 06:36:18,156 INFO [train.py:996] (1/4) Epoch 8, batch 23600, loss[loss=0.2478, simple_loss=0.3208, pruned_loss=0.08735, over 21669.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3111, pruned_loss=0.08667, over 4273711.60 frames. ], batch size: 351, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:36:54,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1422432.0, ans=0.0 2023-06-23 06:37:39,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1422612.0, ans=0.0 2023-06-23 06:37:58,177 INFO [train.py:996] (1/4) Epoch 8, batch 23650, loss[loss=0.3069, simple_loss=0.3747, pruned_loss=0.1196, over 21450.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.312, pruned_loss=0.08522, over 4277882.81 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:38:22,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 4.602e+02 5.917e+02 8.221e+02 1.589e+03, threshold=1.183e+03, percent-clipped=3.0 2023-06-23 06:39:17,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1422852.0, ans=0.125 2023-06-23 06:39:48,554 INFO [train.py:996] (1/4) Epoch 8, batch 23700, loss[loss=0.2121, simple_loss=0.2913, pruned_loss=0.06643, over 21926.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3126, pruned_loss=0.08403, over 4277242.99 frames. 
], batch size: 317, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:39:57,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-23 06:40:19,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1423032.0, ans=0.125 2023-06-23 06:40:58,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1423152.0, ans=0.125 2023-06-23 06:41:14,748 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:41:28,738 INFO [train.py:996] (1/4) Epoch 8, batch 23750, loss[loss=0.2583, simple_loss=0.3271, pruned_loss=0.09481, over 21641.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3184, pruned_loss=0.08638, over 4276925.79 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:41:42,899 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 4.173e+02 5.450e+02 7.281e+02 1.269e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-23 06:41:43,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1423332.0, ans=0.0 2023-06-23 06:42:53,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=15.0 2023-06-23 06:43:07,499 INFO [train.py:996] (1/4) Epoch 8, batch 23800, loss[loss=0.2602, simple_loss=0.3536, pruned_loss=0.08337, over 21775.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3174, pruned_loss=0.08372, over 4283101.66 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:43:14,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-23 06:43:17,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1423572.0, ans=0.125 2023-06-23 06:43:53,719 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-23 06:44:11,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423692.0, ans=0.125 2023-06-23 06:44:14,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1423752.0, ans=0.95 2023-06-23 06:44:32,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423812.0, ans=0.1 2023-06-23 06:44:47,958 INFO [train.py:996] (1/4) Epoch 8, batch 23850, loss[loss=0.249, simple_loss=0.3283, pruned_loss=0.08487, over 21660.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3214, pruned_loss=0.08405, over 4278722.35 frames. ], batch size: 230, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:45:07,739 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.341e+02 5.290e+02 6.961e+02 9.016e+02 2.497e+03, threshold=1.392e+03, percent-clipped=15.0 2023-06-23 06:46:33,157 INFO [train.py:996] (1/4) Epoch 8, batch 23900, loss[loss=0.28, simple_loss=0.362, pruned_loss=0.09903, over 20714.00 frames. 
], tot_loss[loss=0.2507, simple_loss=0.3279, pruned_loss=0.0867, over 4275664.10 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:47:12,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1424292.0, ans=0.2 2023-06-23 06:47:26,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.48 vs. limit=22.5 2023-06-23 06:48:03,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1424412.0, ans=0.2 2023-06-23 06:48:06,331 INFO [train.py:996] (1/4) Epoch 8, batch 23950, loss[loss=0.2791, simple_loss=0.396, pruned_loss=0.0811, over 20804.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3231, pruned_loss=0.08679, over 4274745.16 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:48:10,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1424472.0, ans=0.125 2023-06-23 06:48:25,218 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.604e+02 5.747e+02 7.946e+02 1.092e+03 1.988e+03, threshold=1.589e+03, percent-clipped=11.0 2023-06-23 06:48:55,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1424592.0, ans=0.2 2023-06-23 06:49:02,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1424592.0, ans=0.0 2023-06-23 06:49:24,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-23 06:49:45,380 INFO [train.py:996] (1/4) Epoch 8, batch 24000, loss[loss=0.2792, simple_loss=0.3391, pruned_loss=0.1096, over 21462.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3246, pruned_loss=0.08968, over 4282766.70 frames. ], batch size: 211, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:49:45,380 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 06:50:04,122 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2639, simple_loss=0.3603, pruned_loss=0.08376, over 1796401.00 frames. 2023-06-23 06:50:04,123 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 06:50:51,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-23 06:51:43,402 INFO [train.py:996] (1/4) Epoch 8, batch 24050, loss[loss=0.2069, simple_loss=0.2836, pruned_loss=0.06508, over 21244.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3251, pruned_loss=0.08998, over 4283197.29 frames. 
], batch size: 159, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:51:58,974 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:52:08,109 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.279e+02 4.739e+02 5.574e+02 8.138e+02 1.478e+03, threshold=1.115e+03, percent-clipped=0.0 2023-06-23 06:52:29,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425192.0, ans=0.1 2023-06-23 06:53:05,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1425312.0, ans=0.125 2023-06-23 06:53:06,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1425312.0, ans=0.125 2023-06-23 06:53:28,956 INFO [train.py:996] (1/4) Epoch 8, batch 24100, loss[loss=0.3116, simple_loss=0.375, pruned_loss=0.1241, over 21485.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3236, pruned_loss=0.08739, over 4285209.65 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:53:40,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425372.0, ans=0.1 2023-06-23 06:54:08,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1425492.0, ans=0.0 2023-06-23 06:54:27,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1425552.0, ans=0.0 2023-06-23 06:54:43,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425612.0, ans=0.1 2023-06-23 06:55:07,257 INFO [train.py:996] (1/4) Epoch 8, batch 24150, loss[loss=0.2588, simple_loss=0.323, pruned_loss=0.09734, over 21746.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3236, pruned_loss=0.08885, over 4292161.39 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:55:22,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.625e+02 4.851e+02 6.515e+02 9.296e+02 1.728e+03, threshold=1.303e+03, percent-clipped=14.0 2023-06-23 06:55:35,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-23 06:55:37,567 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:56:17,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1425852.0, ans=0.125 2023-06-23 06:56:43,548 INFO [train.py:996] (1/4) Epoch 8, batch 24200, loss[loss=0.246, simple_loss=0.3074, pruned_loss=0.09226, over 21142.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3253, pruned_loss=0.08989, over 4293330.30 frames. 
], batch size: 143, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:56:58,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426032.0, ans=0.1 2023-06-23 06:57:08,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1426032.0, ans=0.125 2023-06-23 06:57:49,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1426152.0, ans=0.0 2023-06-23 06:58:12,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1426212.0, ans=0.125 2023-06-23 06:58:14,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1426212.0, ans=0.025 2023-06-23 06:58:22,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1426212.0, ans=0.125 2023-06-23 06:58:25,358 INFO [train.py:996] (1/4) Epoch 8, batch 24250, loss[loss=0.171, simple_loss=0.271, pruned_loss=0.03547, over 21730.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3217, pruned_loss=0.08335, over 4285304.18 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:58:25,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1426272.0, ans=0.125 2023-06-23 06:58:28,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1426272.0, ans=0.0 2023-06-23 06:58:30,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1426272.0, ans=0.2 2023-06-23 06:58:44,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.495e+02 7.277e+02 1.167e+03 2.451e+03, threshold=1.455e+03, percent-clipped=16.0 2023-06-23 06:59:46,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-23 06:59:56,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-23 07:00:04,121 INFO [train.py:996] (1/4) Epoch 8, batch 24300, loss[loss=0.2198, simple_loss=0.2796, pruned_loss=0.07995, over 21847.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3153, pruned_loss=0.07725, over 4280981.42 frames. ], batch size: 118, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:00:08,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1426572.0, ans=0.0 2023-06-23 07:00:20,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-23 07:00:27,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1426632.0, ans=0.125 2023-06-23 07:01:47,241 INFO [train.py:996] (1/4) Epoch 8, batch 24350, loss[loss=0.2173, simple_loss=0.2903, pruned_loss=0.07217, over 21768.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3101, pruned_loss=0.07662, over 4282443.50 frames. 
], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:01:59,233 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:02:03,696 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.784e+02 6.670e+02 9.592e+02 1.817e+03, threshold=1.334e+03, percent-clipped=7.0 2023-06-23 07:02:06,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.72 vs. limit=10.0 2023-06-23 07:03:14,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1427112.0, ans=0.0 2023-06-23 07:03:27,455 INFO [train.py:996] (1/4) Epoch 8, batch 24400, loss[loss=0.3103, simple_loss=0.3798, pruned_loss=0.1204, over 21806.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3154, pruned_loss=0.08086, over 4280557.18 frames. ], batch size: 118, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:05:01,171 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:05:07,077 INFO [train.py:996] (1/4) Epoch 8, batch 24450, loss[loss=0.2311, simple_loss=0.3275, pruned_loss=0.06738, over 21721.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3162, pruned_loss=0.08205, over 4275120.08 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:05:07,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1427472.0, ans=0.125 2023-06-23 07:05:14,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1427472.0, ans=0.125 2023-06-23 07:05:23,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 5.464e+02 7.459e+02 1.124e+03 2.090e+03, threshold=1.492e+03, percent-clipped=14.0 2023-06-23 07:05:31,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1427532.0, ans=0.125 2023-06-23 07:05:34,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1427532.0, ans=0.035 2023-06-23 07:05:39,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1427532.0, ans=0.0 2023-06-23 07:06:40,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1427712.0, ans=0.2 2023-06-23 07:06:44,617 INFO [train.py:996] (1/4) Epoch 8, batch 24500, loss[loss=0.2543, simple_loss=0.3237, pruned_loss=0.09241, over 21920.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.318, pruned_loss=0.08216, over 4282795.56 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:07:24,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1427892.0, ans=0.125 2023-06-23 07:08:05,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1427952.0, ans=0.125 2023-06-23 07:08:24,410 INFO [train.py:996] (1/4) Epoch 8, batch 24550, loss[loss=0.268, simple_loss=0.3361, pruned_loss=0.09998, over 21472.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3212, pruned_loss=0.08554, over 4286094.45 frames. 
], batch size: 194, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:08:48,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1428132.0, ans=0.07 2023-06-23 07:08:50,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.336e+02 4.677e+02 6.091e+02 7.782e+02 1.609e+03, threshold=1.218e+03, percent-clipped=3.0 2023-06-23 07:08:54,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1428132.0, ans=0.0 2023-06-23 07:08:54,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-23 07:09:15,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1428192.0, ans=0.0 2023-06-23 07:09:30,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1428252.0, ans=0.125 2023-06-23 07:09:33,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1428252.0, ans=22.5 2023-06-23 07:09:40,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1428252.0, ans=0.125 2023-06-23 07:09:44,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-23 07:10:02,262 INFO [train.py:996] (1/4) Epoch 8, batch 24600, loss[loss=0.2932, simple_loss=0.3336, pruned_loss=0.1264, over 21239.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3179, pruned_loss=0.08631, over 4278093.56 frames. ], batch size: 471, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:10:18,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1428372.0, ans=0.09899494936611666 2023-06-23 07:10:25,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-23 07:10:55,494 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:11:05,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-06-23 07:11:15,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1428552.0, ans=0.1 2023-06-23 07:11:22,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1428552.0, ans=0.0 2023-06-23 07:11:22,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1428552.0, ans=0.0 2023-06-23 07:11:25,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1428612.0, ans=6.0 2023-06-23 07:11:40,908 INFO [train.py:996] (1/4) Epoch 8, batch 24650, loss[loss=0.2194, simple_loss=0.2888, pruned_loss=0.07504, over 21885.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3115, pruned_loss=0.0852, over 4270654.93 frames. 
], batch size: 98, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:12:13,321 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.561e+02 8.132e+02 1.139e+03 1.963e+03, threshold=1.626e+03, percent-clipped=16.0 2023-06-23 07:13:19,838 INFO [train.py:996] (1/4) Epoch 8, batch 24700, loss[loss=0.2794, simple_loss=0.3538, pruned_loss=0.1025, over 21395.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3089, pruned_loss=0.08322, over 4274985.75 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:13:23,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1428972.0, ans=0.015 2023-06-23 07:13:45,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1429032.0, ans=0.025 2023-06-23 07:14:02,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429092.0, ans=0.1 2023-06-23 07:14:33,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1429152.0, ans=0.125 2023-06-23 07:14:47,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1429212.0, ans=0.0 2023-06-23 07:14:52,867 INFO [train.py:996] (1/4) Epoch 8, batch 24750, loss[loss=0.1971, simple_loss=0.2586, pruned_loss=0.06785, over 21338.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3011, pruned_loss=0.07992, over 4273895.66 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:15:19,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.105e+02 4.833e+02 6.692e+02 9.106e+02 2.171e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-23 07:15:37,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1429392.0, ans=0.0 2023-06-23 07:16:24,301 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-23 07:16:31,278 INFO [train.py:996] (1/4) Epoch 8, batch 24800, loss[loss=0.221, simple_loss=0.2882, pruned_loss=0.0769, over 21859.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2966, pruned_loss=0.08004, over 4272890.55 frames. ], batch size: 333, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:17:04,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-23 07:17:25,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1429692.0, ans=0.0 2023-06-23 07:17:35,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.19 vs. limit=15.0 2023-06-23 07:18:04,071 INFO [train.py:996] (1/4) Epoch 8, batch 24850, loss[loss=0.2337, simple_loss=0.3132, pruned_loss=0.07712, over 21062.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.2987, pruned_loss=0.08198, over 4281388.71 frames. 
], batch size: 608, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:18:28,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429932.0, ans=0.1 2023-06-23 07:18:33,271 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.718e+02 6.141e+02 8.581e+02 1.389e+03, threshold=1.228e+03, percent-clipped=1.0 2023-06-23 07:18:59,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-23 07:19:31,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1430112.0, ans=0.0 2023-06-23 07:19:49,224 INFO [train.py:996] (1/4) Epoch 8, batch 24900, loss[loss=0.2668, simple_loss=0.3338, pruned_loss=0.09995, over 21622.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3009, pruned_loss=0.08267, over 4277885.32 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:20:06,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-23 07:20:37,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1430292.0, ans=0.0 2023-06-23 07:20:38,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-23 07:21:34,318 INFO [train.py:996] (1/4) Epoch 8, batch 24950, loss[loss=0.2557, simple_loss=0.3292, pruned_loss=0.0911, over 21669.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3084, pruned_loss=0.08663, over 4280443.32 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:21:54,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1430472.0, ans=0.2 2023-06-23 07:22:03,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 4.703e+02 5.868e+02 8.505e+02 2.192e+03, threshold=1.174e+03, percent-clipped=6.0 2023-06-23 07:22:11,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1430532.0, ans=0.125 2023-06-23 07:22:24,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1430592.0, ans=0.125 2023-06-23 07:22:27,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1430592.0, ans=0.125 2023-06-23 07:22:32,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1430652.0, ans=0.125 2023-06-23 07:22:49,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1430652.0, ans=0.1 2023-06-23 07:23:00,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1430712.0, ans=0.0 2023-06-23 07:23:21,910 INFO [train.py:996] (1/4) Epoch 8, batch 25000, loss[loss=0.2146, simple_loss=0.2803, pruned_loss=0.07442, over 21704.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3153, pruned_loss=0.08823, over 4288048.24 frames. 
], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:23:33,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1430772.0, ans=0.04949747468305833 2023-06-23 07:24:31,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-23 07:24:41,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1431012.0, ans=0.125 2023-06-23 07:24:46,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-23 07:24:52,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431072.0, ans=0.1 2023-06-23 07:24:53,448 INFO [train.py:996] (1/4) Epoch 8, batch 25050, loss[loss=0.2253, simple_loss=0.285, pruned_loss=0.08277, over 21159.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3085, pruned_loss=0.08623, over 4274179.47 frames. ], batch size: 143, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:25:00,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1431072.0, ans=0.07 2023-06-23 07:25:17,637 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.487e+02 5.838e+02 7.912e+02 1.332e+03, threshold=1.168e+03, percent-clipped=3.0 2023-06-23 07:25:31,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-23 07:26:04,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-23 07:26:10,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1431312.0, ans=0.125 2023-06-23 07:26:23,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-23 07:26:33,600 INFO [train.py:996] (1/4) Epoch 8, batch 25100, loss[loss=0.2418, simple_loss=0.2942, pruned_loss=0.09468, over 21794.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3022, pruned_loss=0.08472, over 4275332.98 frames. ], batch size: 102, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:26:53,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-23 07:27:24,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1431552.0, ans=0.125 2023-06-23 07:27:39,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1431552.0, ans=0.5 2023-06-23 07:27:54,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1431612.0, ans=0.125 2023-06-23 07:28:11,766 INFO [train.py:996] (1/4) Epoch 8, batch 25150, loss[loss=0.2325, simple_loss=0.3173, pruned_loss=0.07381, over 21391.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3066, pruned_loss=0.08262, over 4275113.01 frames. 
], batch size: 194, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:28:23,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1431672.0, ans=0.125 2023-06-23 07:28:34,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.013e+02 4.444e+02 6.518e+02 1.039e+03 2.142e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-23 07:28:44,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431732.0, ans=0.1 2023-06-23 07:28:46,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1431792.0, ans=0.125 2023-06-23 07:29:15,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-23 07:29:48,562 INFO [train.py:996] (1/4) Epoch 8, batch 25200, loss[loss=0.2194, simple_loss=0.3148, pruned_loss=0.06206, over 21836.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.306, pruned_loss=0.07975, over 4274764.28 frames. ], batch size: 316, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:30:32,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1432092.0, ans=0.0 2023-06-23 07:31:04,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1432212.0, ans=0.04949747468305833 2023-06-23 07:31:26,047 INFO [train.py:996] (1/4) Epoch 8, batch 25250, loss[loss=0.2189, simple_loss=0.2977, pruned_loss=0.07008, over 15691.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3036, pruned_loss=0.07782, over 4273713.98 frames. ], batch size: 60, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:31:26,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1432272.0, ans=0.0 2023-06-23 07:31:42,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1432272.0, ans=0.0 2023-06-23 07:31:49,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.456e+02 5.347e+02 9.796e+02 2.256e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-23 07:32:01,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-23 07:32:07,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.05 vs. limit=22.5 2023-06-23 07:32:23,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1432452.0, ans=0.2 2023-06-23 07:32:49,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1432512.0, ans=0.125 2023-06-23 07:32:58,683 INFO [train.py:996] (1/4) Epoch 8, batch 25300, loss[loss=0.2147, simple_loss=0.2954, pruned_loss=0.06696, over 21733.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3017, pruned_loss=0.07793, over 4264819.69 frames. ], batch size: 298, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:33:26,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. 
limit=15.0 2023-06-23 07:33:29,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-23 07:33:40,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1432692.0, ans=0.125 2023-06-23 07:33:46,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1432692.0, ans=0.125 2023-06-23 07:34:12,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-23 07:34:37,983 INFO [train.py:996] (1/4) Epoch 8, batch 25350, loss[loss=0.2687, simple_loss=0.358, pruned_loss=0.08971, over 21823.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3032, pruned_loss=0.0775, over 4251823.37 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:34:44,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1432872.0, ans=0.125 2023-06-23 07:35:02,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.301e+02 4.631e+02 6.587e+02 1.003e+03 1.652e+03, threshold=1.317e+03, percent-clipped=14.0 2023-06-23 07:35:17,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1432992.0, ans=0.0 2023-06-23 07:35:31,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1433052.0, ans=0.2 2023-06-23 07:35:42,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1433052.0, ans=0.125 2023-06-23 07:35:48,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1433112.0, ans=0.2 2023-06-23 07:36:14,598 INFO [train.py:996] (1/4) Epoch 8, batch 25400, loss[loss=0.2269, simple_loss=0.3029, pruned_loss=0.07543, over 19883.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2993, pruned_loss=0.07752, over 4253528.13 frames. ], batch size: 703, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:36:18,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1433172.0, ans=0.1 2023-06-23 07:36:42,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1433232.0, ans=0.125 2023-06-23 07:37:20,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1433352.0, ans=0.125 2023-06-23 07:37:51,436 INFO [train.py:996] (1/4) Epoch 8, batch 25450, loss[loss=0.2001, simple_loss=0.2517, pruned_loss=0.07432, over 17092.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2997, pruned_loss=0.07857, over 4248173.38 frames. ], batch size: 64, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:37:54,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.40 vs. 
limit=15.0 2023-06-23 07:38:06,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433472.0, ans=0.125 2023-06-23 07:38:08,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-23 07:38:12,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1433532.0, ans=0.2 2023-06-23 07:38:17,197 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.125e+02 5.251e+02 6.939e+02 1.396e+03, threshold=1.050e+03, percent-clipped=1.0 2023-06-23 07:38:49,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1433652.0, ans=0.125 2023-06-23 07:39:27,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1433712.0, ans=0.0 2023-06-23 07:39:32,052 INFO [train.py:996] (1/4) Epoch 8, batch 25500, loss[loss=0.276, simple_loss=0.3634, pruned_loss=0.09432, over 21655.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3001, pruned_loss=0.07485, over 4250178.58 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:39:59,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1433832.0, ans=0.0 2023-06-23 07:40:06,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433832.0, ans=0.1 2023-06-23 07:40:28,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433952.0, ans=0.1 2023-06-23 07:40:28,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1433952.0, ans=0.07 2023-06-23 07:40:53,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1434012.0, ans=0.125 2023-06-23 07:41:11,163 INFO [train.py:996] (1/4) Epoch 8, batch 25550, loss[loss=0.2143, simple_loss=0.3209, pruned_loss=0.05383, over 21858.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3079, pruned_loss=0.07606, over 4256500.23 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:41:38,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.786e+02 4.210e+02 5.304e+02 7.417e+02 2.336e+03, threshold=1.061e+03, percent-clipped=9.0 2023-06-23 07:42:55,569 INFO [train.py:996] (1/4) Epoch 8, batch 25600, loss[loss=0.3025, simple_loss=0.365, pruned_loss=0.12, over 21785.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3124, pruned_loss=0.07694, over 4256964.17 frames. 
], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:43:05,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1434372.0, ans=0.0 2023-06-23 07:43:23,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1434432.0, ans=0.125 2023-06-23 07:44:11,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1434612.0, ans=0.0 2023-06-23 07:44:15,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1434612.0, ans=0.1 2023-06-23 07:44:33,965 INFO [train.py:996] (1/4) Epoch 8, batch 25650, loss[loss=0.2467, simple_loss=0.3063, pruned_loss=0.09357, over 21292.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3133, pruned_loss=0.07996, over 4260211.98 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:44:52,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-23 07:44:55,733 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.690e+02 5.926e+02 8.067e+02 1.090e+03 2.033e+03, threshold=1.613e+03, percent-clipped=28.0 2023-06-23 07:44:58,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-23 07:45:46,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1434852.0, ans=0.04949747468305833 2023-06-23 07:46:07,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1434912.0, ans=10.0 2023-06-23 07:46:11,857 INFO [train.py:996] (1/4) Epoch 8, batch 25700, loss[loss=0.2732, simple_loss=0.341, pruned_loss=0.1027, over 21857.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3109, pruned_loss=0.08164, over 4253635.26 frames. ], batch size: 124, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:46:23,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1434972.0, ans=0.125 2023-06-23 07:46:34,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1435032.0, ans=0.0 2023-06-23 07:46:34,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1435032.0, ans=0.2 2023-06-23 07:47:52,548 INFO [train.py:996] (1/4) Epoch 8, batch 25750, loss[loss=0.3118, simple_loss=0.3801, pruned_loss=0.1218, over 21580.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3154, pruned_loss=0.08405, over 4257618.95 frames. 
], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:47:58,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1435272.0, ans=0.0 2023-06-23 07:48:25,288 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 5.092e+02 6.488e+02 8.589e+02 2.442e+03, threshold=1.298e+03, percent-clipped=2.0 2023-06-23 07:48:25,729 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:49:03,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1435452.0, ans=0.2 2023-06-23 07:49:38,480 INFO [train.py:996] (1/4) Epoch 8, batch 25800, loss[loss=0.263, simple_loss=0.3355, pruned_loss=0.0952, over 21731.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3286, pruned_loss=0.08876, over 4254080.69 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:50:04,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.13 vs. limit=10.0 2023-06-23 07:50:33,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1435692.0, ans=0.125 2023-06-23 07:50:59,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-23 07:51:08,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1435812.0, ans=0.0 2023-06-23 07:51:10,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1435812.0, ans=0.05 2023-06-23 07:51:22,180 INFO [train.py:996] (1/4) Epoch 8, batch 25850, loss[loss=0.2866, simple_loss=0.3388, pruned_loss=0.1172, over 21403.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3297, pruned_loss=0.08817, over 4265483.23 frames. ], batch size: 144, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:51:45,495 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.179e+02 4.988e+02 6.409e+02 1.000e+03 3.081e+03, threshold=1.282e+03, percent-clipped=14.0 2023-06-23 07:51:55,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1435932.0, ans=0.0 2023-06-23 07:52:09,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1435992.0, ans=0.09899494936611666 2023-06-23 07:52:11,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1435992.0, ans=0.0 2023-06-23 07:52:17,952 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:52:54,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1436112.0, ans=0.1 2023-06-23 07:52:57,082 INFO [train.py:996] (1/4) Epoch 8, batch 25900, loss[loss=0.2301, simple_loss=0.3254, pruned_loss=0.06737, over 19958.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3309, pruned_loss=0.0889, over 4273959.31 frames. 
], batch size: 702, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:53:06,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-23 07:53:31,714 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-23 07:53:40,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1436292.0, ans=0.125 2023-06-23 07:54:24,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1436412.0, ans=0.125 2023-06-23 07:54:36,050 INFO [train.py:996] (1/4) Epoch 8, batch 25950, loss[loss=0.3076, simple_loss=0.3665, pruned_loss=0.1244, over 21368.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3374, pruned_loss=0.09181, over 4276785.79 frames. ], batch size: 507, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:55:03,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 4.822e+02 6.504e+02 9.167e+02 2.432e+03, threshold=1.301e+03, percent-clipped=14.0 2023-06-23 07:55:59,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1436712.0, ans=0.0 2023-06-23 07:56:12,147 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:56:14,703 INFO [train.py:996] (1/4) Epoch 8, batch 26000, loss[loss=0.2356, simple_loss=0.3143, pruned_loss=0.07845, over 21350.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3371, pruned_loss=0.09034, over 4280740.97 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:56:40,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1436832.0, ans=0.125 2023-06-23 07:57:36,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1437012.0, ans=0.125 2023-06-23 07:57:49,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=15.0 2023-06-23 07:57:57,835 INFO [train.py:996] (1/4) Epoch 8, batch 26050, loss[loss=0.2403, simple_loss=0.3042, pruned_loss=0.08823, over 21493.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3359, pruned_loss=0.0909, over 4276412.31 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:57:58,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-23 07:58:13,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1437132.0, ans=0.0 2023-06-23 07:58:19,780 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 4.589e+02 6.004e+02 7.871e+02 1.709e+03, threshold=1.201e+03, percent-clipped=5.0 2023-06-23 07:59:06,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1437252.0, ans=0.0 2023-06-23 07:59:36,462 INFO [train.py:996] (1/4) Epoch 8, batch 26100, loss[loss=0.224, simple_loss=0.2917, pruned_loss=0.07819, over 21950.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3302, pruned_loss=0.08974, over 4275040.93 frames. 
], batch size: 316, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:59:49,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1437372.0, ans=0.125 2023-06-23 08:00:42,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1437552.0, ans=0.125 2023-06-23 08:00:50,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1437552.0, ans=0.0 2023-06-23 08:00:51,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1437552.0, ans=0.125 2023-06-23 08:01:09,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1437612.0, ans=22.5 2023-06-23 08:01:16,867 INFO [train.py:996] (1/4) Epoch 8, batch 26150, loss[loss=0.2229, simple_loss=0.2973, pruned_loss=0.07423, over 21734.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3264, pruned_loss=0.08895, over 4271553.35 frames. ], batch size: 332, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:01:19,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-23 08:01:28,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437672.0, ans=0.1 2023-06-23 08:01:30,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1437672.0, ans=0.125 2023-06-23 08:01:45,503 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 4.992e+02 6.219e+02 9.688e+02 1.983e+03, threshold=1.244e+03, percent-clipped=15.0 2023-06-23 08:01:58,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1437792.0, ans=0.125 2023-06-23 08:02:25,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1437852.0, ans=0.0 2023-06-23 08:02:32,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1437852.0, ans=0.125 2023-06-23 08:02:44,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1437912.0, ans=0.0 2023-06-23 08:02:55,528 INFO [train.py:996] (1/4) Epoch 8, batch 26200, loss[loss=0.2654, simple_loss=0.3755, pruned_loss=0.07766, over 20844.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3267, pruned_loss=0.08729, over 4269472.12 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:03:07,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1437972.0, ans=0.125 2023-06-23 08:03:36,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-23 08:04:34,819 INFO [train.py:996] (1/4) Epoch 8, batch 26250, loss[loss=0.2783, simple_loss=0.3533, pruned_loss=0.1016, over 21877.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3298, pruned_loss=0.08613, over 4269710.19 frames. 
], batch size: 107, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:04:35,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1438272.0, ans=0.125 2023-06-23 08:05:07,708 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 4.875e+02 6.519e+02 1.074e+03 2.423e+03, threshold=1.304e+03, percent-clipped=19.0 2023-06-23 08:05:22,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1438392.0, ans=0.125 2023-06-23 08:05:41,168 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:05:47,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1438452.0, ans=0.125 2023-06-23 08:06:12,270 INFO [train.py:996] (1/4) Epoch 8, batch 26300, loss[loss=0.2704, simple_loss=0.3306, pruned_loss=0.1051, over 21461.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.327, pruned_loss=0.08677, over 4272694.99 frames. ], batch size: 144, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:06:17,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1438572.0, ans=0.125 2023-06-23 08:07:26,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1438752.0, ans=0.125 2023-06-23 08:08:01,857 INFO [train.py:996] (1/4) Epoch 8, batch 26350, loss[loss=0.2782, simple_loss=0.3583, pruned_loss=0.09904, over 21783.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3253, pruned_loss=0.08718, over 4273126.60 frames. ], batch size: 124, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:08:30,359 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 4.985e+02 6.232e+02 7.669e+02 1.189e+03, threshold=1.246e+03, percent-clipped=0.0 2023-06-23 08:08:59,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1439052.0, ans=0.0 2023-06-23 08:09:20,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1439112.0, ans=0.125 2023-06-23 08:09:40,192 INFO [train.py:996] (1/4) Epoch 8, batch 26400, loss[loss=0.2118, simple_loss=0.2768, pruned_loss=0.07342, over 21133.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3195, pruned_loss=0.08767, over 4270360.63 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:10:25,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1439292.0, ans=0.125 2023-06-23 08:11:02,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-23 08:11:08,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.38 vs. limit=22.5 2023-06-23 08:11:20,569 INFO [train.py:996] (1/4) Epoch 8, batch 26450, loss[loss=0.2988, simple_loss=0.3916, pruned_loss=0.103, over 21772.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3196, pruned_loss=0.08773, over 4267664.67 frames. 
], batch size: 351, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:11:28,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439472.0, ans=0.125 2023-06-23 08:11:50,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 6.327e+02 8.779e+02 1.313e+03 2.472e+03, threshold=1.756e+03, percent-clipped=25.0 2023-06-23 08:11:56,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1439592.0, ans=0.125 2023-06-23 08:12:07,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1439592.0, ans=0.125 2023-06-23 08:12:30,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1439652.0, ans=0.125 2023-06-23 08:13:01,283 INFO [train.py:996] (1/4) Epoch 8, batch 26500, loss[loss=0.2656, simple_loss=0.3527, pruned_loss=0.08927, over 21687.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.323, pruned_loss=0.086, over 4262601.31 frames. ], batch size: 414, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:13:36,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.64 vs. limit=10.0 2023-06-23 08:13:53,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-23 08:14:25,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1440012.0, ans=10.0 2023-06-23 08:14:39,361 INFO [train.py:996] (1/4) Epoch 8, batch 26550, loss[loss=0.2018, simple_loss=0.2744, pruned_loss=0.06461, over 21537.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3194, pruned_loss=0.08328, over 4260722.56 frames. ], batch size: 195, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:15:15,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440132.0, ans=0.1 2023-06-23 08:15:20,125 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 5.263e+02 8.028e+02 1.102e+03 2.204e+03, threshold=1.606e+03, percent-clipped=5.0 2023-06-23 08:15:41,536 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.55 vs. limit=10.0 2023-06-23 08:15:47,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1440252.0, ans=10.0 2023-06-23 08:16:14,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1440312.0, ans=0.125 2023-06-23 08:16:23,298 INFO [train.py:996] (1/4) Epoch 8, batch 26600, loss[loss=0.2009, simple_loss=0.2724, pruned_loss=0.06465, over 21590.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3176, pruned_loss=0.08038, over 4249689.27 frames. 
], batch size: 247, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:16:28,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1440372.0, ans=0.125 2023-06-23 08:16:32,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-23 08:17:45,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-23 08:17:47,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.33 vs. limit=10.0 2023-06-23 08:18:02,016 INFO [train.py:996] (1/4) Epoch 8, batch 26650, loss[loss=0.1657, simple_loss=0.2453, pruned_loss=0.04305, over 21417.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3109, pruned_loss=0.0792, over 4259124.76 frames. ], batch size: 194, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:18:02,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440672.0, ans=0.1 2023-06-23 08:18:36,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.294e+02 5.616e+02 7.721e+02 1.631e+03, threshold=1.123e+03, percent-clipped=1.0 2023-06-23 08:18:54,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-23 08:19:24,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1440912.0, ans=0.2 2023-06-23 08:19:39,937 INFO [train.py:996] (1/4) Epoch 8, batch 26700, loss[loss=0.2608, simple_loss=0.3309, pruned_loss=0.09538, over 21753.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3029, pruned_loss=0.0759, over 4262002.60 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:20:21,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-23 08:20:54,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1441152.0, ans=15.0 2023-06-23 08:20:55,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1441152.0, ans=15.0 2023-06-23 08:21:25,379 INFO [train.py:996] (1/4) Epoch 8, batch 26750, loss[loss=0.2793, simple_loss=0.3677, pruned_loss=0.09547, over 21471.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3038, pruned_loss=0.075, over 4261481.99 frames. 
], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:21:29,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1441272.0, ans=0.125 2023-06-23 08:21:34,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1441272.0, ans=0.1 2023-06-23 08:21:56,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 4.314e+02 5.876e+02 8.992e+02 1.662e+03, threshold=1.175e+03, percent-clipped=13.0 2023-06-23 08:22:01,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1441392.0, ans=0.0 2023-06-23 08:22:40,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1441452.0, ans=0.0 2023-06-23 08:23:11,023 INFO [train.py:996] (1/4) Epoch 8, batch 26800, loss[loss=0.2405, simple_loss=0.3141, pruned_loss=0.08341, over 20679.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3102, pruned_loss=0.07835, over 4263206.68 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:23:30,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1441632.0, ans=0.125 2023-06-23 08:24:01,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1441692.0, ans=0.04949747468305833 2023-06-23 08:24:16,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1441752.0, ans=0.125 2023-06-23 08:24:42,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1441812.0, ans=0.2 2023-06-23 08:24:49,805 INFO [train.py:996] (1/4) Epoch 8, batch 26850, loss[loss=0.2172, simple_loss=0.2987, pruned_loss=0.06781, over 20108.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3111, pruned_loss=0.08111, over 4265150.18 frames. 
], batch size: 703, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:25:13,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1441932.0, ans=10.0 2023-06-23 08:25:14,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 5.050e+02 6.196e+02 9.210e+02 1.737e+03, threshold=1.239e+03, percent-clipped=8.0 2023-06-23 08:25:21,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1441992.0, ans=0.0 2023-06-23 08:25:34,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1441992.0, ans=0.0 2023-06-23 08:25:45,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1442052.0, ans=0.125 2023-06-23 08:25:48,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1442052.0, ans=0.0 2023-06-23 08:26:00,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1442052.0, ans=0.125 2023-06-23 08:26:15,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1442112.0, ans=0.125 2023-06-23 08:26:18,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442112.0, ans=0.1 2023-06-23 08:26:22,984 INFO [train.py:996] (1/4) Epoch 8, batch 26900, loss[loss=0.2403, simple_loss=0.2779, pruned_loss=0.1014, over 21553.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3039, pruned_loss=0.08081, over 4258819.71 frames. ], batch size: 512, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:26:56,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1442232.0, ans=0.0 2023-06-23 08:27:48,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1442412.0, ans=0.0 2023-06-23 08:28:02,569 INFO [train.py:996] (1/4) Epoch 8, batch 26950, loss[loss=0.2594, simple_loss=0.3543, pruned_loss=0.08226, over 21200.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3032, pruned_loss=0.08106, over 4258088.37 frames. ], batch size: 548, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:28:13,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1442472.0, ans=10.0 2023-06-23 08:28:33,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 4.817e+02 6.890e+02 1.132e+03 2.322e+03, threshold=1.378e+03, percent-clipped=18.0 2023-06-23 08:29:36,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.99 vs. limit=12.0 2023-06-23 08:29:46,693 INFO [train.py:996] (1/4) Epoch 8, batch 27000, loss[loss=0.2185, simple_loss=0.3132, pruned_loss=0.06187, over 21702.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.304, pruned_loss=0.0796, over 4260800.52 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:29:46,693 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 08:30:02,861 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.2419, simple_loss=0.3397, pruned_loss=0.07206, over 1796401.00 frames. 
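A note on reading the loss fields in these entries: the reported loss appears to be a weighted sum of simple_loss (the cheap, trivial-joiner RNN-T term used to derive pruning bounds) and pruned_loss (the full pruned RNN-T term), with the simple term weighted by 0.5 once warm-up is over; 0.5 * 0.3397 + 0.07206 ≈ 0.2419 reproduces the validation entry just logged, and 0.5 * 0.304 + 0.0796 ≈ 0.2316 reproduces the matching tot_loss at batch 27000. The sketch below is illustrative only; the warm-up ramp, the default warm_step, and the function and parameter names are assumptions, not taken from this log.

def combined_transducer_loss(
    simple_loss: float,
    pruned_loss: float,
    batch_idx_train: int,
    warm_step: int = 2000,
    simple_loss_scale: float = 0.5,
) -> float:
    """Weighted sum of the two RNN-T loss terms reported in each log entry."""
    if batch_idx_train >= warm_step:
        # Past warm-up: fixed weights, matching the arithmetic checked above.
        s, p = simple_loss_scale, 1.0
    else:
        # Assumed linear ramp: rely mostly on the simple loss early on,
        # while the pruned loss is still noisy (illustration only).
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss

# Validation entry above: 0.5 * 0.3397 + 0.07206 ~= 0.2419
print(combined_transducer_loss(0.3397, 0.07206, batch_idx_train=10_000))

The same relation holds for the running tot_loss values throughout this section, which is why loss always sits between pruned_loss and simple_loss.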
2023-06-23 08:30:02,862 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 08:31:42,047 INFO [train.py:996] (1/4) Epoch 8, batch 27050, loss[loss=0.229, simple_loss=0.3146, pruned_loss=0.07173, over 21382.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3071, pruned_loss=0.07665, over 4264510.79 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:31:44,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1443072.0, ans=0.1 2023-06-23 08:32:18,721 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.261e+02 5.762e+02 7.370e+02 1.710e+03, threshold=1.152e+03, percent-clipped=3.0 2023-06-23 08:33:00,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1443252.0, ans=0.0 2023-06-23 08:33:15,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.10 vs. limit=8.0 2023-06-23 08:33:20,881 INFO [train.py:996] (1/4) Epoch 8, batch 27100, loss[loss=0.2643, simple_loss=0.3548, pruned_loss=0.08686, over 21838.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3104, pruned_loss=0.07721, over 4264404.22 frames. ], batch size: 371, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:33:29,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1443372.0, ans=0.125 2023-06-23 08:34:10,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1443492.0, ans=0.125 2023-06-23 08:34:16,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1443492.0, ans=0.2 2023-06-23 08:34:25,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1443552.0, ans=0.125 2023-06-23 08:35:01,728 INFO [train.py:996] (1/4) Epoch 8, batch 27150, loss[loss=0.2443, simple_loss=0.3296, pruned_loss=0.07945, over 21392.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3201, pruned_loss=0.08044, over 4273861.12 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:35:42,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1443732.0, ans=0.0 2023-06-23 08:35:43,350 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.713e+02 7.787e+02 1.225e+03 2.393e+03, threshold=1.557e+03, percent-clipped=28.0 2023-06-23 08:36:26,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1443912.0, ans=0.125 2023-06-23 08:36:41,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1443912.0, ans=0.1 2023-06-23 08:36:41,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-23 08:36:46,662 INFO [train.py:996] (1/4) Epoch 8, batch 27200, loss[loss=0.3186, simple_loss=0.4257, pruned_loss=0.1057, over 20718.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3292, pruned_loss=0.08354, over 4275002.18 frames. 
], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:37:16,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-23 08:37:32,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1444092.0, ans=0.0 2023-06-23 08:37:58,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1444152.0, ans=0.0 2023-06-23 08:38:21,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.44 vs. limit=6.0 2023-06-23 08:38:36,535 INFO [train.py:996] (1/4) Epoch 8, batch 27250, loss[loss=0.2694, simple_loss=0.3362, pruned_loss=0.1013, over 21940.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3319, pruned_loss=0.08756, over 4277876.45 frames. ], batch size: 372, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:39:09,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.510e+02 6.974e+02 9.879e+02 1.721e+03, threshold=1.395e+03, percent-clipped=1.0 2023-06-23 08:40:17,675 INFO [train.py:996] (1/4) Epoch 8, batch 27300, loss[loss=0.2953, simple_loss=0.3722, pruned_loss=0.1092, over 21486.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3334, pruned_loss=0.08869, over 4273777.09 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:40:22,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1444572.0, ans=0.125 2023-06-23 08:41:57,261 INFO [train.py:996] (1/4) Epoch 8, batch 27350, loss[loss=0.257, simple_loss=0.3268, pruned_loss=0.0936, over 21794.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3354, pruned_loss=0.08939, over 4277667.72 frames. ], batch size: 247, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:42:28,133 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.684e+02 5.886e+02 7.664e+02 1.698e+03, threshold=1.177e+03, percent-clipped=3.0 2023-06-23 08:43:14,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1445052.0, ans=0.0 2023-06-23 08:43:27,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1445112.0, ans=0.04949747468305833 2023-06-23 08:43:30,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1445112.0, ans=0.1 2023-06-23 08:43:34,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1445172.0, ans=0.04949747468305833 2023-06-23 08:43:40,296 INFO [train.py:996] (1/4) Epoch 8, batch 27400, loss[loss=0.2711, simple_loss=0.3177, pruned_loss=0.1122, over 21458.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3309, pruned_loss=0.08952, over 4282639.97 frames. 
], batch size: 508, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:43:52,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1445172.0, ans=0.07 2023-06-23 08:44:26,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445292.0, ans=0.0 2023-06-23 08:44:40,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1445292.0, ans=0.0 2023-06-23 08:44:48,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1445352.0, ans=0.2 2023-06-23 08:45:19,327 INFO [train.py:996] (1/4) Epoch 8, batch 27450, loss[loss=0.2403, simple_loss=0.3262, pruned_loss=0.07718, over 21417.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3242, pruned_loss=0.08758, over 4290728.41 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:45:50,316 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.374e+02 5.145e+02 6.858e+02 8.934e+02 1.227e+03, threshold=1.372e+03, percent-clipped=2.0 2023-06-23 08:46:01,210 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:46:55,706 INFO [train.py:996] (1/4) Epoch 8, batch 27500, loss[loss=0.2308, simple_loss=0.3066, pruned_loss=0.07757, over 21946.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3221, pruned_loss=0.08723, over 4295917.74 frames. ], batch size: 316, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:47:02,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1445772.0, ans=0.0 2023-06-23 08:47:08,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1445772.0, ans=0.07 2023-06-23 08:47:15,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1445832.0, ans=0.125 2023-06-23 08:48:34,525 INFO [train.py:996] (1/4) Epoch 8, batch 27550, loss[loss=0.2275, simple_loss=0.2921, pruned_loss=0.08148, over 21519.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3171, pruned_loss=0.08441, over 4286969.27 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:48:39,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1446072.0, ans=0.0 2023-06-23 08:48:49,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=10.0 2023-06-23 08:49:06,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.176e+02 5.018e+02 7.145e+02 2.103e+03, threshold=1.004e+03, percent-clipped=5.0 2023-06-23 08:49:07,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1446132.0, ans=0.125 2023-06-23 08:49:35,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1446252.0, ans=0.125 2023-06-23 08:49:39,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1446252.0, ans=0.125 2023-06-23 08:49:51,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1446312.0, ans=0.125 2023-06-23 08:50:02,110 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:50:05,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1446312.0, ans=0.2 2023-06-23 08:50:07,966 INFO [train.py:996] (1/4) Epoch 8, batch 27600, loss[loss=0.2306, simple_loss=0.2918, pruned_loss=0.0847, over 21547.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3128, pruned_loss=0.08377, over 4281074.25 frames. ], batch size: 391, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:50:13,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1446372.0, ans=0.0 2023-06-23 08:50:38,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1446432.0, ans=0.125 2023-06-23 08:50:53,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-23 08:51:14,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1446552.0, ans=0.0 2023-06-23 08:51:28,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-23 08:51:45,622 INFO [train.py:996] (1/4) Epoch 8, batch 27650, loss[loss=0.2298, simple_loss=0.2927, pruned_loss=0.08342, over 21921.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3072, pruned_loss=0.0837, over 4270326.10 frames. ], batch size: 107, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:52:19,290 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.399e+02 4.844e+02 6.403e+02 8.598e+02 1.573e+03, threshold=1.281e+03, percent-clipped=18.0 2023-06-23 08:53:20,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1446972.0, ans=0.2 2023-06-23 08:53:21,713 INFO [train.py:996] (1/4) Epoch 8, batch 27700, loss[loss=0.2472, simple_loss=0.3346, pruned_loss=0.07988, over 21713.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3083, pruned_loss=0.08225, over 4272328.46 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:53:27,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.26 vs. 
limit=15.0 2023-06-23 08:54:54,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1447212.0, ans=0.05 2023-06-23 08:55:00,523 INFO [train.py:996] (1/4) Epoch 8, batch 27750, loss[loss=0.2053, simple_loss=0.2664, pruned_loss=0.0721, over 20224.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3104, pruned_loss=0.08132, over 4272634.40 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:55:00,938 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1447272.0, ans=0.125 2023-06-23 08:55:32,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 5.055e+02 6.711e+02 8.629e+02 1.749e+03, threshold=1.342e+03, percent-clipped=9.0 2023-06-23 08:56:35,742 INFO [train.py:996] (1/4) Epoch 8, batch 27800, loss[loss=0.2203, simple_loss=0.2915, pruned_loss=0.07458, over 21992.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3081, pruned_loss=0.08126, over 4270519.49 frames. ], batch size: 373, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:56:36,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-23 08:57:29,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1447692.0, ans=0.125 2023-06-23 08:57:45,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1447752.0, ans=0.125 2023-06-23 08:58:08,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1447812.0, ans=0.125 2023-06-23 08:58:08,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1447812.0, ans=0.125 2023-06-23 08:58:15,621 INFO [train.py:996] (1/4) Epoch 8, batch 27850, loss[loss=0.258, simple_loss=0.3176, pruned_loss=0.09919, over 21338.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3075, pruned_loss=0.08266, over 4280617.65 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:58:50,988 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.347e+02 5.210e+02 6.936e+02 1.592e+03, threshold=1.042e+03, percent-clipped=2.0 2023-06-23 08:59:09,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1447992.0, ans=0.1 2023-06-23 08:59:14,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1448052.0, ans=0.0 2023-06-23 08:59:57,485 INFO [train.py:996] (1/4) Epoch 8, batch 27900, loss[loss=0.1949, simple_loss=0.2759, pruned_loss=0.057, over 16561.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3154, pruned_loss=0.08318, over 4274438.34 frames. ], batch size: 60, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:00:24,893 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.80 vs. 
limit=15.0 2023-06-23 09:01:06,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1448352.0, ans=0.125 2023-06-23 09:01:34,344 INFO [train.py:996] (1/4) Epoch 8, batch 27950, loss[loss=0.2915, simple_loss=0.3958, pruned_loss=0.09356, over 20001.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3167, pruned_loss=0.08036, over 4269610.13 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:01:54,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1448532.0, ans=0.0 2023-06-23 09:01:54,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1448532.0, ans=10.0 2023-06-23 09:02:07,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1448532.0, ans=0.1 2023-06-23 09:02:08,161 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.594e+02 6.671e+02 9.534e+02 1.876e+03, threshold=1.334e+03, percent-clipped=19.0 2023-06-23 09:02:13,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1448592.0, ans=0.125 2023-06-23 09:02:55,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1448712.0, ans=0.0 2023-06-23 09:03:08,012 INFO [train.py:996] (1/4) Epoch 8, batch 28000, loss[loss=0.2197, simple_loss=0.2839, pruned_loss=0.07775, over 21860.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3151, pruned_loss=0.07765, over 4281951.37 frames. ], batch size: 107, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:04:05,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1448952.0, ans=0.0 2023-06-23 09:04:29,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1449012.0, ans=0.0 2023-06-23 09:04:35,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1449012.0, ans=0.2 2023-06-23 09:04:52,719 INFO [train.py:996] (1/4) Epoch 8, batch 28050, loss[loss=0.1992, simple_loss=0.2617, pruned_loss=0.06835, over 21391.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.313, pruned_loss=0.07897, over 4282931.27 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:05:26,544 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.954e+02 6.052e+02 8.048e+02 2.120e+03, threshold=1.210e+03, percent-clipped=2.0 2023-06-23 09:05:43,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1449192.0, ans=0.125 2023-06-23 09:06:06,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1449312.0, ans=0.0 2023-06-23 09:06:27,354 INFO [train.py:996] (1/4) Epoch 8, batch 28100, loss[loss=0.205, simple_loss=0.268, pruned_loss=0.07102, over 21514.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3094, pruned_loss=0.07857, over 4280230.76 frames. ], batch size: 195, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:06:33,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=15.0 2023-06-23 09:07:03,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1449492.0, ans=0.125 2023-06-23 09:07:03,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1449492.0, ans=0.125 2023-06-23 09:07:57,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-23 09:08:06,377 INFO [train.py:996] (1/4) Epoch 8, batch 28150, loss[loss=0.2436, simple_loss=0.2989, pruned_loss=0.09414, over 21612.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3038, pruned_loss=0.07931, over 4285894.19 frames. ], batch size: 264, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:08:27,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1449732.0, ans=0.0 2023-06-23 09:08:39,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.985e+02 7.502e+02 1.116e+03 2.390e+03, threshold=1.500e+03, percent-clipped=18.0 2023-06-23 09:09:44,388 INFO [train.py:996] (1/4) Epoch 8, batch 28200, loss[loss=0.2427, simple_loss=0.3062, pruned_loss=0.08963, over 21670.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3007, pruned_loss=0.08052, over 4281749.86 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:10:13,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1450032.0, ans=0.1 2023-06-23 09:11:27,089 INFO [train.py:996] (1/4) Epoch 8, batch 28250, loss[loss=0.2098, simple_loss=0.2734, pruned_loss=0.07306, over 21852.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3042, pruned_loss=0.08358, over 4283067.69 frames. ], batch size: 107, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:11:49,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1450332.0, ans=0.07 2023-06-23 09:12:04,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 5.319e+02 7.100e+02 8.712e+02 1.908e+03, threshold=1.420e+03, percent-clipped=3.0 2023-06-23 09:12:27,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.64 vs. limit=12.0 2023-06-23 09:12:31,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1450452.0, ans=0.95 2023-06-23 09:13:00,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1450512.0, ans=0.0 2023-06-23 09:13:05,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1450572.0, ans=0.0 2023-06-23 09:13:05,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1450572.0, ans=0.125 2023-06-23 09:13:06,466 INFO [train.py:996] (1/4) Epoch 8, batch 28300, loss[loss=0.2108, simple_loss=0.2998, pruned_loss=0.0609, over 21837.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3019, pruned_loss=0.08132, over 4272101.73 frames. 
], batch size: 371, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:13:10,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1450572.0, ans=0.05 2023-06-23 09:13:29,884 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-23 09:13:54,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1450692.0, ans=0.125 2023-06-23 09:14:31,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-06-23 09:14:32,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1450812.0, ans=0.125 2023-06-23 09:14:41,599 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-23 09:14:45,013 INFO [train.py:996] (1/4) Epoch 8, batch 28350, loss[loss=0.2237, simple_loss=0.3443, pruned_loss=0.05157, over 19818.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2989, pruned_loss=0.0761, over 4269355.00 frames. ], batch size: 703, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:14:56,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1450872.0, ans=0.125 2023-06-23 09:15:21,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.828e+02 5.599e+02 8.860e+02 1.294e+03 2.489e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 09:16:16,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1451112.0, ans=0.125 2023-06-23 09:16:23,431 INFO [train.py:996] (1/4) Epoch 8, batch 28400, loss[loss=0.2254, simple_loss=0.2944, pruned_loss=0.07825, over 21638.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2967, pruned_loss=0.07622, over 4261923.14 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:16:32,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1451172.0, ans=0.125 2023-06-23 09:16:38,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1451232.0, ans=0.125 2023-06-23 09:16:42,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1451232.0, ans=0.125 2023-06-23 09:17:52,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451412.0, ans=0.1 2023-06-23 09:18:03,305 INFO [train.py:996] (1/4) Epoch 8, batch 28450, loss[loss=0.2666, simple_loss=0.3363, pruned_loss=0.09847, over 21409.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3026, pruned_loss=0.07997, over 4264985.66 frames. 
], batch size: 131, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:18:15,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1451472.0, ans=0.0 2023-06-23 09:18:40,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451592.0, ans=0.1 2023-06-23 09:18:41,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.622e+02 7.842e+02 1.295e+03 2.358e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 09:19:39,554 INFO [train.py:996] (1/4) Epoch 8, batch 28500, loss[loss=0.2785, simple_loss=0.3454, pruned_loss=0.1057, over 21632.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3064, pruned_loss=0.08355, over 4279233.22 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:19:41,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1451772.0, ans=0.125 2023-06-23 09:20:21,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=22.5 2023-06-23 09:20:28,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1451892.0, ans=0.02 2023-06-23 09:20:37,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-23 09:21:00,108 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:21:16,172 INFO [train.py:996] (1/4) Epoch 8, batch 28550, loss[loss=0.2885, simple_loss=0.3786, pruned_loss=0.09925, over 21550.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3136, pruned_loss=0.08569, over 4279445.82 frames. ], batch size: 230, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:21:21,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-23 09:22:02,861 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.502e+02 4.663e+02 6.262e+02 9.623e+02 1.798e+03, threshold=1.252e+03, percent-clipped=1.0 2023-06-23 09:22:32,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1452252.0, ans=0.125 2023-06-23 09:23:01,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1452372.0, ans=0.2 2023-06-23 09:23:02,395 INFO [train.py:996] (1/4) Epoch 8, batch 28600, loss[loss=0.3069, simple_loss=0.3746, pruned_loss=0.1195, over 21545.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.321, pruned_loss=0.08771, over 4279786.49 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:23:18,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-23 09:24:34,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1452612.0, ans=0.0 2023-06-23 09:24:41,386 INFO [train.py:996] (1/4) Epoch 8, batch 28650, loss[loss=0.2235, simple_loss=0.2931, pruned_loss=0.07693, over 21586.00 frames. 
], tot_loss[loss=0.246, simple_loss=0.3163, pruned_loss=0.08782, over 4272484.46 frames. ], batch size: 415, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:24:54,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1452672.0, ans=0.0 2023-06-23 09:25:06,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1452732.0, ans=0.125 2023-06-23 09:25:23,242 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 4.493e+02 5.758e+02 7.930e+02 1.580e+03, threshold=1.152e+03, percent-clipped=4.0 2023-06-23 09:26:08,064 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-23 09:26:26,176 INFO [train.py:996] (1/4) Epoch 8, batch 28700, loss[loss=0.2538, simple_loss=0.3245, pruned_loss=0.09155, over 21245.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.315, pruned_loss=0.08861, over 4273730.47 frames. ], batch size: 143, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:26:32,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1452972.0, ans=0.0 2023-06-23 09:26:45,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1453032.0, ans=0.0 2023-06-23 09:26:45,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1453032.0, ans=0.125 2023-06-23 09:27:09,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1453092.0, ans=0.0 2023-06-23 09:27:50,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1453212.0, ans=0.2 2023-06-23 09:28:04,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1453272.0, ans=0.0 2023-06-23 09:28:05,325 INFO [train.py:996] (1/4) Epoch 8, batch 28750, loss[loss=0.1999, simple_loss=0.2932, pruned_loss=0.05337, over 21891.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3143, pruned_loss=0.08804, over 4286748.63 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:28:07,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1453272.0, ans=0.0 2023-06-23 09:28:41,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.805e+02 4.984e+02 6.274e+02 9.092e+02 1.737e+03, threshold=1.255e+03, percent-clipped=10.0 2023-06-23 09:28:44,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1453392.0, ans=0.125 2023-06-23 09:29:26,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1453512.0, ans=0.0 2023-06-23 09:29:49,454 INFO [train.py:996] (1/4) Epoch 8, batch 28800, loss[loss=0.2912, simple_loss=0.3549, pruned_loss=0.1138, over 21919.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3183, pruned_loss=0.08853, over 4289902.72 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:29:59,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=15.0 2023-06-23 09:30:25,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1453692.0, ans=0.125 2023-06-23 09:30:32,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.58 vs. limit=10.0 2023-06-23 09:31:00,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.20 vs. limit=15.0 2023-06-23 09:31:09,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1453812.0, ans=0.0 2023-06-23 09:31:22,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1453812.0, ans=0.2 2023-06-23 09:31:29,217 INFO [train.py:996] (1/4) Epoch 8, batch 28850, loss[loss=0.2677, simple_loss=0.3379, pruned_loss=0.09874, over 21830.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.32, pruned_loss=0.09052, over 4292530.40 frames. ], batch size: 107, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:31:48,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1453932.0, ans=0.125 2023-06-23 09:32:03,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.517e+02 4.921e+02 6.393e+02 7.769e+02 1.909e+03, threshold=1.279e+03, percent-clipped=3.0 2023-06-23 09:32:05,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1453992.0, ans=0.2 2023-06-23 09:32:15,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1453992.0, ans=0.125 2023-06-23 09:32:52,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1454112.0, ans=0.125 2023-06-23 09:33:11,085 INFO [train.py:996] (1/4) Epoch 8, batch 28900, loss[loss=0.3554, simple_loss=0.4597, pruned_loss=0.1256, over 19840.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3219, pruned_loss=0.09145, over 4289811.35 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:34:08,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1454352.0, ans=0.125 2023-06-23 09:34:25,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1454352.0, ans=0.125 2023-06-23 09:34:52,377 INFO [train.py:996] (1/4) Epoch 8, batch 28950, loss[loss=0.2382, simple_loss=0.3478, pruned_loss=0.06433, over 21668.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3229, pruned_loss=0.09084, over 4286032.05 frames. 
], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:34:56,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1454472.0, ans=0.125 2023-06-23 09:35:05,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454472.0, ans=0.1 2023-06-23 09:35:09,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454532.0, ans=0.1 2023-06-23 09:35:17,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454532.0, ans=0.1 2023-06-23 09:35:33,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1454592.0, ans=0.125 2023-06-23 09:35:41,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 4.837e+02 6.969e+02 9.888e+02 2.996e+03, threshold=1.394e+03, percent-clipped=10.0 2023-06-23 09:35:48,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454592.0, ans=0.1 2023-06-23 09:35:50,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1454592.0, ans=0.5 2023-06-23 09:35:51,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1454592.0, ans=0.1 2023-06-23 09:36:20,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454712.0, ans=0.125 2023-06-23 09:36:32,726 INFO [train.py:996] (1/4) Epoch 8, batch 29000, loss[loss=0.2666, simple_loss=0.3459, pruned_loss=0.09365, over 21757.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3258, pruned_loss=0.08922, over 4284518.86 frames. ], batch size: 332, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:36:59,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1454832.0, ans=0.125 2023-06-23 09:37:58,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-23 09:38:10,316 INFO [train.py:996] (1/4) Epoch 8, batch 29050, loss[loss=0.2797, simple_loss=0.3358, pruned_loss=0.1118, over 21722.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3242, pruned_loss=0.08993, over 4282740.40 frames. ], batch size: 473, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:38:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1455072.0, ans=0.125 2023-06-23 09:39:01,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 4.892e+02 6.390e+02 8.567e+02 1.270e+03, threshold=1.278e+03, percent-clipped=0.0 2023-06-23 09:39:18,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-23 09:39:48,132 INFO [train.py:996] (1/4) Epoch 8, batch 29100, loss[loss=0.2079, simple_loss=0.2623, pruned_loss=0.07674, over 20736.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3154, pruned_loss=0.08743, over 4285671.64 frames. 
], batch size: 608, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:40:16,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1455432.0, ans=0.2 2023-06-23 09:41:02,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1455552.0, ans=0.0 2023-06-23 09:41:22,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1455612.0, ans=0.1 2023-06-23 09:41:26,878 INFO [train.py:996] (1/4) Epoch 8, batch 29150, loss[loss=0.2319, simple_loss=0.2887, pruned_loss=0.08749, over 21963.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3148, pruned_loss=0.0853, over 4283672.49 frames. ], batch size: 103, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:42:14,900 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.362e+02 4.552e+02 5.999e+02 9.339e+02 2.396e+03, threshold=1.200e+03, percent-clipped=6.0 2023-06-23 09:42:18,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1455792.0, ans=0.125 2023-06-23 09:42:48,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1455912.0, ans=0.125 2023-06-23 09:43:01,028 INFO [train.py:996] (1/4) Epoch 8, batch 29200, loss[loss=0.2363, simple_loss=0.3135, pruned_loss=0.07957, over 21741.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3109, pruned_loss=0.08439, over 4281801.39 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:43:19,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455972.0, ans=0.1 2023-06-23 09:44:07,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-23 09:44:27,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1456212.0, ans=0.125 2023-06-23 09:44:33,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1456212.0, ans=0.0 2023-06-23 09:44:35,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456212.0, ans=0.1 2023-06-23 09:44:50,182 INFO [train.py:996] (1/4) Epoch 8, batch 29250, loss[loss=0.2423, simple_loss=0.3085, pruned_loss=0.08804, over 21271.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3102, pruned_loss=0.08288, over 4273182.10 frames. ], batch size: 144, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:44:50,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456272.0, ans=0.1 2023-06-23 09:45:33,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 4.714e+02 6.019e+02 9.609e+02 2.170e+03, threshold=1.204e+03, percent-clipped=18.0 2023-06-23 09:45:47,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1456452.0, ans=0.2 2023-06-23 09:45:59,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-23 09:46:07,509 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:46:24,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-23 09:46:33,668 INFO [train.py:996] (1/4) Epoch 8, batch 29300, loss[loss=0.2213, simple_loss=0.3058, pruned_loss=0.06843, over 21528.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3106, pruned_loss=0.08196, over 4266130.52 frames. ], batch size: 195, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:46:51,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1456572.0, ans=0.125 2023-06-23 09:47:04,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1456632.0, ans=0.125 2023-06-23 09:47:29,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-23 09:47:43,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1456812.0, ans=0.04949747468305833 2023-06-23 09:48:17,854 INFO [train.py:996] (1/4) Epoch 8, batch 29350, loss[loss=0.225, simple_loss=0.2943, pruned_loss=0.07786, over 21826.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3069, pruned_loss=0.08172, over 4266332.25 frames. ], batch size: 118, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:48:47,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-23 09:48:47,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-23 09:48:52,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.303e+02 4.822e+02 6.215e+02 9.294e+02 1.604e+03, threshold=1.243e+03, percent-clipped=12.0 2023-06-23 09:49:00,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-23 09:49:50,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1457112.0, ans=0.0 2023-06-23 09:49:56,836 INFO [train.py:996] (1/4) Epoch 8, batch 29400, loss[loss=0.1498, simple_loss=0.1981, pruned_loss=0.05077, over 21379.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3036, pruned_loss=0.07915, over 4265050.66 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:50:27,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1457292.0, ans=0.0 2023-06-23 09:50:37,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-23 09:50:46,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1457352.0, ans=0.125 2023-06-23 09:51:10,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. 
limit=12.0 2023-06-23 09:51:26,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1457412.0, ans=0.125 2023-06-23 09:51:31,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1457412.0, ans=0.125 2023-06-23 09:51:35,726 INFO [train.py:996] (1/4) Epoch 8, batch 29450, loss[loss=0.2783, simple_loss=0.3445, pruned_loss=0.1061, over 21811.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3037, pruned_loss=0.07837, over 4267485.55 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:51:41,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1457472.0, ans=0.015 2023-06-23 09:52:09,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1457592.0, ans=0.2 2023-06-23 09:52:11,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.254e+02 6.189e+02 1.171e+03 1.650e+03 2.483e+03, threshold=2.343e+03, percent-clipped=48.0 2023-06-23 09:52:30,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-23 09:53:07,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.29 vs. limit=22.5 2023-06-23 09:53:14,234 INFO [train.py:996] (1/4) Epoch 8, batch 29500, loss[loss=0.2747, simple_loss=0.3336, pruned_loss=0.1079, over 21563.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3085, pruned_loss=0.08129, over 4267387.31 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:54:07,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-23 09:54:09,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-23 09:54:49,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458012.0, ans=0.1 2023-06-23 09:54:52,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-23 09:54:52,640 INFO [train.py:996] (1/4) Epoch 8, batch 29550, loss[loss=0.2648, simple_loss=0.3268, pruned_loss=0.1014, over 21885.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.307, pruned_loss=0.08227, over 4271460.90 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:55:26,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-23 09:55:27,666 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:55:28,668 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 5.921e+02 7.933e+02 1.124e+03 2.184e+03, threshold=1.587e+03, percent-clipped=0.0 2023-06-23 09:56:28,483 INFO [train.py:996] (1/4) Epoch 8, batch 29600, loss[loss=0.2913, simple_loss=0.3728, pruned_loss=0.1049, over 21819.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.314, pruned_loss=0.08511, over 4276058.40 frames. 
], batch size: 316, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:56:34,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-23 09:57:42,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1458552.0, ans=0.0 2023-06-23 09:57:50,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1458612.0, ans=0.0 2023-06-23 09:57:53,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1458612.0, ans=0.125 2023-06-23 09:57:54,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1458612.0, ans=0.0 2023-06-23 09:58:06,727 INFO [train.py:996] (1/4) Epoch 8, batch 29650, loss[loss=0.2186, simple_loss=0.2893, pruned_loss=0.07396, over 21824.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3139, pruned_loss=0.0825, over 4279532.61 frames. ], batch size: 298, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:58:14,014 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-23 09:58:39,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1458792.0, ans=0.0 2023-06-23 09:58:46,221 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.053e+02 5.356e+02 7.008e+02 1.123e+03 3.687e+03, threshold=1.402e+03, percent-clipped=10.0 2023-06-23 09:58:55,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1458792.0, ans=0.0 2023-06-23 09:59:45,985 INFO [train.py:996] (1/4) Epoch 8, batch 29700, loss[loss=0.2309, simple_loss=0.3045, pruned_loss=0.07861, over 21505.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3159, pruned_loss=0.08302, over 4280480.14 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:00:03,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-23 10:00:41,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 10:01:14,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-23 10:01:19,989 INFO [train.py:996] (1/4) Epoch 8, batch 29750, loss[loss=0.2256, simple_loss=0.2967, pruned_loss=0.07724, over 21865.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3198, pruned_loss=0.0829, over 4281982.90 frames. ], batch size: 107, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:02:05,842 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.705e+02 6.606e+02 1.008e+03 2.185e+03, threshold=1.321e+03, percent-clipped=11.0 2023-06-23 10:02:17,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1459392.0, ans=0.0 2023-06-23 10:02:43,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.08 vs. 
limit=12.0 2023-06-23 10:02:47,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1459512.0, ans=0.0 2023-06-23 10:02:58,158 INFO [train.py:996] (1/4) Epoch 8, batch 29800, loss[loss=0.2187, simple_loss=0.2966, pruned_loss=0.07043, over 21673.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3206, pruned_loss=0.08301, over 4279702.10 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:03:01,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1459572.0, ans=0.125 2023-06-23 10:03:17,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1459632.0, ans=0.125 2023-06-23 10:03:48,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1459692.0, ans=0.125 2023-06-23 10:04:24,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-23 10:04:30,524 INFO [train.py:996] (1/4) Epoch 8, batch 29850, loss[loss=0.2535, simple_loss=0.316, pruned_loss=0.09551, over 21534.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3157, pruned_loss=0.08113, over 4281061.90 frames. ], batch size: 548, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:04:38,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459872.0, ans=0.1 2023-06-23 10:05:16,481 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.132e+02 5.162e+02 6.765e+02 9.039e+02 1.623e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 10:05:28,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1459992.0, ans=0.125 2023-06-23 10:05:57,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.12 vs. limit=22.5 2023-06-23 10:06:00,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-23 10:06:08,527 INFO [train.py:996] (1/4) Epoch 8, batch 29900, loss[loss=0.2738, simple_loss=0.3633, pruned_loss=0.09213, over 21434.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3152, pruned_loss=0.08182, over 4281430.63 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:06:44,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=22.5 2023-06-23 10:06:51,347 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:07:13,429 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:07:42,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1460412.0, ans=0.125 2023-06-23 10:07:48,913 INFO [train.py:996] (1/4) Epoch 8, batch 29950, loss[loss=0.2887, simple_loss=0.3541, pruned_loss=0.1116, over 21434.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3197, pruned_loss=0.08596, over 4287411.55 frames. 
], batch size: 131, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:07:54,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1460472.0, ans=0.125 2023-06-23 10:08:45,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.472e+02 5.581e+02 7.305e+02 1.013e+03 2.177e+03, threshold=1.461e+03, percent-clipped=7.0 2023-06-23 10:09:33,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460772.0, ans=0.1 2023-06-23 10:09:34,058 INFO [train.py:996] (1/4) Epoch 8, batch 30000, loss[loss=0.229, simple_loss=0.3109, pruned_loss=0.07358, over 21235.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3217, pruned_loss=0.08649, over 4283225.49 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:09:34,059 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 10:09:54,199 INFO [train.py:1028] (1/4) Epoch 8, validation: loss=0.244, simple_loss=0.3443, pruned_loss=0.07188, over 1796401.00 frames. 2023-06-23 10:09:54,200 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 10:10:18,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1460832.0, ans=0.04949747468305833 2023-06-23 10:10:28,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1460832.0, ans=0.125 2023-06-23 10:11:11,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1460952.0, ans=0.125 2023-06-23 10:11:31,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1461012.0, ans=0.125 2023-06-23 10:11:32,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1461012.0, ans=0.125 2023-06-23 10:11:42,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1461012.0, ans=0.125 2023-06-23 10:11:46,628 INFO [train.py:996] (1/4) Epoch 8, batch 30050, loss[loss=0.2592, simple_loss=0.3871, pruned_loss=0.06568, over 20803.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.326, pruned_loss=0.08384, over 4276585.71 frames. ], batch size: 607, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:12:14,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.48 vs. limit=22.5 2023-06-23 10:12:21,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1461192.0, ans=0.2 2023-06-23 10:12:25,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 4.815e+02 7.305e+02 9.705e+02 3.214e+03, threshold=1.461e+03, percent-clipped=9.0 2023-06-23 10:13:05,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.69 vs. limit=8.0 2023-06-23 10:13:18,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.04 vs. limit=15.0 2023-06-23 10:13:26,313 INFO [train.py:996] (1/4) Epoch 8, batch 30100, loss[loss=0.2054, simple_loss=0.2687, pruned_loss=0.07104, over 21894.00 frames. 
], tot_loss[loss=0.2449, simple_loss=0.3239, pruned_loss=0.0829, over 4273848.12 frames. ], batch size: 113, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:13:54,121 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-23 10:14:48,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461612.0, ans=0.1 2023-06-23 10:14:48,699 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-23 10:14:48,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1461612.0, ans=22.5 2023-06-23 10:14:56,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1461612.0, ans=0.125 2023-06-23 10:15:05,453 INFO [train.py:996] (1/4) Epoch 8, batch 30150, loss[loss=0.2519, simple_loss=0.3226, pruned_loss=0.0906, over 20000.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.319, pruned_loss=0.08402, over 4266965.97 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:15:34,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1461732.0, ans=0.125 2023-06-23 10:15:59,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.498e+02 5.519e+02 7.633e+02 1.440e+03, threshold=1.104e+03, percent-clipped=0.0 2023-06-23 10:16:01,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1461792.0, ans=0.125 2023-06-23 10:16:19,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1461852.0, ans=0.0 2023-06-23 10:16:35,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.81 vs. limit=8.0 2023-06-23 10:16:47,033 INFO [train.py:996] (1/4) Epoch 8, batch 30200, loss[loss=0.2929, simple_loss=0.3771, pruned_loss=0.1044, over 21474.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3199, pruned_loss=0.08274, over 4260521.00 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:16:54,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1461972.0, ans=0.125 2023-06-23 10:16:54,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1461972.0, ans=0.0 2023-06-23 10:17:38,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-23 10:17:50,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1462092.0, ans=0.125 2023-06-23 10:18:27,140 INFO [train.py:996] (1/4) Epoch 8, batch 30250, loss[loss=0.2402, simple_loss=0.3106, pruned_loss=0.08493, over 19962.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3276, pruned_loss=0.08546, over 4260746.30 frames. 
], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:19:25,070 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.739e+02 8.338e+02 1.276e+03 3.132e+03, threshold=1.668e+03, percent-clipped=33.0 2023-06-23 10:19:35,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1462452.0, ans=0.125 2023-06-23 10:19:35,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-23 10:19:47,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1462452.0, ans=0.125 2023-06-23 10:20:10,531 INFO [train.py:996] (1/4) Epoch 8, batch 30300, loss[loss=0.2004, simple_loss=0.2645, pruned_loss=0.06813, over 21610.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3254, pruned_loss=0.08562, over 4259786.31 frames. ], batch size: 282, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:20:53,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1462632.0, ans=0.0 2023-06-23 10:21:00,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1462692.0, ans=0.125 2023-06-23 10:21:18,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462752.0, ans=0.125 2023-06-23 10:21:19,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-23 10:21:27,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1462752.0, ans=0.1 2023-06-23 10:22:08,468 INFO [train.py:996] (1/4) Epoch 8, batch 30350, loss[loss=0.2629, simple_loss=0.3449, pruned_loss=0.09044, over 21733.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3278, pruned_loss=0.08779, over 4264839.49 frames. ], batch size: 332, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:22:08,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1462872.0, ans=0.0 2023-06-23 10:22:23,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1462932.0, ans=0.125 2023-06-23 10:22:43,072 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.418e+02 5.113e+02 8.159e+02 1.296e+03 2.782e+03, threshold=1.632e+03, percent-clipped=10.0 2023-06-23 10:22:50,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1463052.0, ans=0.125 2023-06-23 10:23:13,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1463112.0, ans=0.2 2023-06-23 10:23:21,521 INFO [train.py:996] (1/4) Epoch 8, batch 30400, loss[loss=0.2156, simple_loss=0.2661, pruned_loss=0.08251, over 20335.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3225, pruned_loss=0.08598, over 4259331.44 frames. 
], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:23:22,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1463172.0, ans=0.125 2023-06-23 10:24:01,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1463292.0, ans=0.1 2023-06-23 10:24:05,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1463292.0, ans=0.125 2023-06-23 10:24:24,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1463352.0, ans=0.125 2023-06-23 10:24:45,445 INFO [train.py:996] (1/4) Epoch 8, batch 30450, loss[loss=0.2969, simple_loss=0.4205, pruned_loss=0.08665, over 19889.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3235, pruned_loss=0.08497, over 4200518.03 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:25:01,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-23 10:25:13,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463532.0, ans=0.125 2023-06-23 10:25:24,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.983e+02 7.886e+02 1.299e+03 2.180e+03 7.301e+03, threshold=2.598e+03, percent-clipped=35.0 2023-06-23 10:27:25,974 INFO [train.py:996] (1/4) Epoch 9, batch 0, loss[loss=0.2055, simple_loss=0.2747, pruned_loss=0.06818, over 21754.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2747, pruned_loss=0.06818, over 21754.00 frames. ], batch size: 317, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:27:25,975 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 10:27:41,473 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2407, simple_loss=0.3498, pruned_loss=0.06579, over 1796401.00 frames. 2023-06-23 10:27:41,473 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 10:27:52,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1463742.0, ans=0.0 2023-06-23 10:28:04,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-06-23 10:28:05,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1463802.0, ans=0.0 2023-06-23 10:28:22,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1463862.0, ans=0.0 2023-06-23 10:28:40,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1463862.0, ans=0.125 2023-06-23 10:28:42,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463922.0, ans=0.125 2023-06-23 10:28:57,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1463922.0, ans=0.125 2023-06-23 10:28:59,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1463922.0, ans=0.0 2023-06-23 10:29:00,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1463922.0, ans=0.2 2023-06-23 10:29:01,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-23 10:29:21,252 INFO [train.py:996] (1/4) Epoch 9, batch 50, loss[loss=0.2783, simple_loss=0.3378, pruned_loss=0.1094, over 21906.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3256, pruned_loss=0.0854, over 960969.17 frames. ], batch size: 107, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:29:26,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464042.0, ans=0.1 2023-06-23 10:29:39,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-23 10:30:22,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.385e+02 5.823e+02 9.334e+02 1.610e+03 5.016e+03, threshold=1.867e+03, percent-clipped=15.0 2023-06-23 10:30:22,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464222.0, ans=0.125 2023-06-23 10:30:58,876 INFO [train.py:996] (1/4) Epoch 9, batch 100, loss[loss=0.2508, simple_loss=0.351, pruned_loss=0.07528, over 21313.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3373, pruned_loss=0.08759, over 1694056.28 frames. ], batch size: 176, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:31:00,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.68 vs. 
limit=15.0 2023-06-23 10:31:08,974 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:31:39,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1464462.0, ans=0.2 2023-06-23 10:31:55,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1464462.0, ans=0.125 2023-06-23 10:32:15,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1464522.0, ans=0.0 2023-06-23 10:32:26,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464582.0, ans=0.1 2023-06-23 10:32:31,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1464582.0, ans=0.125 2023-06-23 10:32:35,506 INFO [train.py:996] (1/4) Epoch 9, batch 150, loss[loss=0.2436, simple_loss=0.3073, pruned_loss=0.08995, over 21940.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3404, pruned_loss=0.08768, over 2272418.79 frames. ], batch size: 316, lr: 3.39e-03, grad_scale: 16.0 2023-06-23 10:32:50,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1464702.0, ans=0.125 2023-06-23 10:33:26,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1464762.0, ans=0.0 2023-06-23 10:33:38,728 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 5.241e+02 6.632e+02 9.762e+02 2.000e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-23 10:33:40,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1464822.0, ans=0.125 2023-06-23 10:33:42,907 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-23 10:33:58,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1464882.0, ans=0.0 2023-06-23 10:34:12,646 INFO [train.py:996] (1/4) Epoch 9, batch 200, loss[loss=0.2316, simple_loss=0.3226, pruned_loss=0.07028, over 21893.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3354, pruned_loss=0.08545, over 2717335.09 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:34:20,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1464942.0, ans=0.2 2023-06-23 10:34:45,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1465002.0, ans=0.2 2023-06-23 10:35:08,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1465122.0, ans=0.125 2023-06-23 10:35:29,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1465122.0, ans=0.125 2023-06-23 10:35:43,234 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-23 10:35:50,372 INFO [train.py:996] (1/4) Epoch 9, batch 250, loss[loss=0.2305, simple_loss=0.3059, pruned_loss=0.07761, over 21869.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3304, pruned_loss=0.08461, over 3060003.23 frames. ], batch size: 332, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:35:54,384 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.54 vs. limit=10.0 2023-06-23 10:36:03,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1465242.0, ans=0.0 2023-06-23 10:36:08,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1465302.0, ans=0.0 2023-06-23 10:36:30,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1465362.0, ans=0.09899494936611666 2023-06-23 10:36:37,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1465362.0, ans=0.09899494936611666 2023-06-23 10:36:55,173 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 4.793e+02 6.542e+02 9.601e+02 1.948e+03, threshold=1.308e+03, percent-clipped=7.0 2023-06-23 10:37:29,676 INFO [train.py:996] (1/4) Epoch 9, batch 300, loss[loss=0.2522, simple_loss=0.3171, pruned_loss=0.09361, over 21760.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3267, pruned_loss=0.0855, over 3332442.08 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:37:44,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-23 10:37:50,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-23 10:38:51,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1465722.0, ans=0.125 2023-06-23 10:38:53,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1465782.0, ans=0.125 2023-06-23 10:39:11,159 INFO [train.py:996] (1/4) Epoch 9, batch 350, loss[loss=0.215, simple_loss=0.278, pruned_loss=0.07599, over 21431.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3209, pruned_loss=0.0838, over 3540899.52 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:40:16,456 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 5.700e+02 8.166e+02 1.374e+03 3.481e+03, threshold=1.633e+03, percent-clipped=26.0 2023-06-23 10:40:25,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1466022.0, ans=0.0 2023-06-23 10:40:52,652 INFO [train.py:996] (1/4) Epoch 9, batch 400, loss[loss=0.1754, simple_loss=0.2548, pruned_loss=0.04804, over 21628.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3134, pruned_loss=0.0821, over 3706117.31 frames. 
], batch size: 247, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:41:00,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1466142.0, ans=0.125 2023-06-23 10:41:04,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1466142.0, ans=0.0 2023-06-23 10:41:07,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1466202.0, ans=0.0 2023-06-23 10:42:07,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1466322.0, ans=0.125 2023-06-23 10:42:24,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1466382.0, ans=0.05 2023-06-23 10:42:25,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1466382.0, ans=0.1 2023-06-23 10:42:35,021 INFO [train.py:996] (1/4) Epoch 9, batch 450, loss[loss=0.2515, simple_loss=0.3161, pruned_loss=0.09349, over 21873.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3112, pruned_loss=0.08099, over 3835766.24 frames. ], batch size: 118, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:43:20,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1466562.0, ans=0.95 2023-06-23 10:43:40,421 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 7.104e+02 9.718e+02 1.338e+03 3.704e+03, threshold=1.944e+03, percent-clipped=17.0 2023-06-23 10:43:49,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466622.0, ans=0.1 2023-06-23 10:44:09,229 INFO [train.py:996] (1/4) Epoch 9, batch 500, loss[loss=0.2553, simple_loss=0.3396, pruned_loss=0.08546, over 21764.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3138, pruned_loss=0.08147, over 3931674.03 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:44:16,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-23 10:44:18,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466742.0, ans=0.125 2023-06-23 10:44:56,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1466862.0, ans=0.125 2023-06-23 10:45:03,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-23 10:45:25,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1466922.0, ans=0.0 2023-06-23 10:45:36,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1466982.0, ans=0.0 2023-06-23 10:45:40,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1466982.0, ans=0.0 2023-06-23 10:45:47,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-06-23 10:45:47,798 INFO [train.py:996] (1/4) Epoch 9, batch 550, loss[loss=0.2177, simple_loss=0.2906, pruned_loss=0.07241, over 21856.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.317, pruned_loss=0.0804, over 4011435.97 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:45:50,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1467042.0, ans=0.2 2023-06-23 10:46:09,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1467102.0, ans=0.2 2023-06-23 10:46:53,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.598e+02 6.514e+02 1.038e+03 2.454e+03, threshold=1.303e+03, percent-clipped=6.0 2023-06-23 10:47:22,498 INFO [train.py:996] (1/4) Epoch 9, batch 600, loss[loss=0.2589, simple_loss=0.3222, pruned_loss=0.09776, over 21751.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3184, pruned_loss=0.08032, over 4072516.02 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:47:38,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1467342.0, ans=0.125 2023-06-23 10:47:51,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1467402.0, ans=0.125 2023-06-23 10:48:34,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1467522.0, ans=0.2 2023-06-23 10:48:38,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1467522.0, ans=0.0 2023-06-23 10:48:41,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1467522.0, ans=0.0 2023-06-23 10:49:00,664 INFO [train.py:996] (1/4) Epoch 9, batch 650, loss[loss=0.2011, simple_loss=0.2639, pruned_loss=0.06916, over 19915.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3179, pruned_loss=0.08052, over 4124833.19 frames. ], batch size: 704, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:49:36,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-23 10:49:40,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-23 10:49:42,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1467762.0, ans=0.125 2023-06-23 10:49:47,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1467762.0, ans=0.0 2023-06-23 10:50:07,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.773e+02 6.710e+02 1.032e+03 2.196e+03, threshold=1.342e+03, percent-clipped=13.0 2023-06-23 10:50:36,076 INFO [train.py:996] (1/4) Epoch 9, batch 700, loss[loss=0.2629, simple_loss=0.3303, pruned_loss=0.0978, over 21790.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3171, pruned_loss=0.08069, over 4152622.02 frames. 
], batch size: 107, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:51:07,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1468002.0, ans=0.125 2023-06-23 10:51:18,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=15.0 2023-06-23 10:51:19,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1468062.0, ans=0.125 2023-06-23 10:51:33,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1468062.0, ans=0.04949747468305833 2023-06-23 10:51:34,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1468062.0, ans=0.0 2023-06-23 10:51:58,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1468182.0, ans=0.035 2023-06-23 10:52:09,589 INFO [train.py:996] (1/4) Epoch 9, batch 750, loss[loss=0.2391, simple_loss=0.2976, pruned_loss=0.09027, over 21710.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3164, pruned_loss=0.08086, over 4187246.92 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:53:15,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.132e+02 8.530e+02 1.237e+03 2.839e+03, threshold=1.706e+03, percent-clipped=17.0 2023-06-23 10:53:44,252 INFO [train.py:996] (1/4) Epoch 9, batch 800, loss[loss=0.2517, simple_loss=0.3078, pruned_loss=0.09783, over 21775.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3141, pruned_loss=0.08157, over 4195466.78 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:54:06,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-23 10:55:16,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-23 10:55:19,775 INFO [train.py:996] (1/4) Epoch 9, batch 850, loss[loss=0.2374, simple_loss=0.3097, pruned_loss=0.08254, over 21897.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3127, pruned_loss=0.08229, over 4220686.58 frames. 
], batch size: 124, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:55:47,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1468902.0, ans=0.2 2023-06-23 10:56:02,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1468902.0, ans=0.04949747468305833 2023-06-23 10:56:26,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1469022.0, ans=0.125 2023-06-23 10:56:31,573 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.922e+02 9.431e+02 1.406e+03 2.564e+03, threshold=1.886e+03, percent-clipped=15.0 2023-06-23 10:56:49,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1469082.0, ans=0.125 2023-06-23 10:57:02,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469082.0, ans=0.125 2023-06-23 10:57:05,028 INFO [train.py:996] (1/4) Epoch 9, batch 900, loss[loss=0.2278, simple_loss=0.3091, pruned_loss=0.07325, over 21797.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3097, pruned_loss=0.08131, over 4234168.18 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:57:37,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1469202.0, ans=0.1 2023-06-23 10:57:59,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1469262.0, ans=0.0 2023-06-23 10:58:45,641 INFO [train.py:996] (1/4) Epoch 9, batch 950, loss[loss=0.2617, simple_loss=0.3292, pruned_loss=0.09711, over 21738.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3099, pruned_loss=0.08128, over 4251641.67 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:59:10,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-23 10:59:30,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1469502.0, ans=0.0 2023-06-23 10:59:53,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.620e+02 5.465e+02 8.299e+02 1.252e+03 2.692e+03, threshold=1.660e+03, percent-clipped=4.0 2023-06-23 11:00:25,502 INFO [train.py:996] (1/4) Epoch 9, batch 1000, loss[loss=0.2621, simple_loss=0.3359, pruned_loss=0.09419, over 21371.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3089, pruned_loss=0.08157, over 4262713.49 frames. ], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:00:28,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-23 11:00:39,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.41 vs. 
limit=22.5 2023-06-23 11:01:18,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1469862.0, ans=0.125 2023-06-23 11:02:00,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1469982.0, ans=0.125 2023-06-23 11:02:11,560 INFO [train.py:996] (1/4) Epoch 9, batch 1050, loss[loss=0.1727, simple_loss=0.2508, pruned_loss=0.04736, over 21253.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3108, pruned_loss=0.08203, over 4264996.63 frames. ], batch size: 176, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:03:07,451 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:03:12,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1470222.0, ans=0.125 2023-06-23 11:03:15,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.887e+02 6.781e+02 8.518e+02 2.404e+03, threshold=1.356e+03, percent-clipped=1.0 2023-06-23 11:03:42,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1470282.0, ans=0.1 2023-06-23 11:03:58,864 INFO [train.py:996] (1/4) Epoch 9, batch 1100, loss[loss=0.2888, simple_loss=0.3585, pruned_loss=0.1095, over 21862.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3112, pruned_loss=0.08159, over 4275207.27 frames. ], batch size: 371, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:04:21,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1470402.0, ans=0.0 2023-06-23 11:04:40,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-23 11:04:58,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1470522.0, ans=0.125 2023-06-23 11:05:43,366 INFO [train.py:996] (1/4) Epoch 9, batch 1150, loss[loss=0.248, simple_loss=0.3206, pruned_loss=0.08771, over 21776.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3116, pruned_loss=0.0813, over 4275557.54 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:05:45,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1470642.0, ans=0.125 2023-06-23 11:05:47,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 11:06:10,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1470702.0, ans=0.2 2023-06-23 11:06:20,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1470762.0, ans=0.2 2023-06-23 11:06:43,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.848e+02 5.352e+02 7.597e+02 1.030e+03 2.056e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:07:25,433 INFO [train.py:996] (1/4) Epoch 9, batch 1200, loss[loss=0.2725, simple_loss=0.3602, pruned_loss=0.09241, over 21755.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3122, pruned_loss=0.08147, over 4276159.80 frames. 
], batch size: 391, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:07:25,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1470942.0, ans=0.0 2023-06-23 11:07:36,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1470942.0, ans=0.125 2023-06-23 11:07:38,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1470942.0, ans=0.0 2023-06-23 11:08:25,777 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:08:51,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=15.0 2023-06-23 11:08:57,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-23 11:08:58,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1471182.0, ans=0.2 2023-06-23 11:09:01,920 INFO [train.py:996] (1/4) Epoch 9, batch 1250, loss[loss=0.2411, simple_loss=0.3259, pruned_loss=0.07812, over 21765.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3151, pruned_loss=0.08268, over 4279343.20 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:09:08,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1471242.0, ans=0.1 2023-06-23 11:09:46,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1471362.0, ans=0.2 2023-06-23 11:10:01,584 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 4.897e+02 6.693e+02 9.449e+02 1.847e+03, threshold=1.339e+03, percent-clipped=0.0 2023-06-23 11:10:40,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:10:41,437 INFO [train.py:996] (1/4) Epoch 9, batch 1300, loss[loss=0.2425, simple_loss=0.3304, pruned_loss=0.0773, over 21748.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3151, pruned_loss=0.08284, over 4281201.97 frames. ], batch size: 414, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:10:41,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:10:49,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:10:58,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.23 vs. 
limit=15.0 2023-06-23 11:11:18,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1471662.0, ans=0.2 2023-06-23 11:11:26,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1471662.0, ans=0.125 2023-06-23 11:11:59,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1471782.0, ans=0.125 2023-06-23 11:12:16,946 INFO [train.py:996] (1/4) Epoch 9, batch 1350, loss[loss=0.3312, simple_loss=0.3859, pruned_loss=0.1383, over 21326.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.316, pruned_loss=0.08353, over 4290084.09 frames. ], batch size: 507, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:12:20,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1471842.0, ans=0.2 2023-06-23 11:12:38,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1471902.0, ans=0.125 2023-06-23 11:12:41,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1471902.0, ans=0.125 2023-06-23 11:12:53,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1471962.0, ans=0.0 2023-06-23 11:13:01,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-06-23 11:13:08,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-23 11:13:16,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.586e+02 4.783e+02 6.688e+02 9.049e+02 1.938e+03, threshold=1.338e+03, percent-clipped=9.0 2023-06-23 11:13:57,419 INFO [train.py:996] (1/4) Epoch 9, batch 1400, loss[loss=0.2104, simple_loss=0.2942, pruned_loss=0.0633, over 21385.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3142, pruned_loss=0.08343, over 4286100.89 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:14:01,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1472142.0, ans=0.125 2023-06-23 11:15:29,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1472382.0, ans=0.125 2023-06-23 11:15:39,599 INFO [train.py:996] (1/4) Epoch 9, batch 1450, loss[loss=0.3783, simple_loss=0.4487, pruned_loss=0.154, over 21525.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3142, pruned_loss=0.0838, over 4289910.86 frames. ], batch size: 507, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:15:52,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1472442.0, ans=0.0 2023-06-23 11:16:03,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. 
limit=8.0 2023-06-23 11:16:38,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1472622.0, ans=0.125 2023-06-23 11:16:44,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.412e+02 5.469e+02 7.594e+02 1.040e+03 1.854e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:17:11,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-23 11:17:20,424 INFO [train.py:996] (1/4) Epoch 9, batch 1500, loss[loss=0.223, simple_loss=0.3027, pruned_loss=0.07164, over 17592.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3159, pruned_loss=0.08578, over 4293339.78 frames. ], batch size: 60, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:17:46,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-23 11:19:03,002 INFO [train.py:996] (1/4) Epoch 9, batch 1550, loss[loss=0.2135, simple_loss=0.3148, pruned_loss=0.05611, over 20820.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.314, pruned_loss=0.08416, over 4293287.13 frames. ], batch size: 607, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:19:40,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=12.0 2023-06-23 11:19:46,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1473162.0, ans=0.125 2023-06-23 11:20:14,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.299e+02 6.765e+02 1.096e+03 1.841e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 11:20:22,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-23 11:20:28,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1473282.0, ans=0.035 2023-06-23 11:20:40,608 INFO [train.py:996] (1/4) Epoch 9, batch 1600, loss[loss=0.2535, simple_loss=0.3347, pruned_loss=0.0861, over 21417.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3116, pruned_loss=0.08276, over 4286865.26 frames. ], batch size: 548, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:20:51,819 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-23 11:22:01,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-23 11:22:23,031 INFO [train.py:996] (1/4) Epoch 9, batch 1650, loss[loss=0.247, simple_loss=0.315, pruned_loss=0.08946, over 21336.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3106, pruned_loss=0.08178, over 4276987.43 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:22:24,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1473642.0, ans=15.0 2023-06-23 11:22:48,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. 
limit=22.5 2023-06-23 11:22:56,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-23 11:23:17,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1473762.0, ans=0.0 2023-06-23 11:23:41,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.690e+02 7.589e+02 1.047e+03 2.202e+03, threshold=1.518e+03, percent-clipped=10.0 2023-06-23 11:23:41,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1473822.0, ans=0.0 2023-06-23 11:23:46,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1473822.0, ans=0.2 2023-06-23 11:23:55,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1473882.0, ans=0.125 2023-06-23 11:24:00,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1473882.0, ans=0.125 2023-06-23 11:24:06,489 INFO [train.py:996] (1/4) Epoch 9, batch 1700, loss[loss=0.245, simple_loss=0.3115, pruned_loss=0.08929, over 21050.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3146, pruned_loss=0.08378, over 4276839.64 frames. ], batch size: 608, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:24:52,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-23 11:25:53,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1474242.0, ans=0.0 2023-06-23 11:25:54,639 INFO [train.py:996] (1/4) Epoch 9, batch 1750, loss[loss=0.2289, simple_loss=0.3083, pruned_loss=0.07473, over 21538.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3151, pruned_loss=0.08164, over 4276495.77 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:26:29,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1474302.0, ans=0.0 2023-06-23 11:27:13,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 6.517e+02 8.827e+02 1.421e+03 2.550e+03, threshold=1.765e+03, percent-clipped=23.0 2023-06-23 11:27:43,586 INFO [train.py:996] (1/4) Epoch 9, batch 1800, loss[loss=0.2778, simple_loss=0.3561, pruned_loss=0.09976, over 21432.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3079, pruned_loss=0.07702, over 4275071.77 frames. 
], batch size: 507, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:28:35,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1474662.0, ans=0.0 2023-06-23 11:28:35,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1474662.0, ans=0.0 2023-06-23 11:28:54,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1474722.0, ans=0.125 2023-06-23 11:28:54,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474722.0, ans=0.1 2023-06-23 11:29:00,609 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:29:25,358 INFO [train.py:996] (1/4) Epoch 9, batch 1850, loss[loss=0.2532, simple_loss=0.3469, pruned_loss=0.07971, over 21455.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3117, pruned_loss=0.07612, over 4275212.71 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:30:34,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475022.0, ans=0.1 2023-06-23 11:30:37,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.456e+02 5.587e+02 8.108e+02 1.184e+03 2.810e+03, threshold=1.622e+03, percent-clipped=5.0 2023-06-23 11:30:49,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-23 11:31:11,689 INFO [train.py:996] (1/4) Epoch 9, batch 1900, loss[loss=0.201, simple_loss=0.2655, pruned_loss=0.06822, over 20301.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3131, pruned_loss=0.07791, over 4276941.57 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:31:13,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1475142.0, ans=0.0 2023-06-23 11:31:26,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1475142.0, ans=0.1 2023-06-23 11:31:28,348 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:31:32,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1475202.0, ans=0.0 2023-06-23 11:31:49,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475262.0, ans=0.1 2023-06-23 11:31:51,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-23 11:32:03,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1475262.0, ans=0.1 2023-06-23 11:32:14,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1475322.0, ans=0.125 2023-06-23 11:32:58,411 INFO [train.py:996] (1/4) Epoch 9, batch 1950, loss[loss=0.2128, simple_loss=0.2753, pruned_loss=0.07512, over 21617.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3109, pruned_loss=0.07826, over 4272964.73 frames. 
], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:33:12,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-23 11:33:20,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1475502.0, ans=0.125 2023-06-23 11:33:23,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1475502.0, ans=0.125 2023-06-23 11:33:44,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1475562.0, ans=0.125 2023-06-23 11:33:51,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1475622.0, ans=0.07 2023-06-23 11:33:59,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1475622.0, ans=0.0 2023-06-23 11:34:00,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 6.151e+02 9.472e+02 1.342e+03 2.834e+03, threshold=1.894e+03, percent-clipped=13.0 2023-06-23 11:34:40,444 INFO [train.py:996] (1/4) Epoch 9, batch 2000, loss[loss=0.1346, simple_loss=0.1981, pruned_loss=0.0356, over 15797.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3069, pruned_loss=0.07686, over 4264960.02 frames. ], batch size: 60, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:35:02,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1475802.0, ans=0.125 2023-06-23 11:35:13,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1475862.0, ans=0.5 2023-06-23 11:35:32,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475922.0, ans=0.1 2023-06-23 11:35:34,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1475922.0, ans=0.0 2023-06-23 11:35:48,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1475982.0, ans=0.0 2023-06-23 11:35:53,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1475982.0, ans=0.125 2023-06-23 11:36:16,277 INFO [train.py:996] (1/4) Epoch 9, batch 2050, loss[loss=0.2219, simple_loss=0.3027, pruned_loss=0.0705, over 21848.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3063, pruned_loss=0.07659, over 4270700.82 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:37:17,619 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.522e+02 6.873e+02 9.848e+02 2.030e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 11:37:38,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1476282.0, ans=0.2 2023-06-23 11:37:39,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1476282.0, ans=0.0 2023-06-23 11:37:51,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. 
limit=22.5 2023-06-23 11:37:56,800 INFO [train.py:996] (1/4) Epoch 9, batch 2100, loss[loss=0.2198, simple_loss=0.3017, pruned_loss=0.06896, over 21827.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3098, pruned_loss=0.07863, over 4275581.64 frames. ], batch size: 102, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:38:15,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2023-06-23 11:38:41,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1476462.0, ans=0.125 2023-06-23 11:38:49,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1476522.0, ans=0.2 2023-06-23 11:39:27,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1476582.0, ans=0.125 2023-06-23 11:39:39,895 INFO [train.py:996] (1/4) Epoch 9, batch 2150, loss[loss=0.2239, simple_loss=0.2969, pruned_loss=0.0755, over 21207.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3101, pruned_loss=0.08023, over 4274547.52 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:40:05,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476702.0, ans=0.1 2023-06-23 11:40:42,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 6.104e+02 8.851e+02 1.376e+03 2.645e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-23 11:41:16,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1476882.0, ans=0.125 2023-06-23 11:41:21,814 INFO [train.py:996] (1/4) Epoch 9, batch 2200, loss[loss=0.2359, simple_loss=0.3212, pruned_loss=0.07533, over 21735.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3136, pruned_loss=0.08079, over 4270457.15 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:43:02,284 INFO [train.py:996] (1/4) Epoch 9, batch 2250, loss[loss=0.2122, simple_loss=0.2701, pruned_loss=0.07715, over 21832.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3096, pruned_loss=0.07888, over 4264822.25 frames. ], batch size: 98, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:43:32,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1477302.0, ans=0.0 2023-06-23 11:43:53,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1477362.0, ans=0.0 2023-06-23 11:43:56,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1477422.0, ans=0.0 2023-06-23 11:44:09,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.493e+02 8.283e+02 1.333e+03 2.509e+03, threshold=1.657e+03, percent-clipped=6.0 2023-06-23 11:44:37,535 INFO [train.py:996] (1/4) Epoch 9, batch 2300, loss[loss=0.2655, simple_loss=0.3105, pruned_loss=0.1102, over 21306.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3075, pruned_loss=0.07878, over 4257235.75 frames. 
], batch size: 473, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:44:38,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1477542.0, ans=0.125 2023-06-23 11:45:28,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1477662.0, ans=0.04949747468305833 2023-06-23 11:45:29,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1477722.0, ans=0.125 2023-06-23 11:46:02,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-23 11:46:18,227 INFO [train.py:996] (1/4) Epoch 9, batch 2350, loss[loss=0.2482, simple_loss=0.3113, pruned_loss=0.09253, over 21437.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3048, pruned_loss=0.07926, over 4259790.97 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:38,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1477902.0, ans=0.125 2023-06-23 11:46:46,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1477902.0, ans=0.125 2023-06-23 11:46:50,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1477902.0, ans=0.05 2023-06-23 11:47:03,950 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-23 11:47:34,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1478022.0, ans=0.2 2023-06-23 11:47:36,710 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.310e+02 7.234e+02 1.027e+03 2.720e+03, threshold=1.447e+03, percent-clipped=6.0 2023-06-23 11:47:49,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-23 11:47:55,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478082.0, ans=0.125 2023-06-23 11:48:06,162 INFO [train.py:996] (1/4) Epoch 9, batch 2400, loss[loss=0.2026, simple_loss=0.2776, pruned_loss=0.06377, over 21878.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3101, pruned_loss=0.0812, over 4261254.42 frames. 
], batch size: 98, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:48:16,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1478142.0, ans=0.125 2023-06-23 11:48:22,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1478202.0, ans=0.125 2023-06-23 11:48:31,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1478202.0, ans=0.125 2023-06-23 11:49:16,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478322.0, ans=0.1 2023-06-23 11:49:43,587 INFO [train.py:996] (1/4) Epoch 9, batch 2450, loss[loss=0.2532, simple_loss=0.3287, pruned_loss=0.08889, over 21842.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.315, pruned_loss=0.08388, over 4265619.85 frames. ], batch size: 118, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:50:07,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 11:50:36,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1478622.0, ans=0.125 2023-06-23 11:50:54,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1478622.0, ans=0.125 2023-06-23 11:50:55,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.454e+02 8.636e+02 1.143e+03 3.101e+03, threshold=1.727e+03, percent-clipped=10.0 2023-06-23 11:51:24,330 INFO [train.py:996] (1/4) Epoch 9, batch 2500, loss[loss=0.2289, simple_loss=0.303, pruned_loss=0.07739, over 21111.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3127, pruned_loss=0.08305, over 4267644.20 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:51:54,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1478802.0, ans=0.1 2023-06-23 11:52:10,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1478862.0, ans=0.125 2023-06-23 11:52:32,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1478922.0, ans=0.0 2023-06-23 11:52:35,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-23 11:52:47,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1478982.0, ans=0.0 2023-06-23 11:53:03,588 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-23 11:53:05,829 INFO [train.py:996] (1/4) Epoch 9, batch 2550, loss[loss=0.2221, simple_loss=0.2918, pruned_loss=0.07621, over 21514.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3119, pruned_loss=0.08189, over 4263164.70 frames. 
], batch size: 391, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:53:22,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1479102.0, ans=0.015 2023-06-23 11:54:19,360 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 7.204e+02 9.571e+02 1.455e+03 2.660e+03, threshold=1.914e+03, percent-clipped=10.0 2023-06-23 11:54:46,761 INFO [train.py:996] (1/4) Epoch 9, batch 2600, loss[loss=0.2572, simple_loss=0.3248, pruned_loss=0.09484, over 21786.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3126, pruned_loss=0.08269, over 4263036.25 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:54:49,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.07 vs. limit=10.0 2023-06-23 11:54:50,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1479342.0, ans=0.2 2023-06-23 11:55:00,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 11:55:06,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1479402.0, ans=0.125 2023-06-23 11:55:32,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1479462.0, ans=0.0 2023-06-23 11:55:32,387 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:55:47,004 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:55:59,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1479522.0, ans=15.0 2023-06-23 11:56:28,274 INFO [train.py:996] (1/4) Epoch 9, batch 2650, loss[loss=0.2331, simple_loss=0.3139, pruned_loss=0.07612, over 21741.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3121, pruned_loss=0.08313, over 4272666.55 frames. ], batch size: 247, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:56:33,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1479642.0, ans=0.0 2023-06-23 11:56:54,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1479702.0, ans=0.2 2023-06-23 11:57:10,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1479762.0, ans=0.125 2023-06-23 11:57:16,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1479822.0, ans=0.125 2023-06-23 11:57:16,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1479822.0, ans=0.0 2023-06-23 11:57:18,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=10.0 2023-06-23 11:57:28,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1479822.0, ans=0.0 2023-06-23 11:57:35,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-23 11:57:37,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.783e+02 6.164e+02 7.850e+02 1.193e+03 2.220e+03, threshold=1.570e+03, percent-clipped=3.0 2023-06-23 11:57:52,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1479882.0, ans=0.125 2023-06-23 11:58:05,291 INFO [train.py:996] (1/4) Epoch 9, batch 2700, loss[loss=0.2391, simple_loss=0.3142, pruned_loss=0.08198, over 21315.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3098, pruned_loss=0.08137, over 4261185.95 frames. ], batch size: 549, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:58:54,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480122.0, ans=0.1 2023-06-23 11:59:14,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1480122.0, ans=0.05 2023-06-23 11:59:43,056 INFO [train.py:996] (1/4) Epoch 9, batch 2750, loss[loss=0.2282, simple_loss=0.3069, pruned_loss=0.07476, over 21471.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3098, pruned_loss=0.08157, over 4262844.90 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:00:25,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1480362.0, ans=0.2 2023-06-23 12:00:58,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 5.584e+02 7.738e+02 1.130e+03 2.409e+03, threshold=1.548e+03, percent-clipped=8.0 2023-06-23 12:01:27,107 INFO [train.py:996] (1/4) Epoch 9, batch 2800, loss[loss=0.2997, simple_loss=0.381, pruned_loss=0.1092, over 21802.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3131, pruned_loss=0.08296, over 4260874.58 frames. ], batch size: 316, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 12:01:39,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=22.5 2023-06-23 12:02:39,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-23 12:03:09,779 INFO [train.py:996] (1/4) Epoch 9, batch 2850, loss[loss=0.2281, simple_loss=0.3118, pruned_loss=0.07216, over 21649.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.316, pruned_loss=0.08469, over 4262251.24 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:03:25,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1480902.0, ans=0.125 2023-06-23 12:03:47,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-23 12:04:01,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. 
limit=15.0 2023-06-23 12:04:02,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1480962.0, ans=0.2 2023-06-23 12:04:25,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.732e+02 5.955e+02 8.768e+02 1.383e+03 2.997e+03, threshold=1.754e+03, percent-clipped=21.0 2023-06-23 12:04:41,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1481082.0, ans=0.0 2023-06-23 12:04:50,685 INFO [train.py:996] (1/4) Epoch 9, batch 2900, loss[loss=0.2179, simple_loss=0.2855, pruned_loss=0.07514, over 21263.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3122, pruned_loss=0.08345, over 4269092.69 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:05:09,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1481202.0, ans=0.0 2023-06-23 12:05:17,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1481202.0, ans=0.125 2023-06-23 12:05:38,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1481262.0, ans=0.125 2023-06-23 12:05:54,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481322.0, ans=0.1 2023-06-23 12:05:54,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1481322.0, ans=0.125 2023-06-23 12:05:58,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1481322.0, ans=0.0 2023-06-23 12:06:00,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1481322.0, ans=0.02 2023-06-23 12:06:03,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-23 12:06:17,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1481382.0, ans=0.125 2023-06-23 12:06:17,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1481382.0, ans=0.09899494936611666 2023-06-23 12:06:24,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1481382.0, ans=0.0 2023-06-23 12:06:31,731 INFO [train.py:996] (1/4) Epoch 9, batch 2950, loss[loss=0.2418, simple_loss=0.3438, pruned_loss=0.06989, over 21687.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3144, pruned_loss=0.08415, over 4279016.30 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:07:24,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1481562.0, ans=0.125 2023-06-23 12:07:28,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. 
limit=22.5 2023-06-23 12:07:48,170 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.526e+02 7.206e+02 1.005e+03 1.804e+03, threshold=1.441e+03, percent-clipped=1.0 2023-06-23 12:08:01,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1481682.0, ans=0.125 2023-06-23 12:08:08,704 INFO [train.py:996] (1/4) Epoch 9, batch 3000, loss[loss=0.2664, simple_loss=0.3413, pruned_loss=0.09579, over 21479.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3185, pruned_loss=0.08509, over 4281963.77 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:08:08,705 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 12:08:24,855 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2522, simple_loss=0.3459, pruned_loss=0.07924, over 1796401.00 frames. 2023-06-23 12:08:24,855 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 12:09:22,100 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-23 12:09:53,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481982.0, ans=0.1 2023-06-23 12:10:09,603 INFO [train.py:996] (1/4) Epoch 9, batch 3050, loss[loss=0.2648, simple_loss=0.3335, pruned_loss=0.09804, over 21766.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3179, pruned_loss=0.08367, over 4280022.69 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:11:24,324 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.590e+02 6.164e+02 8.184e+02 1.174e+03 2.237e+03, threshold=1.637e+03, percent-clipped=13.0 2023-06-23 12:11:31,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-23 12:11:44,960 INFO [train.py:996] (1/4) Epoch 9, batch 3100, loss[loss=0.2461, simple_loss=0.3521, pruned_loss=0.0701, over 19747.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3178, pruned_loss=0.08204, over 4278752.17 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:12:26,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1482402.0, ans=0.125 2023-06-23 12:13:11,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-23 12:13:36,668 INFO [train.py:996] (1/4) Epoch 9, batch 3150, loss[loss=0.3303, simple_loss=0.4288, pruned_loss=0.1159, over 21184.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3194, pruned_loss=0.08208, over 4276456.87 frames. 
], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:13:46,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482642.0, ans=0.1 2023-06-23 12:14:35,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1482822.0, ans=0.125 2023-06-23 12:14:46,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1482822.0, ans=0.0 2023-06-23 12:14:47,711 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 5.985e+02 8.535e+02 1.297e+03 2.485e+03, threshold=1.707e+03, percent-clipped=14.0 2023-06-23 12:14:49,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482822.0, ans=0.1 2023-06-23 12:15:15,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1482882.0, ans=0.2 2023-06-23 12:15:20,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1482882.0, ans=0.0 2023-06-23 12:15:24,483 INFO [train.py:996] (1/4) Epoch 9, batch 3200, loss[loss=0.2179, simple_loss=0.2918, pruned_loss=0.072, over 21229.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3205, pruned_loss=0.08311, over 4275854.58 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:15:36,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1482942.0, ans=0.125 2023-06-23 12:15:57,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1483002.0, ans=0.125 2023-06-23 12:16:09,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483062.0, ans=0.1 2023-06-23 12:16:35,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1483122.0, ans=0.0 2023-06-23 12:16:55,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1483182.0, ans=0.125 2023-06-23 12:17:00,002 INFO [train.py:996] (1/4) Epoch 9, batch 3250, loss[loss=0.2557, simple_loss=0.3232, pruned_loss=0.09411, over 21761.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3223, pruned_loss=0.08531, over 4281267.90 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:17:00,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1483242.0, ans=0.0 2023-06-23 12:17:01,147 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=22.5 2023-06-23 12:17:11,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1483242.0, ans=0.0 2023-06-23 12:17:57,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1483422.0, ans=0.0 2023-06-23 12:18:19,936 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.506e+02 4.912e+02 6.771e+02 1.025e+03 2.208e+03, threshold=1.354e+03, percent-clipped=1.0 2023-06-23 12:18:44,469 INFO [train.py:996] (1/4) Epoch 9, batch 3300, loss[loss=0.2488, simple_loss=0.3246, pruned_loss=0.0865, over 20787.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.316, pruned_loss=0.08457, over 4272724.02 frames. ], batch size: 611, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:19:00,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1483602.0, ans=0.0 2023-06-23 12:19:02,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1483602.0, ans=0.2 2023-06-23 12:19:08,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1483602.0, ans=0.125 2023-06-23 12:19:23,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1483662.0, ans=0.2 2023-06-23 12:19:56,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1483722.0, ans=0.2 2023-06-23 12:20:01,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1483722.0, ans=0.025 2023-06-23 12:20:24,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-23 12:20:25,117 INFO [train.py:996] (1/4) Epoch 9, batch 3350, loss[loss=0.2396, simple_loss=0.3038, pruned_loss=0.08773, over 21503.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3186, pruned_loss=0.08435, over 4276117.81 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:20:27,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1483842.0, ans=0.0 2023-06-23 12:21:25,910 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-23 12:21:40,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.733e+02 6.031e+02 9.696e+02 1.341e+03 2.502e+03, threshold=1.939e+03, percent-clipped=21.0 2023-06-23 12:21:42,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1484082.0, ans=0.2 2023-06-23 12:22:04,014 INFO [train.py:996] (1/4) Epoch 9, batch 3400, loss[loss=0.2142, simple_loss=0.2821, pruned_loss=0.07311, over 21279.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3192, pruned_loss=0.08543, over 4275770.50 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:44,217 INFO [train.py:996] (1/4) Epoch 9, batch 3450, loss[loss=0.231, simple_loss=0.2937, pruned_loss=0.08412, over 21539.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3144, pruned_loss=0.08475, over 4262208.07 frames. 
], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:59,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1484502.0, ans=0.0 2023-06-23 12:25:01,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.990e+02 5.635e+02 8.079e+02 1.246e+03 2.546e+03, threshold=1.616e+03, percent-clipped=4.0 2023-06-23 12:25:08,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1484682.0, ans=0.125 2023-06-23 12:25:20,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1484742.0, ans=0.1 2023-06-23 12:25:21,113 INFO [train.py:996] (1/4) Epoch 9, batch 3500, loss[loss=0.2711, simple_loss=0.3551, pruned_loss=0.09358, over 21607.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.324, pruned_loss=0.08856, over 4271017.00 frames. ], batch size: 263, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:25:56,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1484802.0, ans=0.0 2023-06-23 12:26:13,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1484862.0, ans=0.04949747468305833 2023-06-23 12:26:39,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1484982.0, ans=0.125 2023-06-23 12:26:50,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1484982.0, ans=0.125 2023-06-23 12:26:53,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1484982.0, ans=0.0 2023-06-23 12:26:55,738 INFO [train.py:996] (1/4) Epoch 9, batch 3550, loss[loss=0.2122, simple_loss=0.2804, pruned_loss=0.07199, over 21860.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3257, pruned_loss=0.08938, over 4274718.66 frames. ], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:27:18,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485102.0, ans=0.1 2023-06-23 12:27:31,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1485162.0, ans=0.05 2023-06-23 12:27:41,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-23 12:28:10,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.810e+02 5.500e+02 7.334e+02 1.032e+03 1.807e+03, threshold=1.467e+03, percent-clipped=3.0 2023-06-23 12:28:22,419 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:28:29,464 INFO [train.py:996] (1/4) Epoch 9, batch 3600, loss[loss=0.2038, simple_loss=0.2714, pruned_loss=0.06813, over 21743.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3197, pruned_loss=0.08844, over 4279005.55 frames. 
], batch size: 282, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:29:32,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1485462.0, ans=0.025 2023-06-23 12:29:36,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1485462.0, ans=0.125 2023-06-23 12:29:44,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1485522.0, ans=0.125 2023-06-23 12:29:51,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1485522.0, ans=0.125 2023-06-23 12:30:11,161 INFO [train.py:996] (1/4) Epoch 9, batch 3650, loss[loss=0.2315, simple_loss=0.2896, pruned_loss=0.0867, over 21848.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3204, pruned_loss=0.08838, over 4278645.22 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:30:27,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1485642.0, ans=0.125 2023-06-23 12:30:36,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-23 12:30:44,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485702.0, ans=0.1 2023-06-23 12:30:55,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1485702.0, ans=0.125 2023-06-23 12:30:55,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1485702.0, ans=10.0 2023-06-23 12:31:06,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1485762.0, ans=0.125 2023-06-23 12:31:21,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485822.0, ans=0.1 2023-06-23 12:31:34,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.571e+02 7.883e+02 1.166e+03 2.519e+03, threshold=1.577e+03, percent-clipped=13.0 2023-06-23 12:31:52,186 INFO [train.py:996] (1/4) Epoch 9, batch 3700, loss[loss=0.2427, simple_loss=0.3176, pruned_loss=0.08384, over 21852.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3206, pruned_loss=0.08837, over 4286489.31 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:32:53,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486062.0, ans=0.1 2023-06-23 12:33:41,739 INFO [train.py:996] (1/4) Epoch 9, batch 3750, loss[loss=0.2345, simple_loss=0.3085, pruned_loss=0.08025, over 21858.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3197, pruned_loss=0.08806, over 4290003.77 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:34:16,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1486302.0, ans=0.125 2023-06-23 12:34:26,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-06-23 12:34:29,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486362.0, ans=0.1 2023-06-23 12:34:54,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.333e+02 7.689e+02 1.174e+03 2.476e+03, threshold=1.538e+03, percent-clipped=10.0 2023-06-23 12:35:22,202 INFO [train.py:996] (1/4) Epoch 9, batch 3800, loss[loss=0.2187, simple_loss=0.2901, pruned_loss=0.07361, over 21617.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3182, pruned_loss=0.0867, over 4291796.76 frames. ], batch size: 263, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:35:33,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1486542.0, ans=0.035 2023-06-23 12:35:48,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1486602.0, ans=0.125 2023-06-23 12:35:57,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486662.0, ans=0.1 2023-06-23 12:36:08,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1486662.0, ans=0.5 2023-06-23 12:36:36,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1486782.0, ans=0.2 2023-06-23 12:36:55,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1486842.0, ans=0.2 2023-06-23 12:36:56,393 INFO [train.py:996] (1/4) Epoch 9, batch 3850, loss[loss=0.243, simple_loss=0.3019, pruned_loss=0.09207, over 21770.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3157, pruned_loss=0.08709, over 4288752.25 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:36:57,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-23 12:37:08,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-23 12:37:38,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1486962.0, ans=0.125 2023-06-23 12:38:03,155 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.513e+02 4.770e+02 6.158e+02 8.423e+02 1.897e+03, threshold=1.232e+03, percent-clipped=2.0 2023-06-23 12:38:08,795 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:38:18,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1487082.0, ans=0.0 2023-06-23 12:38:23,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-23 12:38:25,943 INFO [train.py:996] (1/4) Epoch 9, batch 3900, loss[loss=0.2192, simple_loss=0.2853, pruned_loss=0.07658, over 21466.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3103, pruned_loss=0.08605, over 4294445.72 frames. 
], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:38:33,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1487142.0, ans=0.0 2023-06-23 12:38:38,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=15.0 2023-06-23 12:39:03,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1487202.0, ans=0.0 2023-06-23 12:39:22,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-23 12:40:09,697 INFO [train.py:996] (1/4) Epoch 9, batch 3950, loss[loss=0.2104, simple_loss=0.2967, pruned_loss=0.06204, over 21446.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.311, pruned_loss=0.08511, over 4292629.07 frames. ], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:40:14,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-23 12:40:37,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-23 12:40:39,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1487502.0, ans=0.025 2023-06-23 12:41:16,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.422e+02 5.095e+02 8.161e+02 1.017e+03 2.071e+03, threshold=1.632e+03, percent-clipped=17.0 2023-06-23 12:41:32,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-23 12:41:49,084 INFO [train.py:996] (1/4) Epoch 9, batch 4000, loss[loss=0.1959, simple_loss=0.2562, pruned_loss=0.0678, over 21556.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3035, pruned_loss=0.08012, over 4287216.22 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:42:22,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1487802.0, ans=0.125 2023-06-23 12:42:54,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1487922.0, ans=0.125 2023-06-23 12:43:07,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1487982.0, ans=0.125 2023-06-23 12:43:29,602 INFO [train.py:996] (1/4) Epoch 9, batch 4050, loss[loss=0.204, simple_loss=0.2663, pruned_loss=0.07088, over 21414.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3033, pruned_loss=0.07837, over 4275979.31 frames. 
], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:43:51,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1488102.0, ans=0.125 2023-06-23 12:43:51,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1488102.0, ans=0.0 2023-06-23 12:43:51,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1488102.0, ans=0.1 2023-06-23 12:44:48,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.877e+02 6.874e+02 9.034e+02 2.185e+03, threshold=1.375e+03, percent-clipped=7.0 2023-06-23 12:44:58,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1488282.0, ans=0.125 2023-06-23 12:45:09,747 INFO [train.py:996] (1/4) Epoch 9, batch 4100, loss[loss=0.1951, simple_loss=0.2731, pruned_loss=0.05854, over 21234.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3062, pruned_loss=0.07889, over 4275115.39 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:45:12,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1488342.0, ans=0.125 2023-06-23 12:45:13,655 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:45:24,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-23 12:46:24,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1488522.0, ans=0.125 2023-06-23 12:46:27,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.74 vs. limit=6.0 2023-06-23 12:46:38,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1488582.0, ans=0.125 2023-06-23 12:46:41,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-23 12:46:51,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-23 12:46:54,916 INFO [train.py:996] (1/4) Epoch 9, batch 4150, loss[loss=0.2797, simple_loss=0.3395, pruned_loss=0.11, over 21373.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3069, pruned_loss=0.07712, over 4278794.92 frames. 
], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:47:00,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1488642.0, ans=0.125 2023-06-23 12:47:16,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1488702.0, ans=0.0 2023-06-23 12:47:24,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1488702.0, ans=0.125 2023-06-23 12:47:34,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1488762.0, ans=0.05 2023-06-23 12:47:56,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1488822.0, ans=0.125 2023-06-23 12:48:10,696 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.949e+02 7.437e+02 1.328e+03 3.049e+03, threshold=1.487e+03, percent-clipped=21.0 2023-06-23 12:48:26,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1488882.0, ans=0.2 2023-06-23 12:48:27,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1488882.0, ans=0.95 2023-06-23 12:48:32,533 INFO [train.py:996] (1/4) Epoch 9, batch 4200, loss[loss=0.2173, simple_loss=0.2897, pruned_loss=0.07244, over 21694.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3058, pruned_loss=0.07638, over 4272081.14 frames. ], batch size: 298, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:48:39,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1488942.0, ans=0.125 2023-06-23 12:49:00,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1489002.0, ans=0.2 2023-06-23 12:49:54,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489122.0, ans=0.1 2023-06-23 12:50:14,814 INFO [train.py:996] (1/4) Epoch 9, batch 4250, loss[loss=0.2784, simple_loss=0.3565, pruned_loss=0.1002, over 21768.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.314, pruned_loss=0.07884, over 4275135.14 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:50:18,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1489242.0, ans=0.125 2023-06-23 12:51:36,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1489422.0, ans=0.0 2023-06-23 12:51:43,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 6.108e+02 8.578e+02 1.174e+03 2.664e+03, threshold=1.716e+03, percent-clipped=12.0 2023-06-23 12:51:58,848 INFO [train.py:996] (1/4) Epoch 9, batch 4300, loss[loss=0.2525, simple_loss=0.38, pruned_loss=0.0625, over 19702.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3196, pruned_loss=0.08023, over 4270136.41 frames. 
], batch size: 702, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:52:40,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489602.0, ans=0.1 2023-06-23 12:53:22,231 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-23 12:53:37,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1489782.0, ans=0.125 2023-06-23 12:53:40,718 INFO [train.py:996] (1/4) Epoch 9, batch 4350, loss[loss=0.2388, simple_loss=0.3029, pruned_loss=0.08736, over 21674.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3199, pruned_loss=0.0802, over 4270437.07 frames. ], batch size: 299, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:54:40,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1489962.0, ans=0.125 2023-06-23 12:55:05,712 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 5.616e+02 8.582e+02 1.448e+03 3.184e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 12:55:30,739 INFO [train.py:996] (1/4) Epoch 9, batch 4400, loss[loss=0.2075, simple_loss=0.2845, pruned_loss=0.06523, over 21549.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3163, pruned_loss=0.07977, over 4271412.65 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:57:12,861 INFO [train.py:996] (1/4) Epoch 9, batch 4450, loss[loss=0.3185, simple_loss=0.398, pruned_loss=0.1195, over 21654.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3264, pruned_loss=0.08273, over 4275273.80 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:57:15,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 12:57:18,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1490442.0, ans=0.07 2023-06-23 12:57:23,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1490442.0, ans=0.0 2023-06-23 12:57:36,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-06-23 12:57:51,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1490502.0, ans=0.0 2023-06-23 12:58:01,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1490562.0, ans=0.2 2023-06-23 12:58:01,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490562.0, ans=0.1 2023-06-23 12:58:14,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490622.0, ans=0.1 2023-06-23 12:58:23,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1490622.0, ans=0.125 2023-06-23 12:58:39,560 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 6.461e+02 1.015e+03 1.659e+03 5.524e+03, threshold=2.029e+03, percent-clipped=20.0 2023-06-23 12:58:54,113 INFO [train.py:996] (1/4) Epoch 9, batch 4500, loss[loss=0.2172, simple_loss=0.2957, pruned_loss=0.06937, over 21210.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3259, pruned_loss=0.08405, over 4283182.70 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:59:42,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1490862.0, ans=0.0 2023-06-23 12:59:44,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1490862.0, ans=0.0 2023-06-23 12:59:55,768 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-23 13:00:42,567 INFO [train.py:996] (1/4) Epoch 9, batch 4550, loss[loss=0.2851, simple_loss=0.3626, pruned_loss=0.1039, over 21858.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3297, pruned_loss=0.08542, over 4283266.46 frames. ], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 13:00:43,937 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. 
limit=5.0 2023-06-23 13:01:09,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1491102.0, ans=0.035 2023-06-23 13:01:20,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1491162.0, ans=0.2 2023-06-23 13:01:34,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1491162.0, ans=0.125 2023-06-23 13:01:44,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1491222.0, ans=0.125 2023-06-23 13:02:03,742 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 4.940e+02 6.297e+02 8.588e+02 1.962e+03, threshold=1.259e+03, percent-clipped=0.0 2023-06-23 13:02:14,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1491282.0, ans=0.125 2023-06-23 13:02:14,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1491282.0, ans=0.0 2023-06-23 13:02:27,882 INFO [train.py:996] (1/4) Epoch 9, batch 4600, loss[loss=0.2597, simple_loss=0.3336, pruned_loss=0.09293, over 21877.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3307, pruned_loss=0.08657, over 4289707.67 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:02:33,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1491342.0, ans=0.1 2023-06-23 13:02:41,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1491342.0, ans=0.0 2023-06-23 13:03:01,602 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-23 13:03:34,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.59 vs. limit=22.5 2023-06-23 13:03:59,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1491582.0, ans=0.2 2023-06-23 13:04:01,961 INFO [train.py:996] (1/4) Epoch 9, batch 4650, loss[loss=0.1778, simple_loss=0.2463, pruned_loss=0.05469, over 21509.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3231, pruned_loss=0.08386, over 4284244.35 frames. 
], batch size: 195, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:04:31,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1491702.0, ans=0.0 2023-06-23 13:04:33,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1491762.0, ans=0.125 2023-06-23 13:04:51,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1491762.0, ans=0.0 2023-06-23 13:04:56,558 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:05:16,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.483e+02 4.910e+02 6.100e+02 8.336e+02 1.525e+03, threshold=1.220e+03, percent-clipped=3.0 2023-06-23 13:05:35,276 INFO [train.py:996] (1/4) Epoch 9, batch 4700, loss[loss=0.2325, simple_loss=0.3089, pruned_loss=0.07803, over 20689.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3126, pruned_loss=0.0812, over 4279839.79 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:07:13,921 INFO [train.py:996] (1/4) Epoch 9, batch 4750, loss[loss=0.2379, simple_loss=0.2979, pruned_loss=0.08893, over 21340.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3059, pruned_loss=0.08049, over 4279829.90 frames. ], batch size: 159, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:07:15,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1492242.0, ans=0.125 2023-06-23 13:07:31,886 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:07:41,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1492302.0, ans=0.125 2023-06-23 13:08:13,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-23 13:08:31,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-23 13:08:34,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.263e+02 4.594e+02 6.434e+02 8.978e+02 1.748e+03, threshold=1.287e+03, percent-clipped=12.0 2023-06-23 13:08:54,124 INFO [train.py:996] (1/4) Epoch 9, batch 4800, loss[loss=0.2315, simple_loss=0.324, pruned_loss=0.06945, over 21684.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3069, pruned_loss=0.08131, over 4287562.44 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:10:10,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1492782.0, ans=0.125 2023-06-23 13:10:32,163 INFO [train.py:996] (1/4) Epoch 9, batch 4850, loss[loss=0.2541, simple_loss=0.3168, pruned_loss=0.09569, over 21316.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3059, pruned_loss=0.08111, over 4286262.40 frames. 
], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:11:33,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1493022.0, ans=0.2 2023-06-23 13:11:35,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1493022.0, ans=0.0 2023-06-23 13:11:37,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1493022.0, ans=0.1 2023-06-23 13:11:46,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1493022.0, ans=0.1 2023-06-23 13:11:52,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 5.446e+02 6.914e+02 1.031e+03 2.241e+03, threshold=1.383e+03, percent-clipped=12.0 2023-06-23 13:12:10,733 INFO [train.py:996] (1/4) Epoch 9, batch 4900, loss[loss=0.3033, simple_loss=0.3816, pruned_loss=0.1125, over 21492.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3087, pruned_loss=0.0822, over 4285515.60 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:12:15,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1493142.0, ans=0.125 2023-06-23 13:12:40,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-23 13:13:33,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-23 13:13:50,343 INFO [train.py:996] (1/4) Epoch 9, batch 4950, loss[loss=0.2104, simple_loss=0.321, pruned_loss=0.04985, over 21188.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3121, pruned_loss=0.08038, over 4277205.18 frames. ], batch size: 548, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:14:06,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1493502.0, ans=0.125 2023-06-23 13:14:22,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1493502.0, ans=0.125 2023-06-23 13:14:36,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1493562.0, ans=0.125 2023-06-23 13:14:41,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1493562.0, ans=0.1 2023-06-23 13:15:16,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.885e+02 7.048e+02 1.090e+03 2.586e+03, threshold=1.410e+03, percent-clipped=12.0 2023-06-23 13:15:19,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1493682.0, ans=15.0 2023-06-23 13:15:19,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.69 vs. limit=15.0 2023-06-23 13:15:25,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1493682.0, ans=0.125 2023-06-23 13:15:29,191 INFO [train.py:996] (1/4) Epoch 9, batch 5000, loss[loss=0.26, simple_loss=0.3357, pruned_loss=0.09217, over 21849.00 frames. 
], tot_loss[loss=0.2327, simple_loss=0.3109, pruned_loss=0.07727, over 4282872.84 frames. ], batch size: 371, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:15:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1493802.0, ans=0.125 2023-06-23 13:15:55,534 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=12.0 2023-06-23 13:16:12,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-23 13:17:02,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=22.5 2023-06-23 13:17:04,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1493982.0, ans=0.07 2023-06-23 13:17:07,610 INFO [train.py:996] (1/4) Epoch 9, batch 5050, loss[loss=0.2314, simple_loss=0.3032, pruned_loss=0.07982, over 21493.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3107, pruned_loss=0.07928, over 4281244.28 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:17:20,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1494042.0, ans=0.05 2023-06-23 13:17:30,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1494102.0, ans=0.125 2023-06-23 13:18:05,623 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-23 13:18:28,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.800e+02 5.012e+02 6.560e+02 1.020e+03 2.026e+03, threshold=1.312e+03, percent-clipped=12.0 2023-06-23 13:18:42,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494282.0, ans=0.1 2023-06-23 13:18:45,167 INFO [train.py:996] (1/4) Epoch 9, batch 5100, loss[loss=0.1729, simple_loss=0.2523, pruned_loss=0.04679, over 21657.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3106, pruned_loss=0.07984, over 4278389.84 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:19:38,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1494462.0, ans=0.1 2023-06-23 13:20:25,883 INFO [train.py:996] (1/4) Epoch 9, batch 5150, loss[loss=0.2392, simple_loss=0.311, pruned_loss=0.08369, over 21897.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3094, pruned_loss=0.08021, over 4283632.76 frames. ], batch size: 107, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:20:28,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. 
limit=15.0 2023-06-23 13:20:55,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1494702.0, ans=0.125 2023-06-23 13:21:13,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1494762.0, ans=0.125 2023-06-23 13:21:21,287 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-23 13:21:54,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 5.169e+02 7.852e+02 1.261e+03 2.554e+03, threshold=1.570e+03, percent-clipped=23.0 2023-06-23 13:22:07,043 INFO [train.py:996] (1/4) Epoch 9, batch 5200, loss[loss=0.2008, simple_loss=0.2827, pruned_loss=0.05939, over 21588.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3138, pruned_loss=0.08057, over 4276730.41 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:22:15,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1494942.0, ans=0.1 2023-06-23 13:22:23,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1494942.0, ans=0.0 2023-06-23 13:23:28,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495182.0, ans=0.1 2023-06-23 13:23:36,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1495182.0, ans=0.125 2023-06-23 13:23:36,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1495182.0, ans=0.125 2023-06-23 13:23:45,418 INFO [train.py:996] (1/4) Epoch 9, batch 5250, loss[loss=0.2838, simple_loss=0.3593, pruned_loss=0.1041, over 21560.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3179, pruned_loss=0.07957, over 4278459.37 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:24:11,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1495302.0, ans=0.2 2023-06-23 13:24:13,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1495302.0, ans=0.035 2023-06-23 13:24:33,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1495362.0, ans=0.2 2023-06-23 13:24:45,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1495422.0, ans=0.1 2023-06-23 13:24:55,393 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:25:10,473 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.544e+02 5.638e+02 7.794e+02 1.189e+03 2.542e+03, threshold=1.559e+03, percent-clipped=12.0 2023-06-23 13:25:27,942 INFO [train.py:996] (1/4) Epoch 9, batch 5300, loss[loss=0.2237, simple_loss=0.2803, pruned_loss=0.08361, over 20730.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3172, pruned_loss=0.08033, over 4278437.58 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:25:32,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.10 vs. 
limit=12.0 2023-06-23 13:25:34,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1495542.0, ans=0.025 2023-06-23 13:26:44,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1495722.0, ans=0.125 2023-06-23 13:27:01,900 INFO [train.py:996] (1/4) Epoch 9, batch 5350, loss[loss=0.263, simple_loss=0.3899, pruned_loss=0.06801, over 19821.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3164, pruned_loss=0.08205, over 4280989.63 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:27:13,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1495842.0, ans=0.5 2023-06-23 13:27:14,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1495842.0, ans=0.125 2023-06-23 13:27:37,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.89 vs. limit=10.0 2023-06-23 13:27:46,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1495962.0, ans=0.125 2023-06-23 13:28:15,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1496022.0, ans=0.0 2023-06-23 13:28:20,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1496022.0, ans=0.04949747468305833 2023-06-23 13:28:28,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 6.687e+02 9.729e+02 1.336e+03 3.211e+03, threshold=1.946e+03, percent-clipped=15.0 2023-06-23 13:28:46,030 INFO [train.py:996] (1/4) Epoch 9, batch 5400, loss[loss=0.2648, simple_loss=0.3135, pruned_loss=0.108, over 21770.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3156, pruned_loss=0.08329, over 4290255.84 frames. ], batch size: 508, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:29:33,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-23 13:29:42,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1496262.0, ans=0.125 2023-06-23 13:29:50,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1496322.0, ans=0.125 2023-06-23 13:29:51,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1496322.0, ans=10.0 2023-06-23 13:30:30,287 INFO [train.py:996] (1/4) Epoch 9, batch 5450, loss[loss=0.2957, simple_loss=0.4018, pruned_loss=0.09477, over 21656.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3155, pruned_loss=0.08202, over 4287719.62 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:31:12,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. 
limit=12.0 2023-06-23 13:31:33,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1496622.0, ans=0.0 2023-06-23 13:31:38,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496622.0, ans=0.1 2023-06-23 13:31:53,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-23 13:31:53,799 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.097e+02 7.378e+02 1.209e+03 3.523e+03, threshold=1.476e+03, percent-clipped=4.0 2023-06-23 13:32:09,886 INFO [train.py:996] (1/4) Epoch 9, batch 5500, loss[loss=0.2215, simple_loss=0.3221, pruned_loss=0.0605, over 21780.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3177, pruned_loss=0.07815, over 4291467.58 frames. ], batch size: 371, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:32:17,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1496742.0, ans=0.04949747468305833 2023-06-23 13:32:41,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1496802.0, ans=0.5 2023-06-23 13:33:50,879 INFO [train.py:996] (1/4) Epoch 9, batch 5550, loss[loss=0.2856, simple_loss=0.3783, pruned_loss=0.0965, over 21489.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3183, pruned_loss=0.07631, over 4281124.00 frames. ], batch size: 471, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:34:28,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-23 13:35:17,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.782e+02 7.322e+02 1.097e+03 2.363e+03, threshold=1.464e+03, percent-clipped=11.0 2023-06-23 13:35:38,015 INFO [train.py:996] (1/4) Epoch 9, batch 5600, loss[loss=0.2831, simple_loss=0.3807, pruned_loss=0.09278, over 21686.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3169, pruned_loss=0.07423, over 4280753.11 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:35:40,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=22.5 2023-06-23 13:36:19,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1497462.0, ans=0.2 2023-06-23 13:36:24,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1497462.0, ans=0.1 2023-06-23 13:36:53,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1497582.0, ans=0.125 2023-06-23 13:36:54,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-23 13:37:10,937 INFO [train.py:996] (1/4) Epoch 9, batch 5650, loss[loss=0.2948, simple_loss=0.3504, pruned_loss=0.1196, over 21614.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3169, pruned_loss=0.07566, over 4274804.65 frames. 
], batch size: 471, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:38:32,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1497882.0, ans=0.5 2023-06-23 13:38:34,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.906e+02 7.888e+02 1.290e+03 2.997e+03, threshold=1.578e+03, percent-clipped=20.0 2023-06-23 13:38:46,245 INFO [train.py:996] (1/4) Epoch 9, batch 5700, loss[loss=0.254, simple_loss=0.3146, pruned_loss=0.09669, over 21292.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3165, pruned_loss=0.0775, over 4281712.14 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:38:49,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-23 13:39:18,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-23 13:39:39,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-23 13:40:20,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1498182.0, ans=0.0 2023-06-23 13:40:31,247 INFO [train.py:996] (1/4) Epoch 9, batch 5750, loss[loss=0.157, simple_loss=0.2387, pruned_loss=0.03765, over 21330.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3114, pruned_loss=0.07499, over 4275454.05 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:40:59,816 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-23 13:41:52,307 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.651e+02 7.542e+02 1.104e+03 3.145e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-23 13:41:52,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498482.0, ans=0.1 2023-06-23 13:42:06,762 INFO [train.py:996] (1/4) Epoch 9, batch 5800, loss[loss=0.2382, simple_loss=0.3418, pruned_loss=0.06732, over 21673.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3111, pruned_loss=0.07341, over 4270912.07 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:42:31,219 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:42:31,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1498602.0, ans=0.125 2023-06-23 13:42:39,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1498602.0, ans=0.2 2023-06-23 13:42:55,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1498662.0, ans=0.1 2023-06-23 13:43:52,457 INFO [train.py:996] (1/4) Epoch 9, batch 5850, loss[loss=0.177, simple_loss=0.2853, pruned_loss=0.03436, over 21767.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.309, pruned_loss=0.06871, over 4274633.32 frames. 
], batch size: 332, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:44:29,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1498902.0, ans=0.125 2023-06-23 13:44:36,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1498962.0, ans=0.125 2023-06-23 13:44:43,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-23 13:44:53,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-23 13:45:02,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=15.0 2023-06-23 13:45:18,562 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.169e+02 5.972e+02 8.890e+02 1.873e+03, threshold=1.194e+03, percent-clipped=6.0 2023-06-23 13:45:31,297 INFO [train.py:996] (1/4) Epoch 9, batch 5900, loss[loss=0.1825, simple_loss=0.2635, pruned_loss=0.05071, over 21769.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.302, pruned_loss=0.06465, over 4275836.36 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:46:20,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1499262.0, ans=0.0 2023-06-23 13:46:26,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1499262.0, ans=0.2 2023-06-23 13:47:07,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1499382.0, ans=0.125 2023-06-23 13:47:10,329 INFO [train.py:996] (1/4) Epoch 9, batch 5950, loss[loss=0.2215, simple_loss=0.2833, pruned_loss=0.07984, over 22019.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3013, pruned_loss=0.06712, over 4271569.37 frames. ], batch size: 103, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:47:36,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1499502.0, ans=0.125 2023-06-23 13:48:40,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.592e+02 7.986e+02 1.183e+03 2.385e+03, threshold=1.597e+03, percent-clipped=25.0 2023-06-23 13:48:48,906 INFO [train.py:996] (1/4) Epoch 9, batch 6000, loss[loss=0.2164, simple_loss=0.2763, pruned_loss=0.07824, over 21798.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2981, pruned_loss=0.07026, over 4275013.21 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:48:48,906 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 13:49:04,386 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6350, 1.5156, 2.3181, 2.0449, 1.4388, 2.4371, 2.2968, 1.1657], device='cuda:1') 2023-06-23 13:49:10,220 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2648, simple_loss=0.3557, pruned_loss=0.08691, over 1796401.00 frames. 
2023-06-23 13:49:10,221 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 13:49:20,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1499742.0, ans=0.125 2023-06-23 13:50:12,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1499922.0, ans=0.125 2023-06-23 13:50:37,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-23 13:50:50,972 INFO [train.py:996] (1/4) Epoch 9, batch 6050, loss[loss=0.2762, simple_loss=0.3838, pruned_loss=0.08426, over 20859.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2935, pruned_loss=0.07102, over 4273850.25 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:50:56,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1500042.0, ans=0.125 2023-06-23 13:51:05,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-23 13:51:13,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1500102.0, ans=0.125 2023-06-23 13:52:15,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 5.140e+02 6.887e+02 9.775e+02 3.553e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-23 13:52:21,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1500282.0, ans=0.125 2023-06-23 13:52:28,821 INFO [train.py:996] (1/4) Epoch 9, batch 6100, loss[loss=0.2462, simple_loss=0.3074, pruned_loss=0.09252, over 21579.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2926, pruned_loss=0.07055, over 4273404.49 frames. ], batch size: 195, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:54:01,351 INFO [train.py:996] (1/4) Epoch 9, batch 6150, loss[loss=0.2391, simple_loss=0.3079, pruned_loss=0.0852, over 21567.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2958, pruned_loss=0.07251, over 4280739.36 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:54:48,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500762.0, ans=0.1 2023-06-23 13:55:32,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.545e+02 5.300e+02 7.269e+02 1.178e+03 2.947e+03, threshold=1.454e+03, percent-clipped=13.0 2023-06-23 13:55:35,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1500882.0, ans=0.125 2023-06-23 13:55:46,199 INFO [train.py:996] (1/4) Epoch 9, batch 6200, loss[loss=0.2523, simple_loss=0.3238, pruned_loss=0.09035, over 21865.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2978, pruned_loss=0.07341, over 4276464.43 frames. 
], batch size: 107, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:56:13,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1501002.0, ans=0.0 2023-06-23 13:56:41,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1501062.0, ans=0.125 2023-06-23 13:56:46,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1501122.0, ans=0.125 2023-06-23 13:57:02,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1501122.0, ans=0.0 2023-06-23 13:57:25,632 INFO [train.py:996] (1/4) Epoch 9, batch 6250, loss[loss=0.2256, simple_loss=0.3345, pruned_loss=0.05839, over 21644.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3047, pruned_loss=0.07399, over 4282958.34 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:57:54,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1501302.0, ans=0.125 2023-06-23 13:58:41,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1501422.0, ans=0.0 2023-06-23 13:58:52,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-23 13:58:56,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.754e+02 9.579e+02 1.636e+03 2.645e+03, threshold=1.916e+03, percent-clipped=27.0 2023-06-23 13:59:04,191 INFO [train.py:996] (1/4) Epoch 9, batch 6300, loss[loss=0.2664, simple_loss=0.3291, pruned_loss=0.1019, over 21744.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3089, pruned_loss=0.07358, over 4287937.90 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:59:37,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-23 13:59:58,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1501662.0, ans=0.125 2023-06-23 14:00:07,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1501722.0, ans=0.125 2023-06-23 14:00:14,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.21 vs. limit=10.0 2023-06-23 14:00:30,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1501782.0, ans=0.1 2023-06-23 14:00:49,580 INFO [train.py:996] (1/4) Epoch 9, batch 6350, loss[loss=0.2736, simple_loss=0.3363, pruned_loss=0.1054, over 21436.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3126, pruned_loss=0.07773, over 4289734.82 frames. 
], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:00:54,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1501842.0, ans=0.2 2023-06-23 14:00:58,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1501842.0, ans=0.0 2023-06-23 14:01:20,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1501902.0, ans=0.125 2023-06-23 14:01:23,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1501902.0, ans=0.0 2023-06-23 14:01:26,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-23 14:01:36,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1501962.0, ans=0.0 2023-06-23 14:01:56,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-23 14:02:03,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-23 14:02:13,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-23 14:02:23,999 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 6.300e+02 8.863e+02 1.224e+03 2.908e+03, threshold=1.773e+03, percent-clipped=5.0 2023-06-23 14:02:32,236 INFO [train.py:996] (1/4) Epoch 9, batch 6400, loss[loss=0.2425, simple_loss=0.3224, pruned_loss=0.08126, over 21448.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3187, pruned_loss=0.08106, over 4286163.00 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:02:35,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502142.0, ans=0.1 2023-06-23 14:02:36,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1502142.0, ans=0.0 2023-06-23 14:02:38,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-06-23 14:03:34,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1502322.0, ans=0.125 2023-06-23 14:03:37,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1502322.0, ans=0.125 2023-06-23 14:03:53,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1502382.0, ans=0.125 2023-06-23 14:03:53,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1502382.0, ans=0.1 2023-06-23 14:04:10,692 INFO [train.py:996] (1/4) Epoch 9, batch 6450, loss[loss=0.2657, simple_loss=0.3514, pruned_loss=0.09002, over 21329.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3199, pruned_loss=0.08099, over 4279509.12 frames. 
], batch size: 549, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:04:22,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1502442.0, ans=0.0 2023-06-23 14:05:42,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.544e+02 7.373e+02 1.174e+03 2.232e+03, threshold=1.475e+03, percent-clipped=4.0 2023-06-23 14:05:45,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1502682.0, ans=0.0 2023-06-23 14:05:48,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502682.0, ans=0.1 2023-06-23 14:05:51,140 INFO [train.py:996] (1/4) Epoch 9, batch 6500, loss[loss=0.2417, simple_loss=0.2953, pruned_loss=0.09411, over 21262.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3132, pruned_loss=0.07958, over 4281501.08 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:06:00,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1502742.0, ans=0.125 2023-06-23 14:06:03,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1502742.0, ans=0.07 2023-06-23 14:06:10,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-23 14:07:27,886 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-23 14:07:35,329 INFO [train.py:996] (1/4) Epoch 9, batch 6550, loss[loss=0.2123, simple_loss=0.2826, pruned_loss=0.07101, over 21139.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3135, pruned_loss=0.07928, over 4281579.33 frames. ], batch size: 143, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:08:19,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1503162.0, ans=0.125 2023-06-23 14:08:24,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1503162.0, ans=0.2 2023-06-23 14:08:44,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1503222.0, ans=0.125 2023-06-23 14:09:01,020 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:09:02,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.667e+02 5.627e+02 7.547e+02 1.040e+03 2.189e+03, threshold=1.509e+03, percent-clipped=8.0 2023-06-23 14:09:15,140 INFO [train.py:996] (1/4) Epoch 9, batch 6600, loss[loss=0.1811, simple_loss=0.2433, pruned_loss=0.05948, over 21229.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3082, pruned_loss=0.07869, over 4271251.04 frames. 
], batch size: 548, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:09:25,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1503342.0, ans=0.025 2023-06-23 14:09:49,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1503402.0, ans=0.125 2023-06-23 14:10:10,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-23 14:10:19,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-23 14:10:46,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503582.0, ans=0.1 2023-06-23 14:10:47,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1503582.0, ans=0.125 2023-06-23 14:10:47,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1503582.0, ans=0.125 2023-06-23 14:10:55,774 INFO [train.py:996] (1/4) Epoch 9, batch 6650, loss[loss=0.1913, simple_loss=0.2559, pruned_loss=0.06338, over 21538.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3031, pruned_loss=0.07555, over 4274250.07 frames. ], batch size: 230, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:10:57,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503642.0, ans=0.1 2023-06-23 14:11:31,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1503762.0, ans=0.125 2023-06-23 14:11:44,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.08 vs. limit=15.0 2023-06-23 14:12:20,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1503882.0, ans=0.0 2023-06-23 14:12:29,799 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 5.915e+02 8.562e+02 1.227e+03 3.234e+03, threshold=1.712e+03, percent-clipped=18.0 2023-06-23 14:12:33,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1503882.0, ans=0.0 2023-06-23 14:12:36,199 INFO [train.py:996] (1/4) Epoch 9, batch 6700, loss[loss=0.2139, simple_loss=0.2679, pruned_loss=0.07988, over 21273.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2989, pruned_loss=0.07557, over 4275210.30 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:13:26,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1504062.0, ans=0.2 2023-06-23 14:14:14,349 INFO [train.py:996] (1/4) Epoch 9, batch 6750, loss[loss=0.2546, simple_loss=0.3168, pruned_loss=0.0962, over 21845.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2971, pruned_loss=0.0765, over 4271918.81 frames. 
], batch size: 371, lr: 3.34e-03, grad_scale: 8.0 2023-06-23 14:14:18,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1504242.0, ans=0.125 2023-06-23 14:14:37,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1504302.0, ans=0.125 2023-06-23 14:15:25,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1504422.0, ans=0.125 2023-06-23 14:15:46,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1504482.0, ans=0.05 2023-06-23 14:15:48,513 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.896e+02 6.568e+02 9.733e+02 1.340e+03 2.605e+03, threshold=1.947e+03, percent-clipped=12.0 2023-06-23 14:15:53,580 INFO [train.py:996] (1/4) Epoch 9, batch 6800, loss[loss=0.245, simple_loss=0.31, pruned_loss=0.08999, over 21756.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3006, pruned_loss=0.07915, over 4278768.71 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:16:01,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1504542.0, ans=0.95 2023-06-23 14:16:11,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1504542.0, ans=0.125 2023-06-23 14:17:08,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1504722.0, ans=0.1 2023-06-23 14:17:31,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=15.0 2023-06-23 14:17:32,357 INFO [train.py:996] (1/4) Epoch 9, batch 6850, loss[loss=0.2959, simple_loss=0.3374, pruned_loss=0.1272, over 21714.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.2983, pruned_loss=0.08035, over 4289117.57 frames. ], batch size: 508, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:18:56,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1505082.0, ans=0.0 2023-06-23 14:18:59,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1505082.0, ans=0.05 2023-06-23 14:19:07,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 4.770e+02 6.261e+02 9.211e+02 1.923e+03, threshold=1.252e+03, percent-clipped=0.0 2023-06-23 14:19:12,188 INFO [train.py:996] (1/4) Epoch 9, batch 6900, loss[loss=0.2352, simple_loss=0.3172, pruned_loss=0.07664, over 21819.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2987, pruned_loss=0.07973, over 4293411.56 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:19:27,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1505202.0, ans=0.125 2023-06-23 14:20:20,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1505322.0, ans=0.0 2023-06-23 14:20:38,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-06-23 14:20:51,940 INFO [train.py:996] (1/4) Epoch 9, batch 6950, loss[loss=0.2182, simple_loss=0.2947, pruned_loss=0.07087, over 21252.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2993, pruned_loss=0.07617, over 4290987.29 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:20:54,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1505442.0, ans=0.125 2023-06-23 14:21:09,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-23 14:21:29,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505502.0, ans=0.1 2023-06-23 14:22:01,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1505622.0, ans=0.125 2023-06-23 14:22:22,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1505682.0, ans=0.0 2023-06-23 14:22:26,508 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 5.568e+02 8.095e+02 1.122e+03 2.896e+03, threshold=1.619e+03, percent-clipped=20.0 2023-06-23 14:22:31,452 INFO [train.py:996] (1/4) Epoch 9, batch 7000, loss[loss=0.2264, simple_loss=0.29, pruned_loss=0.08141, over 21214.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3007, pruned_loss=0.07841, over 4286902.26 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:23:51,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1505922.0, ans=0.125 2023-06-23 14:23:58,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1505982.0, ans=0.125 2023-06-23 14:24:00,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1505982.0, ans=0.0 2023-06-23 14:24:16,411 INFO [train.py:996] (1/4) Epoch 9, batch 7050, loss[loss=0.2043, simple_loss=0.2874, pruned_loss=0.06058, over 21611.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2992, pruned_loss=0.07702, over 4280418.71 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:24:38,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-23 14:25:50,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 5.176e+02 7.948e+02 1.176e+03 2.286e+03, threshold=1.590e+03, percent-clipped=9.0 2023-06-23 14:25:55,059 INFO [train.py:996] (1/4) Epoch 9, batch 7100, loss[loss=0.1984, simple_loss=0.2744, pruned_loss=0.06126, over 21718.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3041, pruned_loss=0.07867, over 4277492.84 frames. 
], batch size: 247, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:26:03,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1506342.0, ans=0.125 2023-06-23 14:26:26,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1506402.0, ans=0.5 2023-06-23 14:26:35,848 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:26:57,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1506522.0, ans=0.2 2023-06-23 14:26:58,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1506522.0, ans=0.2 2023-06-23 14:27:19,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-23 14:27:22,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506582.0, ans=0.1 2023-06-23 14:27:35,254 INFO [train.py:996] (1/4) Epoch 9, batch 7150, loss[loss=0.2451, simple_loss=0.3258, pruned_loss=0.0822, over 21548.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3045, pruned_loss=0.0777, over 4279372.95 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:27:45,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1506642.0, ans=0.125 2023-06-23 14:29:10,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.602e+02 5.826e+02 7.818e+02 1.087e+03 2.405e+03, threshold=1.564e+03, percent-clipped=10.0 2023-06-23 14:29:21,021 INFO [train.py:996] (1/4) Epoch 9, batch 7200, loss[loss=0.2486, simple_loss=0.3113, pruned_loss=0.09293, over 21313.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3069, pruned_loss=0.0799, over 4267856.93 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:29:27,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1506942.0, ans=0.125 2023-06-23 14:29:50,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1507002.0, ans=0.125 2023-06-23 14:29:59,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507002.0, ans=0.1 2023-06-23 14:30:08,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1507062.0, ans=0.125 2023-06-23 14:31:00,728 INFO [train.py:996] (1/4) Epoch 9, batch 7250, loss[loss=0.2061, simple_loss=0.2786, pruned_loss=0.06682, over 14994.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3013, pruned_loss=0.07966, over 4257735.24 frames. ], batch size: 60, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:31:04,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507242.0, ans=0.1 2023-06-23 14:31:23,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. 
limit=15.0 2023-06-23 14:31:46,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 14:32:37,021 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.651e+02 4.870e+02 5.610e+02 7.177e+02 1.494e+03, threshold=1.122e+03, percent-clipped=0.0 2023-06-23 14:32:44,946 INFO [train.py:996] (1/4) Epoch 9, batch 7300, loss[loss=0.2166, simple_loss=0.2728, pruned_loss=0.08022, over 21383.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2945, pruned_loss=0.07832, over 4265976.13 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:33:03,021 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:33:03,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1507602.0, ans=0.0 2023-06-23 14:33:53,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-23 14:33:56,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1507722.0, ans=0.125 2023-06-23 14:34:25,799 INFO [train.py:996] (1/4) Epoch 9, batch 7350, loss[loss=0.2716, simple_loss=0.3426, pruned_loss=0.1003, over 21827.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2948, pruned_loss=0.07979, over 4269766.65 frames. ], batch size: 118, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:34:27,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1507842.0, ans=0.2 2023-06-23 14:34:58,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1507902.0, ans=0.95 2023-06-23 14:35:26,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-23 14:35:51,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1508082.0, ans=0.0 2023-06-23 14:36:02,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 6.185e+02 8.335e+02 1.224e+03 2.285e+03, threshold=1.667e+03, percent-clipped=37.0 2023-06-23 14:36:06,158 INFO [train.py:996] (1/4) Epoch 9, batch 7400, loss[loss=0.2439, simple_loss=0.3137, pruned_loss=0.08708, over 21427.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3016, pruned_loss=0.08143, over 4275746.14 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:36:18,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-23 14:36:26,862 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:36:38,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1508202.0, ans=0.125 2023-06-23 14:37:47,587 INFO [train.py:996] (1/4) Epoch 9, batch 7450, loss[loss=0.2387, simple_loss=0.3036, pruned_loss=0.08686, over 21834.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.2993, pruned_loss=0.07943, over 4277824.92 frames. 
], batch size: 98, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:38:05,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-23 14:38:30,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1508562.0, ans=0.0 2023-06-23 14:38:39,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1508562.0, ans=0.5 2023-06-23 14:39:04,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1508622.0, ans=0.0 2023-06-23 14:39:09,780 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-06-23 14:39:14,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1508682.0, ans=0.95 2023-06-23 14:39:26,566 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 5.448e+02 8.462e+02 1.438e+03 2.608e+03, threshold=1.692e+03, percent-clipped=12.0 2023-06-23 14:39:35,293 INFO [train.py:996] (1/4) Epoch 9, batch 7500, loss[loss=0.2528, simple_loss=0.3297, pruned_loss=0.08798, over 21385.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3061, pruned_loss=0.08192, over 4282751.87 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:39:44,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-23 14:40:10,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-23 14:40:39,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1508922.0, ans=0.125 2023-06-23 14:41:16,348 INFO [train.py:996] (1/4) Epoch 9, batch 7550, loss[loss=0.1855, simple_loss=0.2573, pruned_loss=0.05684, over 16387.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3113, pruned_loss=0.08076, over 4277420.25 frames. ], batch size: 61, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:41:23,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1509042.0, ans=0.0 2023-06-23 14:41:37,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1509102.0, ans=0.125 2023-06-23 14:41:44,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.57 vs. 
limit=15.0 2023-06-23 14:41:59,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509162.0, ans=0.1 2023-06-23 14:42:18,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1509222.0, ans=0.0 2023-06-23 14:42:52,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.575e+02 5.410e+02 7.103e+02 1.048e+03 2.085e+03, threshold=1.421e+03, percent-clipped=3.0 2023-06-23 14:42:53,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-23 14:42:56,200 INFO [train.py:996] (1/4) Epoch 9, batch 7600, loss[loss=0.2489, simple_loss=0.3199, pruned_loss=0.08898, over 21881.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.311, pruned_loss=0.08067, over 4272897.63 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:43:25,361 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.89 vs. limit=22.5 2023-06-23 14:43:33,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1509462.0, ans=0.1 2023-06-23 14:43:47,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-23 14:43:54,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509522.0, ans=0.1 2023-06-23 14:44:11,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-23 14:44:37,258 INFO [train.py:996] (1/4) Epoch 9, batch 7650, loss[loss=0.2416, simple_loss=0.3141, pruned_loss=0.08454, over 21864.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3094, pruned_loss=0.08192, over 4277869.25 frames. ], batch size: 124, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:44:37,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1509642.0, ans=0.125 2023-06-23 14:44:43,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1509642.0, ans=0.04949747468305833 2023-06-23 14:45:01,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-23 14:46:07,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509882.0, ans=0.1 2023-06-23 14:46:07,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1509882.0, ans=0.0 2023-06-23 14:46:15,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 5.554e+02 6.899e+02 1.039e+03 2.407e+03, threshold=1.380e+03, percent-clipped=12.0 2023-06-23 14:46:18,630 INFO [train.py:996] (1/4) Epoch 9, batch 7700, loss[loss=0.2025, simple_loss=0.2603, pruned_loss=0.07234, over 21058.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3135, pruned_loss=0.08582, over 4284788.58 frames. 
], batch size: 608, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:46:47,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-23 14:47:05,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-23 14:47:16,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1510062.0, ans=0.0 2023-06-23 14:47:27,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1510122.0, ans=0.2 2023-06-23 14:47:51,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1510182.0, ans=0.1 2023-06-23 14:47:58,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-23 14:48:05,193 INFO [train.py:996] (1/4) Epoch 9, batch 7750, loss[loss=0.2052, simple_loss=0.2752, pruned_loss=0.06758, over 20227.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3174, pruned_loss=0.08478, over 4282763.57 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:48:49,598 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-23 14:49:23,993 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:49:44,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.753e+02 6.023e+02 8.792e+02 1.462e+03 2.647e+03, threshold=1.758e+03, percent-clipped=26.0 2023-06-23 14:49:46,188 INFO [train.py:996] (1/4) Epoch 9, batch 7800, loss[loss=0.2833, simple_loss=0.3512, pruned_loss=0.1077, over 21548.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.319, pruned_loss=0.08474, over 4285151.37 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:49:59,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1510542.0, ans=0.2 2023-06-23 14:50:07,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1510602.0, ans=0.125 2023-06-23 14:50:09,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1510602.0, ans=0.0 2023-06-23 14:50:32,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1510662.0, ans=0.2 2023-06-23 14:50:50,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1510722.0, ans=0.1 2023-06-23 14:51:25,390 INFO [train.py:996] (1/4) Epoch 9, batch 7850, loss[loss=0.2246, simple_loss=0.2834, pruned_loss=0.08286, over 21414.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3125, pruned_loss=0.08348, over 4283897.55 frames. 
], batch size: 195, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:51:38,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1510842.0, ans=0.0 2023-06-23 14:52:50,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1511082.0, ans=0.125 2023-06-23 14:52:56,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1511082.0, ans=0.0 2023-06-23 14:53:05,533 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.699e+02 5.401e+02 8.491e+02 1.335e+03 3.211e+03, threshold=1.698e+03, percent-clipped=14.0 2023-06-23 14:53:06,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 14:53:07,044 INFO [train.py:996] (1/4) Epoch 9, batch 7900, loss[loss=0.2661, simple_loss=0.3753, pruned_loss=0.07845, over 21639.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3066, pruned_loss=0.08106, over 4284675.01 frames. ], batch size: 414, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:53:34,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2023-06-23 14:54:36,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1511382.0, ans=0.0 2023-06-23 14:54:46,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1511382.0, ans=0.0 2023-06-23 14:54:49,002 INFO [train.py:996] (1/4) Epoch 9, batch 7950, loss[loss=0.2754, simple_loss=0.3492, pruned_loss=0.1008, over 21756.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.309, pruned_loss=0.08076, over 4282971.12 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:54:56,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1511442.0, ans=0.1 2023-06-23 14:55:52,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-23 14:56:40,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 6.156e+02 9.056e+02 1.636e+03 2.892e+03, threshold=1.811e+03, percent-clipped=22.0 2023-06-23 14:56:41,921 INFO [train.py:996] (1/4) Epoch 9, batch 8000, loss[loss=0.2444, simple_loss=0.3428, pruned_loss=0.07299, over 21303.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3142, pruned_loss=0.08231, over 4280418.35 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:57:30,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-23 14:57:31,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1511862.0, ans=0.1 2023-06-23 14:58:32,842 INFO [train.py:996] (1/4) Epoch 9, batch 8050, loss[loss=0.2014, simple_loss=0.257, pruned_loss=0.07295, over 21837.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3187, pruned_loss=0.08331, over 4279989.48 frames. 
], batch size: 107, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:59:09,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=1512162.0, ans=12.0 2023-06-23 15:00:10,917 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 6.350e+02 8.777e+02 1.241e+03 2.449e+03, threshold=1.755e+03, percent-clipped=9.0 2023-06-23 15:00:12,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-23 15:00:12,609 INFO [train.py:996] (1/4) Epoch 9, batch 8100, loss[loss=0.1897, simple_loss=0.2644, pruned_loss=0.05752, over 21822.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3165, pruned_loss=0.08354, over 4282084.51 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:00:18,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1512342.0, ans=0.0 2023-06-23 15:01:08,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1512462.0, ans=0.2 2023-06-23 15:01:23,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.84 vs. limit=10.0 2023-06-23 15:01:59,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1512582.0, ans=0.5 2023-06-23 15:02:01,823 INFO [train.py:996] (1/4) Epoch 9, batch 8150, loss[loss=0.2052, simple_loss=0.293, pruned_loss=0.05868, over 19940.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3246, pruned_loss=0.0854, over 4279938.83 frames. ], batch size: 703, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:02:54,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1512762.0, ans=0.125 2023-06-23 15:03:07,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1512822.0, ans=0.125 2023-06-23 15:03:10,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1512822.0, ans=0.125 2023-06-23 15:03:13,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1512822.0, ans=0.2 2023-06-23 15:03:18,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1512882.0, ans=0.125 2023-06-23 15:03:40,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 6.551e+02 1.054e+03 1.725e+03 4.751e+03, threshold=2.109e+03, percent-clipped=24.0 2023-06-23 15:03:40,815 INFO [train.py:996] (1/4) Epoch 9, batch 8200, loss[loss=0.1989, simple_loss=0.262, pruned_loss=0.06792, over 21665.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3166, pruned_loss=0.08203, over 4274202.56 frames. 
], batch size: 248, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:03:46,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1512942.0, ans=0.125 2023-06-23 15:04:35,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1513062.0, ans=0.0 2023-06-23 15:05:13,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1513182.0, ans=0.0 2023-06-23 15:05:22,560 INFO [train.py:996] (1/4) Epoch 9, batch 8250, loss[loss=0.2687, simple_loss=0.3892, pruned_loss=0.07415, over 20816.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3164, pruned_loss=0.08262, over 4273746.65 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:05:24,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1513242.0, ans=0.125 2023-06-23 15:05:44,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-23 15:06:05,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-23 15:06:34,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1513422.0, ans=0.0 2023-06-23 15:06:50,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1513482.0, ans=0.125 2023-06-23 15:07:03,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1513542.0, ans=0.125 2023-06-23 15:07:04,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.470e+02 6.606e+02 8.935e+02 1.467e+03 2.616e+03, threshold=1.787e+03, percent-clipped=8.0 2023-06-23 15:07:04,515 INFO [train.py:996] (1/4) Epoch 9, batch 8300, loss[loss=0.2691, simple_loss=0.3543, pruned_loss=0.09189, over 21654.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3157, pruned_loss=0.08046, over 4281579.65 frames. ], batch size: 414, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:07:18,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1513542.0, ans=0.1 2023-06-23 15:07:47,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. 
limit=15.0 2023-06-23 15:07:49,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1513662.0, ans=0.0 2023-06-23 15:07:51,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1513662.0, ans=0.0 2023-06-23 15:08:16,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1513722.0, ans=0.125 2023-06-23 15:08:22,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513722.0, ans=0.1 2023-06-23 15:08:37,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1513782.0, ans=0.1 2023-06-23 15:08:49,805 INFO [train.py:996] (1/4) Epoch 9, batch 8350, loss[loss=0.2492, simple_loss=0.3163, pruned_loss=0.09105, over 21333.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3136, pruned_loss=0.07841, over 4271088.08 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:09:45,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1513962.0, ans=0.125 2023-06-23 15:10:27,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1514082.0, ans=0.125 2023-06-23 15:10:30,571 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 4.465e+02 5.586e+02 8.616e+02 2.675e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-23 15:10:30,603 INFO [train.py:996] (1/4) Epoch 9, batch 8400, loss[loss=0.2056, simple_loss=0.2874, pruned_loss=0.06192, over 21671.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3115, pruned_loss=0.07635, over 4263396.56 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:11:07,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-23 15:11:23,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-23 15:11:38,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1514322.0, ans=0.1 2023-06-23 15:12:09,835 INFO [train.py:996] (1/4) Epoch 9, batch 8450, loss[loss=0.2527, simple_loss=0.3187, pruned_loss=0.0933, over 21898.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3099, pruned_loss=0.07628, over 4273353.86 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:12:11,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1514442.0, ans=0.125 2023-06-23 15:12:16,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.84 vs. 
limit=15.0 2023-06-23 15:12:59,297 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:13:00,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1514562.0, ans=0.125 2023-06-23 15:13:31,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1514682.0, ans=0.0 2023-06-23 15:13:34,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1514682.0, ans=0.125 2023-06-23 15:13:49,166 INFO [train.py:996] (1/4) Epoch 9, batch 8500, loss[loss=0.222, simple_loss=0.2885, pruned_loss=0.07778, over 21529.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.307, pruned_loss=0.07708, over 4270475.88 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:13:50,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.348e+02 5.860e+02 7.972e+02 1.284e+03 3.475e+03, threshold=1.594e+03, percent-clipped=30.0 2023-06-23 15:14:23,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1514802.0, ans=0.0 2023-06-23 15:14:24,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1514862.0, ans=0.125 2023-06-23 15:15:02,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1514922.0, ans=0.035 2023-06-23 15:15:29,030 INFO [train.py:996] (1/4) Epoch 9, batch 8550, loss[loss=0.2303, simple_loss=0.3177, pruned_loss=0.07146, over 21738.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3103, pruned_loss=0.0793, over 4273012.73 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:15:29,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1515042.0, ans=0.125 2023-06-23 15:15:42,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1515042.0, ans=0.125 2023-06-23 15:16:23,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1515162.0, ans=0.09899494936611666 2023-06-23 15:16:37,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1515222.0, ans=0.125 2023-06-23 15:16:47,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515222.0, ans=0.1 2023-06-23 15:16:50,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1515282.0, ans=0.0 2023-06-23 15:17:16,099 INFO [train.py:996] (1/4) Epoch 9, batch 8600, loss[loss=0.2754, simple_loss=0.3554, pruned_loss=0.09774, over 21396.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3174, pruned_loss=0.08116, over 4272515.74 frames. 
], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:17:17,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.589e+02 6.156e+02 8.850e+02 1.190e+03 2.823e+03, threshold=1.770e+03, percent-clipped=15.0 2023-06-23 15:17:24,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1515342.0, ans=0.125 2023-06-23 15:18:02,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1515462.0, ans=0.125 2023-06-23 15:18:58,342 INFO [train.py:996] (1/4) Epoch 9, batch 8650, loss[loss=0.2097, simple_loss=0.3048, pruned_loss=0.0573, over 21634.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3234, pruned_loss=0.08258, over 4279918.20 frames. ], batch size: 263, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:19:08,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1515642.0, ans=0.125 2023-06-23 15:20:04,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1515822.0, ans=0.2 2023-06-23 15:20:22,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.36 vs. limit=22.5 2023-06-23 15:20:37,575 INFO [train.py:996] (1/4) Epoch 9, batch 8700, loss[loss=0.1763, simple_loss=0.2399, pruned_loss=0.05638, over 21477.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.313, pruned_loss=0.07929, over 4276172.39 frames. ], batch size: 195, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:20:39,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.219e+02 7.580e+02 1.289e+03 2.063e+03, threshold=1.516e+03, percent-clipped=5.0 2023-06-23 15:20:57,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1516002.0, ans=0.0 2023-06-23 15:21:32,048 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:22:16,402 INFO [train.py:996] (1/4) Epoch 9, batch 8750, loss[loss=0.229, simple_loss=0.2963, pruned_loss=0.08079, over 21165.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3087, pruned_loss=0.07988, over 4282462.30 frames. 
], batch size: 176, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:22:16,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1516242.0, ans=0.0 2023-06-23 15:22:20,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1516242.0, ans=0.125 2023-06-23 15:22:42,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1516302.0, ans=0.0 2023-06-23 15:23:11,549 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:23:11,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1516362.0, ans=0.125 2023-06-23 15:23:30,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1516422.0, ans=0.05 2023-06-23 15:23:42,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1516482.0, ans=0.0 2023-06-23 15:23:55,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=15.0 2023-06-23 15:23:56,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1516482.0, ans=0.0 2023-06-23 15:23:59,313 INFO [train.py:996] (1/4) Epoch 9, batch 8800, loss[loss=0.2979, simple_loss=0.3989, pruned_loss=0.09846, over 19822.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3202, pruned_loss=0.083, over 4282117.38 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:24:00,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.692e+02 5.630e+02 7.362e+02 1.054e+03 2.858e+03, threshold=1.472e+03, percent-clipped=8.0 2023-06-23 15:24:03,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1516542.0, ans=0.2 2023-06-23 15:24:07,013 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-23 15:24:27,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1516602.0, ans=0.125 2023-06-23 15:24:37,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1516602.0, ans=0.125 2023-06-23 15:24:45,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.28 vs. limit=10.0 2023-06-23 15:25:10,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1516722.0, ans=0.125 2023-06-23 15:25:27,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1516782.0, ans=6.0 2023-06-23 15:25:43,311 INFO [train.py:996] (1/4) Epoch 9, batch 8850, loss[loss=0.2124, simple_loss=0.2996, pruned_loss=0.06259, over 21684.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3269, pruned_loss=0.08471, over 4277523.24 frames. 
], batch size: 298, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:25:45,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1516842.0, ans=0.0 2023-06-23 15:26:17,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1516902.0, ans=0.125 2023-06-23 15:26:17,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1516902.0, ans=0.125 2023-06-23 15:27:23,370 INFO [train.py:996] (1/4) Epoch 9, batch 8900, loss[loss=0.2178, simple_loss=0.307, pruned_loss=0.06431, over 21734.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3232, pruned_loss=0.08309, over 4272934.26 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:27:29,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-23 15:27:30,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.861e+02 8.789e+02 1.394e+03 2.613e+03, threshold=1.758e+03, percent-clipped=19.0 2023-06-23 15:27:40,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1517142.0, ans=0.0 2023-06-23 15:28:32,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1517322.0, ans=0.2 2023-06-23 15:29:06,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1517382.0, ans=0.125 2023-06-23 15:29:10,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-23 15:29:10,584 INFO [train.py:996] (1/4) Epoch 9, batch 8950, loss[loss=0.2468, simple_loss=0.3194, pruned_loss=0.08705, over 21685.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3256, pruned_loss=0.08339, over 4271370.27 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:05,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1517562.0, ans=0.125 2023-06-23 15:30:05,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1517562.0, ans=0.125 2023-06-23 15:30:06,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1517622.0, ans=0.95 2023-06-23 15:30:39,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-23 15:30:40,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1517682.0, ans=0.2 2023-06-23 15:30:49,434 INFO [train.py:996] (1/4) Epoch 9, batch 9000, loss[loss=0.2446, simple_loss=0.3032, pruned_loss=0.09299, over 21584.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3189, pruned_loss=0.08298, over 4270621.85 frames. 
], batch size: 415, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:49,435 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 15:31:03,322 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.0772, 2.2279, 1.9930, 3.0361], device='cuda:1') 2023-06-23 15:31:06,582 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.258, simple_loss=0.3541, pruned_loss=0.08091, over 1796401.00 frames. 2023-06-23 15:31:06,583 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 15:31:08,199 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 6.929e+02 1.126e+03 1.882e+03 3.988e+03, threshold=2.252e+03, percent-clipped=24.0 2023-06-23 15:31:24,091 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-23 15:31:33,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1517802.0, ans=0.04949747468305833 2023-06-23 15:31:56,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1517862.0, ans=0.0 2023-06-23 15:32:07,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1517862.0, ans=0.125 2023-06-23 15:32:53,766 INFO [train.py:996] (1/4) Epoch 9, batch 9050, loss[loss=0.2699, simple_loss=0.3889, pruned_loss=0.07539, over 19848.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3153, pruned_loss=0.07979, over 4262854.39 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:33:48,579 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-23 15:34:01,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-23 15:34:05,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-23 15:34:07,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1518222.0, ans=0.125 2023-06-23 15:34:39,773 INFO [train.py:996] (1/4) Epoch 9, batch 9100, loss[loss=0.2337, simple_loss=0.3309, pruned_loss=0.06826, over 21791.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3203, pruned_loss=0.08213, over 4267856.82 frames. 
], batch size: 316, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:34:42,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 5.248e+02 7.167e+02 1.150e+03 2.223e+03, threshold=1.433e+03, percent-clipped=0.0 2023-06-23 15:34:45,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1518342.0, ans=0.125 2023-06-23 15:35:32,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1518462.0, ans=0.125 2023-06-23 15:35:38,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1518522.0, ans=0.2 2023-06-23 15:35:47,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1518522.0, ans=0.035 2023-06-23 15:36:20,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1518642.0, ans=0.125 2023-06-23 15:36:21,762 INFO [train.py:996] (1/4) Epoch 9, batch 9150, loss[loss=0.2391, simple_loss=0.3183, pruned_loss=0.07992, over 21413.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.32, pruned_loss=0.07909, over 4270096.10 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:37:32,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1518822.0, ans=0.125 2023-06-23 15:37:36,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-23 15:37:57,531 INFO [train.py:996] (1/4) Epoch 9, batch 9200, loss[loss=0.2459, simple_loss=0.3339, pruned_loss=0.07891, over 21642.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3227, pruned_loss=0.0781, over 4266408.79 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 15:37:58,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1518942.0, ans=0.0 2023-06-23 15:38:01,470 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.542e+02 9.064e+02 1.359e+03 2.938e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-23 15:38:20,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-23 15:38:38,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=15.0 2023-06-23 15:38:42,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-23 15:38:55,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1519122.0, ans=0.0 2023-06-23 15:38:55,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1519122.0, ans=0.125 2023-06-23 15:38:56,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.43 vs. limit=6.0 2023-06-23 15:39:09,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=12.0 2023-06-23 15:39:10,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1519122.0, ans=0.125 2023-06-23 15:39:33,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-23 15:39:33,632 INFO [train.py:996] (1/4) Epoch 9, batch 9250, loss[loss=0.2544, simple_loss=0.3231, pruned_loss=0.09278, over 21786.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3246, pruned_loss=0.08203, over 4271401.09 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:40:12,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1519302.0, ans=0.0 2023-06-23 15:40:27,142 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=15.0 2023-06-23 15:41:07,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1519482.0, ans=0.125 2023-06-23 15:41:16,033 INFO [train.py:996] (1/4) Epoch 9, batch 9300, loss[loss=0.2488, simple_loss=0.3483, pruned_loss=0.07462, over 21730.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3176, pruned_loss=0.08169, over 4279094.20 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:41:20,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 6.631e+02 9.639e+02 1.652e+03 4.303e+03, threshold=1.928e+03, percent-clipped=19.0 2023-06-23 15:41:39,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1519542.0, ans=0.125 2023-06-23 15:42:14,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1519662.0, ans=0.1 2023-06-23 15:42:15,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1519662.0, ans=0.125 2023-06-23 15:42:27,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1519722.0, ans=0.125 2023-06-23 15:42:42,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.91 vs. limit=6.0 2023-06-23 15:43:03,541 INFO [train.py:996] (1/4) Epoch 9, batch 9350, loss[loss=0.2075, simple_loss=0.28, pruned_loss=0.0675, over 20713.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.324, pruned_loss=0.08254, over 4270038.75 frames. ], batch size: 607, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:43:12,688 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-23 15:43:36,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-23 15:44:08,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1520022.0, ans=0.125 2023-06-23 15:44:10,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=15.0 2023-06-23 15:44:21,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1520022.0, ans=0.07 2023-06-23 15:44:22,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-23 15:44:50,899 INFO [train.py:996] (1/4) Epoch 9, batch 9400, loss[loss=0.2892, simple_loss=0.3299, pruned_loss=0.1243, over 21339.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.324, pruned_loss=0.08358, over 4265754.36 frames. ], batch size: 507, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:44:57,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 5.147e+02 6.319e+02 1.049e+03 2.062e+03, threshold=1.264e+03, percent-clipped=1.0 2023-06-23 15:45:11,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1520202.0, ans=0.04949747468305833 2023-06-23 15:45:32,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1520262.0, ans=0.0 2023-06-23 15:46:02,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1520322.0, ans=0.1 2023-06-23 15:46:12,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-23 15:46:15,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1520382.0, ans=0.07 2023-06-23 15:46:30,674 INFO [train.py:996] (1/4) Epoch 9, batch 9450, loss[loss=0.2694, simple_loss=0.4178, pruned_loss=0.06053, over 19838.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3185, pruned_loss=0.08314, over 4257376.79 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:46:32,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1520442.0, ans=12.0 2023-06-23 15:46:48,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.26 vs. limit=6.0 2023-06-23 15:47:03,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1520502.0, ans=0.125 2023-06-23 15:47:09,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1520562.0, ans=0.1 2023-06-23 15:47:21,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-23 15:47:47,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.21 vs. limit=6.0 2023-06-23 15:48:06,779 INFO [train.py:996] (1/4) Epoch 9, batch 9500, loss[loss=0.2292, simple_loss=0.2892, pruned_loss=0.08467, over 21832.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3101, pruned_loss=0.08083, over 4254525.88 frames. 
], batch size: 107, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:48:13,341 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 6.669e+02 1.059e+03 1.542e+03 2.765e+03, threshold=2.119e+03, percent-clipped=38.0 2023-06-23 15:48:33,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1520802.0, ans=0.0 2023-06-23 15:48:45,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-23 15:49:38,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.53 vs. limit=10.0 2023-06-23 15:49:48,343 INFO [train.py:996] (1/4) Epoch 9, batch 9550, loss[loss=0.2757, simple_loss=0.3593, pruned_loss=0.0961, over 21831.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3156, pruned_loss=0.08319, over 4250902.81 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:49:58,754 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-23 15:50:01,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1521042.0, ans=0.125 2023-06-23 15:50:15,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1521102.0, ans=0.0 2023-06-23 15:50:25,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-23 15:50:48,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1521222.0, ans=0.125 2023-06-23 15:50:50,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1521222.0, ans=0.2 2023-06-23 15:51:05,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-23 15:51:22,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1521282.0, ans=0.5 2023-06-23 15:51:25,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1521282.0, ans=0.125 2023-06-23 15:51:28,290 INFO [train.py:996] (1/4) Epoch 9, batch 9600, loss[loss=0.2472, simple_loss=0.3172, pruned_loss=0.08859, over 21485.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3172, pruned_loss=0.08419, over 4258286.83 frames. 
], batch size: 548, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:51:35,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.691e+02 5.650e+02 7.031e+02 8.940e+02 1.543e+03, threshold=1.406e+03, percent-clipped=0.0 2023-06-23 15:51:49,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1521402.0, ans=0.0 2023-06-23 15:52:15,218 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:52:40,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1521522.0, ans=0.0 2023-06-23 15:52:51,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1521522.0, ans=0.0 2023-06-23 15:52:51,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1521522.0, ans=0.0 2023-06-23 15:52:54,486 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:53:10,519 INFO [train.py:996] (1/4) Epoch 9, batch 9650, loss[loss=0.234, simple_loss=0.3084, pruned_loss=0.07984, over 21729.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3162, pruned_loss=0.08302, over 4252448.40 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:54:10,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-23 15:54:51,611 INFO [train.py:996] (1/4) Epoch 9, batch 9700, loss[loss=0.237, simple_loss=0.3091, pruned_loss=0.08248, over 21894.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3195, pruned_loss=0.0839, over 4257973.99 frames. ], batch size: 124, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:55:02,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 5.546e+02 7.387e+02 1.131e+03 2.841e+03, threshold=1.477e+03, percent-clipped=15.0 2023-06-23 15:55:05,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1521942.0, ans=0.125 2023-06-23 15:55:16,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1522002.0, ans=0.125 2023-06-23 15:55:46,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1522062.0, ans=0.125 2023-06-23 15:56:32,932 INFO [train.py:996] (1/4) Epoch 9, batch 9750, loss[loss=0.2925, simple_loss=0.3615, pruned_loss=0.1117, over 21244.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3136, pruned_loss=0.08285, over 4263711.98 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:57:40,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1522422.0, ans=0.125 2023-06-23 15:57:42,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-23 15:57:52,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.55 vs. 
limit=22.5 2023-06-23 15:58:09,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-23 15:58:11,394 INFO [train.py:996] (1/4) Epoch 9, batch 9800, loss[loss=0.2587, simple_loss=0.3212, pruned_loss=0.09813, over 21894.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3143, pruned_loss=0.08306, over 4263986.05 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:58:18,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 5.907e+02 7.792e+02 1.093e+03 2.144e+03, threshold=1.558e+03, percent-clipped=9.0 2023-06-23 15:59:03,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1522662.0, ans=0.2 2023-06-23 15:59:12,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-23 15:59:53,363 INFO [train.py:996] (1/4) Epoch 9, batch 9850, loss[loss=0.2224, simple_loss=0.29, pruned_loss=0.07745, over 21794.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3112, pruned_loss=0.08363, over 4274001.73 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:59:57,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1522842.0, ans=0.0 2023-06-23 16:00:15,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1522842.0, ans=0.1 2023-06-23 16:01:02,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1523022.0, ans=0.125 2023-06-23 16:01:34,838 INFO [train.py:996] (1/4) Epoch 9, batch 9900, loss[loss=0.2496, simple_loss=0.3184, pruned_loss=0.09036, over 21727.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3085, pruned_loss=0.08278, over 4262896.92 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:01:45,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.699e+02 7.870e+02 1.232e+03 3.104e+03, threshold=1.574e+03, percent-clipped=11.0 2023-06-23 16:01:59,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1523202.0, ans=0.0 2023-06-23 16:02:37,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1523322.0, ans=0.2 2023-06-23 16:03:15,850 INFO [train.py:996] (1/4) Epoch 9, batch 9950, loss[loss=0.2373, simple_loss=0.3, pruned_loss=0.08724, over 21533.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3106, pruned_loss=0.08517, over 4269548.69 frames. 
], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:03:41,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1523502.0, ans=0.0 2023-06-23 16:04:07,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1523562.0, ans=0.2 2023-06-23 16:04:25,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1523622.0, ans=0.125 2023-06-23 16:04:26,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1523622.0, ans=0.125 2023-06-23 16:05:02,533 INFO [train.py:996] (1/4) Epoch 9, batch 10000, loss[loss=0.2505, simple_loss=0.3166, pruned_loss=0.09219, over 21226.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3053, pruned_loss=0.08336, over 4267968.35 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 16:05:14,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 5.460e+02 7.211e+02 1.053e+03 2.107e+03, threshold=1.442e+03, percent-clipped=5.0 2023-06-23 16:05:21,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1523742.0, ans=0.125 2023-06-23 16:05:35,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-23 16:06:01,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1523922.0, ans=15.0 2023-06-23 16:06:50,059 INFO [train.py:996] (1/4) Epoch 9, batch 10050, loss[loss=0.2209, simple_loss=0.3011, pruned_loss=0.07032, over 21598.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.307, pruned_loss=0.08341, over 4275727.47 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:06:52,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1524042.0, ans=0.125 2023-06-23 16:06:59,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.45 vs. limit=10.0 2023-06-23 16:07:02,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-23 16:07:15,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1524102.0, ans=0.125 2023-06-23 16:07:16,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-23 16:07:24,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1524102.0, ans=0.125 2023-06-23 16:07:58,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1524222.0, ans=0.125 2023-06-23 16:08:25,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1524282.0, ans=0.125 2023-06-23 16:08:33,206 INFO [train.py:996] (1/4) Epoch 9, batch 10100, loss[loss=0.2425, simple_loss=0.3259, pruned_loss=0.07949, over 21643.00 frames. 
], tot_loss[loss=0.2345, simple_loss=0.3061, pruned_loss=0.08151, over 4273437.03 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:08:41,443 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.845e+02 8.901e+02 1.389e+03 2.930e+03, threshold=1.780e+03, percent-clipped=23.0 2023-06-23 16:08:50,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-23 16:09:13,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1524462.0, ans=0.125 2023-06-23 16:09:19,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1524462.0, ans=0.125 2023-06-23 16:09:32,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1524522.0, ans=0.125 2023-06-23 16:09:59,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1524582.0, ans=0.1 2023-06-23 16:10:08,510 INFO [train.py:996] (1/4) Epoch 9, batch 10150, loss[loss=0.241, simple_loss=0.3103, pruned_loss=0.08591, over 21345.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3119, pruned_loss=0.08372, over 4273969.54 frames. ], batch size: 548, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:10:30,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1524702.0, ans=0.125 2023-06-23 16:11:19,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524822.0, ans=0.1 2023-06-23 16:11:28,620 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.88 vs. limit=10.0 2023-06-23 16:11:41,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-23 16:11:48,574 INFO [train.py:996] (1/4) Epoch 9, batch 10200, loss[loss=0.1981, simple_loss=0.2987, pruned_loss=0.0488, over 20740.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3095, pruned_loss=0.08086, over 4269791.42 frames. ], batch size: 608, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:11:49,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-23 16:12:03,220 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.213e+02 5.208e+02 7.016e+02 1.136e+03 3.363e+03, threshold=1.403e+03, percent-clipped=6.0 2023-06-23 16:12:45,908 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:12:50,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1525122.0, ans=0.0 2023-06-23 16:13:00,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-06-23 16:13:04,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-23 16:13:24,906 INFO [train.py:996] (1/4) Epoch 9, batch 10250, loss[loss=0.2595, simple_loss=0.3333, pruned_loss=0.09283, over 21315.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3052, pruned_loss=0.07523, over 4271311.73 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:13:49,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1525302.0, ans=0.0 2023-06-23 16:15:09,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525482.0, ans=0.1 2023-06-23 16:15:14,145 INFO [train.py:996] (1/4) Epoch 9, batch 10300, loss[loss=0.3001, simple_loss=0.367, pruned_loss=0.1166, over 21386.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3091, pruned_loss=0.0766, over 4268558.10 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:15:23,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1525542.0, ans=0.125 2023-06-23 16:15:24,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 5.852e+02 8.943e+02 1.203e+03 2.933e+03, threshold=1.789e+03, percent-clipped=17.0 2023-06-23 16:16:28,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525722.0, ans=0.1 2023-06-23 16:16:56,879 INFO [train.py:996] (1/4) Epoch 9, batch 10350, loss[loss=0.2028, simple_loss=0.2705, pruned_loss=0.06756, over 21677.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3098, pruned_loss=0.07642, over 4271484.72 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:17:05,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525842.0, ans=0.1 2023-06-23 16:17:05,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1525842.0, ans=0.125 2023-06-23 16:17:29,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-23 16:17:51,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1525962.0, ans=0.0 2023-06-23 16:18:27,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1526082.0, ans=0.0 2023-06-23 16:18:29,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1526082.0, ans=0.125 2023-06-23 16:18:37,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1526082.0, ans=0.0 2023-06-23 16:18:45,383 INFO [train.py:996] (1/4) Epoch 9, batch 10400, loss[loss=0.2368, simple_loss=0.315, pruned_loss=0.07929, over 21602.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3026, pruned_loss=0.07535, over 4269533.32 frames. 
], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:18:55,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.585e+02 9.781e+02 1.543e+03 3.065e+03, threshold=1.956e+03, percent-clipped=20.0 2023-06-23 16:19:36,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-23 16:20:17,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526382.0, ans=0.1 2023-06-23 16:20:28,487 INFO [train.py:996] (1/4) Epoch 9, batch 10450, loss[loss=0.2936, simple_loss=0.3737, pruned_loss=0.1068, over 21613.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3084, pruned_loss=0.07901, over 4272643.51 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:21:16,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1526562.0, ans=0.07 2023-06-23 16:21:21,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1526562.0, ans=0.125 2023-06-23 16:21:41,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-23 16:21:47,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1526682.0, ans=0.125 2023-06-23 16:22:09,078 INFO [train.py:996] (1/4) Epoch 9, batch 10500, loss[loss=0.2021, simple_loss=0.2701, pruned_loss=0.06708, over 21208.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.306, pruned_loss=0.07771, over 4268933.36 frames. ], batch size: 548, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:22:23,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 6.343e+02 8.149e+02 1.174e+03 2.736e+03, threshold=1.630e+03, percent-clipped=6.0 2023-06-23 16:23:53,915 INFO [train.py:996] (1/4) Epoch 9, batch 10550, loss[loss=0.2247, simple_loss=0.2877, pruned_loss=0.08083, over 21580.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2999, pruned_loss=0.07723, over 4275870.57 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:23:54,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1527042.0, ans=0.125 2023-06-23 16:24:09,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1527042.0, ans=0.125 2023-06-23 16:24:16,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. 
limit=15.0 2023-06-23 16:24:18,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1527102.0, ans=0.0 2023-06-23 16:24:24,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1527102.0, ans=0.125 2023-06-23 16:24:50,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1527162.0, ans=0.05 2023-06-23 16:24:50,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1527162.0, ans=0.0 2023-06-23 16:25:11,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1527282.0, ans=0.2 2023-06-23 16:25:35,500 INFO [train.py:996] (1/4) Epoch 9, batch 10600, loss[loss=0.1796, simple_loss=0.2484, pruned_loss=0.05534, over 21069.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.295, pruned_loss=0.07562, over 4267067.06 frames. ], batch size: 143, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:25:50,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.123e+02 6.754e+02 9.468e+02 2.113e+03, threshold=1.351e+03, percent-clipped=4.0 2023-06-23 16:26:00,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1527402.0, ans=0.0 2023-06-23 16:26:06,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1527402.0, ans=0.0 2023-06-23 16:26:38,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1527522.0, ans=0.125 2023-06-23 16:26:40,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1527522.0, ans=0.0 2023-06-23 16:27:22,981 INFO [train.py:996] (1/4) Epoch 9, batch 10650, loss[loss=0.2072, simple_loss=0.2949, pruned_loss=0.05976, over 21560.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2981, pruned_loss=0.07505, over 4261661.36 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:27:50,959 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:28:17,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1527822.0, ans=0.125 2023-06-23 16:29:03,472 INFO [train.py:996] (1/4) Epoch 9, batch 10700, loss[loss=0.2419, simple_loss=0.3498, pruned_loss=0.06704, over 20797.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2994, pruned_loss=0.07537, over 4259263.21 frames. ], batch size: 608, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:29:10,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1527942.0, ans=0.125 2023-06-23 16:29:12,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.925e+02 6.514e+02 1.117e+03 1.445e+03 3.043e+03, threshold=2.235e+03, percent-clipped=29.0 2023-06-23 16:29:49,834 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.08 vs. 
limit=10.0 2023-06-23 16:30:09,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1528122.0, ans=10.0 2023-06-23 16:30:47,165 INFO [train.py:996] (1/4) Epoch 9, batch 10750, loss[loss=0.2171, simple_loss=0.3149, pruned_loss=0.05966, over 20843.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3092, pruned_loss=0.07937, over 4260394.91 frames. ], batch size: 608, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:30:56,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1528242.0, ans=0.0 2023-06-23 16:30:59,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1528242.0, ans=0.04949747468305833 2023-06-23 16:31:15,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1528302.0, ans=0.0 2023-06-23 16:32:33,896 INFO [train.py:996] (1/4) Epoch 9, batch 10800, loss[loss=0.2601, simple_loss=0.3339, pruned_loss=0.09312, over 21213.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3127, pruned_loss=0.07938, over 4268869.63 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:32:43,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.066e+02 7.349e+02 1.067e+03 2.269e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-23 16:32:43,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1528542.0, ans=0.125 2023-06-23 16:32:47,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1528542.0, ans=0.0 2023-06-23 16:33:08,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1528602.0, ans=10.0 2023-06-23 16:33:08,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1528602.0, ans=0.09899494936611666 2023-06-23 16:33:48,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1528722.0, ans=0.125 2023-06-23 16:33:55,665 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-23 16:34:08,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1528782.0, ans=0.0 2023-06-23 16:34:14,720 INFO [train.py:996] (1/4) Epoch 9, batch 10850, loss[loss=0.2543, simple_loss=0.3273, pruned_loss=0.09066, over 21320.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3138, pruned_loss=0.07991, over 4264887.21 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:34:20,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1528842.0, ans=0.125 2023-06-23 16:35:24,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-23 16:35:27,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-23 16:35:56,341 INFO [train.py:996] (1/4) Epoch 9, batch 10900, loss[loss=0.2052, simple_loss=0.3029, pruned_loss=0.05372, over 21748.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.308, pruned_loss=0.07911, over 4264022.40 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:36:08,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1529142.0, ans=0.125 2023-06-23 16:36:12,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.519e+02 5.159e+02 7.524e+02 1.150e+03 2.135e+03, threshold=1.505e+03, percent-clipped=11.0 2023-06-23 16:37:36,167 INFO [train.py:996] (1/4) Epoch 9, batch 10950, loss[loss=0.2083, simple_loss=0.2848, pruned_loss=0.06585, over 21774.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3049, pruned_loss=0.07769, over 4264841.97 frames. ], batch size: 124, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:37:56,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-23 16:39:08,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1529682.0, ans=0.0 2023-06-23 16:39:12,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1529682.0, ans=0.125 2023-06-23 16:39:16,297 INFO [train.py:996] (1/4) Epoch 9, batch 11000, loss[loss=0.2905, simple_loss=0.3448, pruned_loss=0.1181, over 21634.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3047, pruned_loss=0.07884, over 4266617.01 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:39:32,532 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 5.350e+02 8.050e+02 1.212e+03 3.028e+03, threshold=1.610e+03, percent-clipped=11.0 2023-06-23 16:40:03,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1529862.0, ans=0.125 2023-06-23 16:40:43,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1529982.0, ans=0.2 2023-06-23 16:40:54,217 INFO [train.py:996] (1/4) Epoch 9, batch 11050, loss[loss=0.2407, simple_loss=0.2957, pruned_loss=0.09286, over 21830.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3017, pruned_loss=0.07968, over 4264146.05 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:41:30,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1530102.0, ans=0.0 2023-06-23 16:41:32,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1530102.0, ans=0.125 2023-06-23 16:41:36,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=15.0 2023-06-23 16:41:52,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1530222.0, ans=0.125 2023-06-23 16:41:53,686 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-23 16:42:38,872 INFO [train.py:996] (1/4) Epoch 9, batch 11100, loss[loss=0.2166, simple_loss=0.2959, pruned_loss=0.06869, over 21385.00 frames. 
], tot_loss[loss=0.2286, simple_loss=0.3, pruned_loss=0.07858, over 4250180.97 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:42:40,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1530342.0, ans=0.2 2023-06-23 16:42:51,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.087e+02 6.616e+02 8.877e+02 2.244e+03, threshold=1.323e+03, percent-clipped=5.0 2023-06-23 16:42:56,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1530402.0, ans=0.125 2023-06-23 16:43:08,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1530402.0, ans=0.0 2023-06-23 16:44:06,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1530582.0, ans=0.0 2023-06-23 16:44:18,614 INFO [train.py:996] (1/4) Epoch 9, batch 11150, loss[loss=0.2225, simple_loss=0.3103, pruned_loss=0.06738, over 21799.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2982, pruned_loss=0.07827, over 4250151.14 frames. ], batch size: 317, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:44:24,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.62 vs. limit=15.0 2023-06-23 16:44:40,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=15.0 2023-06-23 16:45:03,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1530762.0, ans=0.025 2023-06-23 16:45:15,828 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.05 vs. limit=6.0 2023-06-23 16:45:58,171 INFO [train.py:996] (1/4) Epoch 9, batch 11200, loss[loss=0.2352, simple_loss=0.2945, pruned_loss=0.08801, over 22019.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2971, pruned_loss=0.0784, over 4255786.58 frames. ], batch size: 103, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:46:06,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1530942.0, ans=0.04949747468305833 2023-06-23 16:46:10,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.536e+02 7.546e+02 1.213e+03 2.221e+03, threshold=1.509e+03, percent-clipped=19.0 2023-06-23 16:47:01,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-23 16:47:08,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1531122.0, ans=0.2 2023-06-23 16:47:18,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1531182.0, ans=0.125 2023-06-23 16:47:28,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1531182.0, ans=0.125 2023-06-23 16:47:32,770 INFO [train.py:996] (1/4) Epoch 9, batch 11250, loss[loss=0.2086, simple_loss=0.2934, pruned_loss=0.06188, over 21814.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.2968, pruned_loss=0.07851, over 4247941.50 frames. ], batch size: 333, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:47:52,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1531302.0, ans=0.125 2023-06-23 16:48:03,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1531302.0, ans=0.125 2023-06-23 16:49:11,424 INFO [train.py:996] (1/4) Epoch 9, batch 11300, loss[loss=0.1922, simple_loss=0.2777, pruned_loss=0.05334, over 21868.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2984, pruned_loss=0.07925, over 4257120.65 frames. ], batch size: 316, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:49:12,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1531542.0, ans=0.125 2023-06-23 16:49:28,321 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 5.299e+02 7.050e+02 1.034e+03 1.810e+03, threshold=1.410e+03, percent-clipped=1.0 2023-06-23 16:49:35,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1531602.0, ans=0.1 2023-06-23 16:49:49,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1531602.0, ans=0.125 2023-06-23 16:50:21,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1531722.0, ans=0.125 2023-06-23 16:50:31,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1531722.0, ans=0.0 2023-06-23 16:50:56,818 INFO [train.py:996] (1/4) Epoch 9, batch 11350, loss[loss=0.2577, simple_loss=0.3327, pruned_loss=0.09137, over 21707.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3008, pruned_loss=0.07882, over 4256120.74 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:51:53,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1531962.0, ans=0.0 2023-06-23 16:52:00,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-23 16:52:01,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1532022.0, ans=0.0 2023-06-23 16:52:38,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1532142.0, ans=15.0 2023-06-23 16:52:39,051 INFO [train.py:996] (1/4) Epoch 9, batch 11400, loss[loss=0.2797, simple_loss=0.348, pruned_loss=0.1057, over 21432.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3074, pruned_loss=0.08158, over 4258182.63 frames. 
], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:52:56,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 6.591e+02 8.859e+02 1.390e+03 3.018e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 16:53:35,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1532262.0, ans=0.125 2023-06-23 16:54:12,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1532382.0, ans=0.05 2023-06-23 16:54:20,216 INFO [train.py:996] (1/4) Epoch 9, batch 11450, loss[loss=0.2354, simple_loss=0.3197, pruned_loss=0.07559, over 21861.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3101, pruned_loss=0.08101, over 4266085.74 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:54:24,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1532442.0, ans=0.125 2023-06-23 16:55:19,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1532562.0, ans=0.125 2023-06-23 16:55:22,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1532562.0, ans=0.05 2023-06-23 16:55:25,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1532622.0, ans=10.0 2023-06-23 16:56:02,578 INFO [train.py:996] (1/4) Epoch 9, batch 11500, loss[loss=0.2258, simple_loss=0.3226, pruned_loss=0.0645, over 21872.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.312, pruned_loss=0.08081, over 4259565.38 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:56:03,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1532742.0, ans=0.125 2023-06-23 16:56:12,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1532742.0, ans=0.125 2023-06-23 16:56:13,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-23 16:56:19,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.487e+02 5.481e+02 7.380e+02 1.202e+03 2.850e+03, threshold=1.476e+03, percent-clipped=9.0 2023-06-23 16:56:25,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-23 16:56:57,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=15.0 2023-06-23 16:57:01,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1532862.0, ans=0.125 2023-06-23 16:57:19,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1532922.0, ans=0.0 2023-06-23 16:57:33,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1532982.0, ans=0.2 2023-06-23 16:57:49,167 INFO [train.py:996] (1/4) Epoch 9, batch 11550, loss[loss=0.3409, simple_loss=0.4416, pruned_loss=0.1201, over 21652.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3183, pruned_loss=0.08122, over 4266432.34 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:57:53,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1533042.0, ans=0.125 2023-06-23 16:58:45,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1533162.0, ans=0.1 2023-06-23 16:59:08,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-23 16:59:19,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1533282.0, ans=0.0 2023-06-23 16:59:31,706 INFO [train.py:996] (1/4) Epoch 9, batch 11600, loss[loss=0.3125, simple_loss=0.4142, pruned_loss=0.1054, over 21673.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3297, pruned_loss=0.08192, over 4255634.73 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:59:50,462 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 7.053e+02 9.279e+02 1.499e+03 3.190e+03, threshold=1.856e+03, percent-clipped=25.0 2023-06-23 17:00:24,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1533462.0, ans=0.125 2023-06-23 17:00:37,253 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:00:58,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1533582.0, ans=0.125 2023-06-23 17:01:12,931 INFO [train.py:996] (1/4) Epoch 9, batch 11650, loss[loss=0.2164, simple_loss=0.2833, pruned_loss=0.07478, over 20843.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3339, pruned_loss=0.08232, over 4254951.12 frames. ], batch size: 609, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:01:15,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533642.0, ans=0.125 2023-06-23 17:01:58,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1533762.0, ans=0.0 2023-06-23 17:02:26,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1533822.0, ans=0.05 2023-06-23 17:02:52,859 INFO [train.py:996] (1/4) Epoch 9, batch 11700, loss[loss=0.1996, simple_loss=0.2682, pruned_loss=0.06543, over 21603.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.324, pruned_loss=0.08127, over 4238114.04 frames. 
], batch size: 332, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:03:01,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1533942.0, ans=0.0 2023-06-23 17:03:10,681 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 7.514e+02 1.058e+03 1.633e+03 4.255e+03, threshold=2.116e+03, percent-clipped=16.0 2023-06-23 17:03:18,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-23 17:04:31,759 INFO [train.py:996] (1/4) Epoch 9, batch 11750, loss[loss=0.2168, simple_loss=0.2787, pruned_loss=0.07747, over 21861.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3153, pruned_loss=0.08102, over 4242562.75 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:06:17,874 INFO [train.py:996] (1/4) Epoch 9, batch 11800, loss[loss=0.2319, simple_loss=0.335, pruned_loss=0.06441, over 21884.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3184, pruned_loss=0.08372, over 4254416.39 frames. ], batch size: 372, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:06:32,025 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.688e+02 5.572e+02 8.368e+02 1.434e+03 3.192e+03, threshold=1.674e+03, percent-clipped=11.0 2023-06-23 17:06:33,374 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.98 vs. limit=10.0 2023-06-23 17:06:45,307 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-23 17:07:58,042 INFO [train.py:996] (1/4) Epoch 9, batch 11850, loss[loss=0.2418, simple_loss=0.3173, pruned_loss=0.0832, over 21620.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3204, pruned_loss=0.08302, over 4265157.78 frames. ], batch size: 230, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:08:37,636 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:09:32,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-23 17:09:39,315 INFO [train.py:996] (1/4) Epoch 9, batch 11900, loss[loss=0.2288, simple_loss=0.3099, pruned_loss=0.07389, over 20685.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3194, pruned_loss=0.08037, over 4253074.50 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:44,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1535142.0, ans=0.025 2023-06-23 17:09:56,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1535142.0, ans=0.025 2023-06-23 17:09:59,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.472e+02 7.234e+02 9.480e+02 2.463e+03, threshold=1.447e+03, percent-clipped=1.0 2023-06-23 17:10:01,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=12.0 2023-06-23 17:10:22,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1535262.0, ans=0.2 2023-06-23 17:11:26,118 INFO [train.py:996] (1/4) Epoch 9, batch 11950, loss[loss=0.1262, simple_loss=0.1797, pruned_loss=0.03633, over 16260.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3205, pruned_loss=0.07801, over 4248148.78 frames. ], batch size: 60, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:12:04,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-23 17:12:12,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1535562.0, ans=0.0 2023-06-23 17:12:36,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1535622.0, ans=15.0 2023-06-23 17:12:47,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535682.0, ans=0.1 2023-06-23 17:13:03,666 INFO [train.py:996] (1/4) Epoch 9, batch 12000, loss[loss=0.2128, simple_loss=0.2758, pruned_loss=0.07488, over 21828.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3134, pruned_loss=0.07582, over 4248861.59 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:13:03,667 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 17:13:24,465 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2567, simple_loss=0.3528, pruned_loss=0.08029, over 1796401.00 frames. 2023-06-23 17:13:24,466 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 17:13:38,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.383e+02 5.788e+02 7.844e+02 1.305e+03 3.845e+03, threshold=1.569e+03, percent-clipped=19.0 2023-06-23 17:14:13,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-23 17:14:14,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1535862.0, ans=0.2 2023-06-23 17:14:29,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1535922.0, ans=0.125 2023-06-23 17:14:42,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1535982.0, ans=0.125 2023-06-23 17:15:03,719 INFO [train.py:996] (1/4) Epoch 9, batch 12050, loss[loss=0.2189, simple_loss=0.2844, pruned_loss=0.07673, over 21514.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3101, pruned_loss=0.07711, over 4253695.12 frames. 
], batch size: 211, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:15:09,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536042.0, ans=0.1 2023-06-23 17:15:31,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536102.0, ans=0.1 2023-06-23 17:15:31,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1536102.0, ans=0.04949747468305833 2023-06-23 17:16:01,241 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:16:13,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1536222.0, ans=0.0 2023-06-23 17:16:22,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1536282.0, ans=0.125 2023-06-23 17:16:45,374 INFO [train.py:996] (1/4) Epoch 9, batch 12100, loss[loss=0.2799, simple_loss=0.3483, pruned_loss=0.1058, over 21805.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3162, pruned_loss=0.08212, over 4264147.62 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:17:01,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.749e+02 9.796e+02 1.461e+03 3.096e+03, threshold=1.959e+03, percent-clipped=20.0 2023-06-23 17:17:13,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1536402.0, ans=0.125 2023-06-23 17:17:29,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1536462.0, ans=0.0 2023-06-23 17:18:31,922 INFO [train.py:996] (1/4) Epoch 9, batch 12150, loss[loss=0.2473, simple_loss=0.3489, pruned_loss=0.07281, over 21854.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3208, pruned_loss=0.08163, over 4259767.39 frames. ], batch size: 371, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:18:37,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1536642.0, ans=0.125 2023-06-23 17:19:00,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-23 17:19:09,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1536702.0, ans=0.125 2023-06-23 17:19:25,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1536762.0, ans=0.1 2023-06-23 17:19:57,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1536882.0, ans=0.125 2023-06-23 17:19:59,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1536882.0, ans=0.125 2023-06-23 17:20:02,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-23 17:20:11,296 INFO [train.py:996] (1/4) Epoch 9, batch 12200, loss[loss=0.2403, simple_loss=0.2991, pruned_loss=0.09072, over 21545.00 frames. 
], tot_loss[loss=0.2383, simple_loss=0.3153, pruned_loss=0.08067, over 4250785.00 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:20:17,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1536942.0, ans=0.125 2023-06-23 17:20:32,017 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.485e+02 6.911e+02 1.120e+03 1.509e+03 3.105e+03, threshold=2.240e+03, percent-clipped=12.0 2023-06-23 17:20:45,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1537002.0, ans=0.2 2023-06-23 17:21:45,653 INFO [train.py:996] (1/4) Epoch 9, batch 12250, loss[loss=0.1636, simple_loss=0.2397, pruned_loss=0.0438, over 21336.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3077, pruned_loss=0.07729, over 4256134.86 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:23:24,464 INFO [train.py:996] (1/4) Epoch 9, batch 12300, loss[loss=0.2774, simple_loss=0.3679, pruned_loss=0.09342, over 20824.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3032, pruned_loss=0.07237, over 4252384.17 frames. ], batch size: 607, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:23:45,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.150e+02 7.519e+02 1.212e+03 3.138e+03, threshold=1.504e+03, percent-clipped=3.0 2023-06-23 17:24:25,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1537722.0, ans=0.125 2023-06-23 17:24:27,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1537722.0, ans=0.0 2023-06-23 17:24:38,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1537722.0, ans=0.2 2023-06-23 17:25:02,691 INFO [train.py:996] (1/4) Epoch 9, batch 12350, loss[loss=0.247, simple_loss=0.3294, pruned_loss=0.08226, over 21617.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.307, pruned_loss=0.07315, over 4254847.59 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:25:39,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1537902.0, ans=0.125 2023-06-23 17:25:41,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1537902.0, ans=0.125 2023-06-23 17:26:21,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1538022.0, ans=0.0 2023-06-23 17:26:26,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1538082.0, ans=0.2 2023-06-23 17:26:40,791 INFO [train.py:996] (1/4) Epoch 9, batch 12400, loss[loss=0.2375, simple_loss=0.3034, pruned_loss=0.08584, over 21891.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.309, pruned_loss=0.07663, over 4269217.01 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:26:41,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. 
limit=15.0 2023-06-23 17:26:59,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1538142.0, ans=0.04949747468305833 2023-06-23 17:27:01,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.603e+02 7.484e+02 1.004e+03 2.626e+03, threshold=1.497e+03, percent-clipped=10.0 2023-06-23 17:27:09,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-23 17:27:49,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1538322.0, ans=0.125 2023-06-23 17:27:53,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1538322.0, ans=0.125 2023-06-23 17:28:11,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538382.0, ans=0.1 2023-06-23 17:28:19,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538382.0, ans=0.1 2023-06-23 17:28:25,786 INFO [train.py:996] (1/4) Epoch 9, batch 12450, loss[loss=0.2948, simple_loss=0.3566, pruned_loss=0.1165, over 21620.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3121, pruned_loss=0.07963, over 4272260.40 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:28:46,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1538502.0, ans=0.2 2023-06-23 17:29:14,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1538562.0, ans=0.1 2023-06-23 17:29:43,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538682.0, ans=0.1 2023-06-23 17:30:02,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.03 vs. limit=5.0 2023-06-23 17:30:11,455 INFO [train.py:996] (1/4) Epoch 9, batch 12500, loss[loss=0.2468, simple_loss=0.3437, pruned_loss=0.07496, over 21323.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3234, pruned_loss=0.08358, over 4279523.04 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:30:33,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.952e+02 7.744e+02 1.112e+03 2.842e+03, threshold=1.549e+03, percent-clipped=7.0 2023-06-23 17:30:34,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1538802.0, ans=0.125 2023-06-23 17:31:09,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. 
limit=22.5 2023-06-23 17:31:23,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1538922.0, ans=0.125 2023-06-23 17:31:32,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1538922.0, ans=0.125 2023-06-23 17:31:35,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538982.0, ans=0.1 2023-06-23 17:31:42,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1538982.0, ans=0.025 2023-06-23 17:31:57,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1539042.0, ans=0.1 2023-06-23 17:31:58,478 INFO [train.py:996] (1/4) Epoch 9, batch 12550, loss[loss=0.2235, simple_loss=0.3528, pruned_loss=0.04711, over 20781.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3281, pruned_loss=0.08607, over 4280797.93 frames. ], batch size: 608, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:32:25,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-23 17:33:17,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1539222.0, ans=0.125 2023-06-23 17:33:24,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-23 17:33:29,867 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-23 17:33:45,023 INFO [train.py:996] (1/4) Epoch 9, batch 12600, loss[loss=0.2444, simple_loss=0.3472, pruned_loss=0.07076, over 21225.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3274, pruned_loss=0.08422, over 4274660.17 frames. ], batch size: 549, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:34:01,991 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:34:03,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.580e+02 5.911e+02 8.328e+02 1.277e+03 2.400e+03, threshold=1.666e+03, percent-clipped=14.0 2023-06-23 17:34:05,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1539402.0, ans=0.0 2023-06-23 17:34:54,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1539522.0, ans=0.0 2023-06-23 17:35:10,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-23 17:35:25,037 INFO [train.py:996] (1/4) Epoch 9, batch 12650, loss[loss=0.2118, simple_loss=0.301, pruned_loss=0.06133, over 21640.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3182, pruned_loss=0.07929, over 4269437.88 frames. 
], batch size: 389, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:35:36,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1539642.0, ans=0.0 2023-06-23 17:35:38,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-23 17:36:04,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1539762.0, ans=0.0 2023-06-23 17:36:50,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1539882.0, ans=0.035 2023-06-23 17:36:52,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1539882.0, ans=0.125 2023-06-23 17:37:05,732 INFO [train.py:996] (1/4) Epoch 9, batch 12700, loss[loss=0.2545, simple_loss=0.3263, pruned_loss=0.09134, over 21452.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3184, pruned_loss=0.08155, over 4275036.26 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:37:23,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.729e+02 5.436e+02 7.219e+02 1.107e+03 2.161e+03, threshold=1.444e+03, percent-clipped=5.0 2023-06-23 17:37:27,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540002.0, ans=0.0 2023-06-23 17:37:38,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-23 17:37:47,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540062.0, ans=0.1 2023-06-23 17:38:00,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1540062.0, ans=0.125 2023-06-23 17:38:11,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540122.0, ans=0.0 2023-06-23 17:38:46,111 INFO [train.py:996] (1/4) Epoch 9, batch 12750, loss[loss=0.2472, simple_loss=0.3336, pruned_loss=0.08037, over 21361.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3206, pruned_loss=0.08285, over 4271453.20 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:38:47,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-23 17:38:58,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1540242.0, ans=0.125 2023-06-23 17:39:07,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540302.0, ans=0.1 2023-06-23 17:39:38,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1540362.0, ans=0.125 2023-06-23 17:40:17,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. 
limit=15.0 2023-06-23 17:40:26,702 INFO [train.py:996] (1/4) Epoch 9, batch 12800, loss[loss=0.2846, simple_loss=0.3395, pruned_loss=0.1148, over 21741.00 frames. ], tot_loss[loss=0.244, simple_loss=0.32, pruned_loss=0.084, over 4278044.25 frames. ], batch size: 508, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:40:51,648 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.701e+02 5.271e+02 6.331e+02 9.056e+02 1.664e+03, threshold=1.266e+03, percent-clipped=3.0 2023-06-23 17:40:57,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540602.0, ans=0.125 2023-06-23 17:41:09,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-23 17:41:13,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-23 17:41:30,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1540722.0, ans=0.0 2023-06-23 17:41:47,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1540782.0, ans=0.125 2023-06-23 17:42:08,031 INFO [train.py:996] (1/4) Epoch 9, batch 12850, loss[loss=0.2066, simple_loss=0.3133, pruned_loss=0.04997, over 21218.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3228, pruned_loss=0.08529, over 4278954.33 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:42:55,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1540962.0, ans=0.0 2023-06-23 17:42:56,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1540962.0, ans=0.0 2023-06-23 17:43:01,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1540962.0, ans=0.125 2023-06-23 17:43:52,485 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-23 17:43:54,699 INFO [train.py:996] (1/4) Epoch 9, batch 12900, loss[loss=0.1904, simple_loss=0.2667, pruned_loss=0.05704, over 21222.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3181, pruned_loss=0.08092, over 4282262.73 frames. ], batch size: 159, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:44:24,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.349e+02 7.787e+02 1.135e+03 3.186e+03, threshold=1.557e+03, percent-clipped=18.0 2023-06-23 17:45:20,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1541382.0, ans=0.0 2023-06-23 17:45:21,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-23 17:45:41,772 INFO [train.py:996] (1/4) Epoch 9, batch 12950, loss[loss=0.2397, simple_loss=0.316, pruned_loss=0.08171, over 21491.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3164, pruned_loss=0.07916, over 4280101.52 frames. 
], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:46:16,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1541502.0, ans=0.07 2023-06-23 17:46:58,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.70 vs. limit=10.0 2023-06-23 17:47:27,847 INFO [train.py:996] (1/4) Epoch 9, batch 13000, loss[loss=0.1855, simple_loss=0.264, pruned_loss=0.05353, over 21341.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3195, pruned_loss=0.07956, over 4272009.54 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:47:46,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.869e+02 8.632e+02 1.298e+03 2.714e+03, threshold=1.726e+03, percent-clipped=15.0 2023-06-23 17:47:56,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1541802.0, ans=0.125 2023-06-23 17:48:24,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1541922.0, ans=0.2 2023-06-23 17:49:01,582 INFO [train.py:996] (1/4) Epoch 9, batch 13050, loss[loss=0.2178, simple_loss=0.2836, pruned_loss=0.07603, over 21800.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3122, pruned_loss=0.07649, over 4273875.25 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:49:03,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1542042.0, ans=0.2 2023-06-23 17:49:31,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1542102.0, ans=0.2 2023-06-23 17:49:40,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1542162.0, ans=0.0 2023-06-23 17:49:56,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-23 17:50:00,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1542222.0, ans=0.125 2023-06-23 17:50:22,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542282.0, ans=0.125 2023-06-23 17:50:46,382 INFO [train.py:996] (1/4) Epoch 9, batch 13100, loss[loss=0.2588, simple_loss=0.342, pruned_loss=0.0878, over 21756.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.314, pruned_loss=0.07737, over 4283906.14 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:51:06,465 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 5.747e+02 7.827e+02 1.039e+03 1.771e+03, threshold=1.565e+03, percent-clipped=1.0 2023-06-23 17:51:51,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542522.0, ans=0.125 2023-06-23 17:52:01,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1542522.0, ans=0.125 2023-06-23 17:52:16,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-23 17:52:26,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.11 vs. limit=10.0 2023-06-23 17:52:28,879 INFO [train.py:996] (1/4) Epoch 9, batch 13150, loss[loss=0.2439, simple_loss=0.307, pruned_loss=0.09038, over 21777.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3149, pruned_loss=0.07952, over 4280800.45 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:52:35,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1542642.0, ans=0.125 2023-06-23 17:53:30,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1542762.0, ans=0.09899494936611666 2023-06-23 17:54:10,310 INFO [train.py:996] (1/4) Epoch 9, batch 13200, loss[loss=0.2774, simple_loss=0.345, pruned_loss=0.1049, over 21566.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3132, pruned_loss=0.07971, over 4278941.11 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:54:32,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1543002.0, ans=0.0 2023-06-23 17:54:33,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.479e+02 5.951e+02 7.570e+02 1.042e+03 3.191e+03, threshold=1.514e+03, percent-clipped=13.0 2023-06-23 17:55:11,779 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:55:34,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1543182.0, ans=0.2 2023-06-23 17:55:50,017 INFO [train.py:996] (1/4) Epoch 9, batch 13250, loss[loss=0.2571, simple_loss=0.3363, pruned_loss=0.08897, over 21757.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.314, pruned_loss=0.08117, over 4277106.58 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:56:08,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1543242.0, ans=0.125 2023-06-23 17:56:08,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1543242.0, ans=0.04949747468305833 2023-06-23 17:56:10,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1543302.0, ans=0.0 2023-06-23 17:56:52,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-23 17:56:58,580 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-23 17:57:01,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1543422.0, ans=0.125 2023-06-23 17:57:11,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-23 17:57:36,354 INFO [train.py:996] (1/4) Epoch 9, batch 13300, loss[loss=0.2311, simple_loss=0.3102, pruned_loss=0.07599, over 21492.00 frames. ], tot_loss[loss=0.24, simple_loss=0.317, pruned_loss=0.08146, over 4276109.20 frames. 
], batch size: 211, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:58:08,192 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.720e+02 5.402e+02 7.318e+02 1.029e+03 1.964e+03, threshold=1.464e+03, percent-clipped=5.0 2023-06-23 17:58:10,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-23 17:59:18,319 INFO [train.py:996] (1/4) Epoch 9, batch 13350, loss[loss=0.2758, simple_loss=0.3579, pruned_loss=0.09686, over 21620.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.322, pruned_loss=0.08384, over 4276788.06 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:59:19,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-23 17:59:41,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1543842.0, ans=0.125 2023-06-23 17:59:43,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.32 vs. limit=15.0 2023-06-23 17:59:53,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1543902.0, ans=0.1 2023-06-23 17:59:54,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.18 vs. limit=22.5 2023-06-23 18:00:24,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-23 18:00:28,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1544022.0, ans=0.125 2023-06-23 18:00:57,216 INFO [train.py:996] (1/4) Epoch 9, batch 13400, loss[loss=0.2548, simple_loss=0.3092, pruned_loss=0.1002, over 21851.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.323, pruned_loss=0.08498, over 4282243.40 frames. ], batch size: 98, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:01:15,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1544142.0, ans=0.125 2023-06-23 18:01:35,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 6.088e+02 8.910e+02 1.105e+03 2.382e+03, threshold=1.782e+03, percent-clipped=11.0 2023-06-23 18:02:09,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1544322.0, ans=0.125 2023-06-23 18:02:22,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1544382.0, ans=0.2 2023-06-23 18:02:50,802 INFO [train.py:996] (1/4) Epoch 9, batch 13450, loss[loss=0.2537, simple_loss=0.3366, pruned_loss=0.08539, over 21414.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3231, pruned_loss=0.0868, over 4285590.86 frames. 
], batch size: 131, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:03:13,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1544502.0, ans=0.5 2023-06-23 18:03:53,104 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-23 18:04:07,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1544622.0, ans=0.2 2023-06-23 18:04:30,775 INFO [train.py:996] (1/4) Epoch 9, batch 13500, loss[loss=0.2866, simple_loss=0.358, pruned_loss=0.1076, over 21469.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3132, pruned_loss=0.08359, over 4268954.54 frames. ], batch size: 509, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:04:39,518 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:04:39,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1544742.0, ans=0.0 2023-06-23 18:04:43,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1544742.0, ans=0.0 2023-06-23 18:04:54,273 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.791e+02 5.275e+02 7.519e+02 1.324e+03 2.778e+03, threshold=1.504e+03, percent-clipped=14.0 2023-06-23 18:04:56,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1544802.0, ans=0.125 2023-06-23 18:06:13,533 INFO [train.py:996] (1/4) Epoch 9, batch 13550, loss[loss=0.2923, simple_loss=0.3969, pruned_loss=0.09388, over 21272.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3153, pruned_loss=0.08194, over 4268003.87 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:06:39,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1545102.0, ans=0.125 2023-06-23 18:06:39,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-23 18:06:53,745 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-23 18:07:05,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-23 18:07:33,643 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:07:55,236 INFO [train.py:996] (1/4) Epoch 9, batch 13600, loss[loss=0.2251, simple_loss=0.2978, pruned_loss=0.07624, over 21250.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.319, pruned_loss=0.08296, over 4272557.86 frames. 
], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:07:55,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545342.0, ans=0.1 2023-06-23 18:08:02,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545342.0, ans=0.1 2023-06-23 18:08:18,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.790e+02 6.429e+02 9.165e+02 1.553e+03 3.162e+03, threshold=1.833e+03, percent-clipped=25.0 2023-06-23 18:09:30,321 INFO [train.py:996] (1/4) Epoch 9, batch 13650, loss[loss=0.1908, simple_loss=0.2553, pruned_loss=0.06316, over 21542.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3129, pruned_loss=0.07977, over 4273363.25 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:10:09,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=22.5 2023-06-23 18:10:45,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545822.0, ans=0.1 2023-06-23 18:10:48,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1545822.0, ans=0.125 2023-06-23 18:11:04,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1545882.0, ans=0.0 2023-06-23 18:11:07,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1545882.0, ans=0.125 2023-06-23 18:11:13,784 INFO [train.py:996] (1/4) Epoch 9, batch 13700, loss[loss=0.247, simple_loss=0.3323, pruned_loss=0.08091, over 21620.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3098, pruned_loss=0.07958, over 4259692.33 frames. ], batch size: 389, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:11:41,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.658e+02 5.677e+02 7.972e+02 1.070e+03 2.613e+03, threshold=1.594e+03, percent-clipped=4.0 2023-06-23 18:11:42,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1546002.0, ans=0.07 2023-06-23 18:11:46,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1546002.0, ans=0.0 2023-06-23 18:12:32,256 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=12.0 2023-06-23 18:12:51,930 INFO [train.py:996] (1/4) Epoch 9, batch 13750, loss[loss=0.1887, simple_loss=0.238, pruned_loss=0.06968, over 21193.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3096, pruned_loss=0.0796, over 4256062.70 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:13:19,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2023-06-23 18:13:41,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1546362.0, ans=0.125 2023-06-23 18:13:43,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=22.5 2023-06-23 18:13:48,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546362.0, ans=0.1 2023-06-23 18:14:05,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1546422.0, ans=12.0 2023-06-23 18:14:33,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1546482.0, ans=0.0 2023-06-23 18:14:35,715 INFO [train.py:996] (1/4) Epoch 9, batch 13800, loss[loss=0.2813, simple_loss=0.3982, pruned_loss=0.08222, over 21236.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3149, pruned_loss=0.07917, over 4255221.83 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:15:00,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-23 18:15:01,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1546542.0, ans=0.125 2023-06-23 18:15:16,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.824e+02 9.603e+02 1.417e+03 3.093e+03, threshold=1.921e+03, percent-clipped=19.0 2023-06-23 18:15:17,406 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-23 18:15:20,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1546602.0, ans=0.1 2023-06-23 18:16:21,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1546782.0, ans=0.2 2023-06-23 18:16:23,589 INFO [train.py:996] (1/4) Epoch 9, batch 13850, loss[loss=0.3483, simple_loss=0.4126, pruned_loss=0.142, over 21502.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3222, pruned_loss=0.07994, over 4258414.13 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:17:53,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.37 vs. limit=22.5 2023-06-23 18:17:57,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1547082.0, ans=0.125 2023-06-23 18:18:14,860 INFO [train.py:996] (1/4) Epoch 9, batch 13900, loss[loss=0.2594, simple_loss=0.3262, pruned_loss=0.09635, over 21438.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3253, pruned_loss=0.08371, over 4268142.93 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:18:30,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1547202.0, ans=0.0 2023-06-23 18:18:41,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.964e+02 6.028e+02 8.450e+02 1.187e+03 2.483e+03, threshold=1.690e+03, percent-clipped=4.0 2023-06-23 18:18:53,945 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=15.0 2023-06-23 18:19:32,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1547382.0, ans=0.125 2023-06-23 18:19:49,971 INFO [train.py:996] (1/4) Epoch 9, batch 13950, loss[loss=0.2374, simple_loss=0.3081, pruned_loss=0.08331, over 21923.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3251, pruned_loss=0.08609, over 4279520.39 frames. ], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:20:13,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1547502.0, ans=15.0 2023-06-23 18:20:27,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1547562.0, ans=0.0 2023-06-23 18:20:28,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1547562.0, ans=0.0 2023-06-23 18:20:45,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1547622.0, ans=10.0 2023-06-23 18:21:29,640 INFO [train.py:996] (1/4) Epoch 9, batch 14000, loss[loss=0.212, simple_loss=0.2883, pruned_loss=0.06779, over 21643.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3199, pruned_loss=0.0831, over 4269076.82 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 32.0 2023-06-23 18:21:50,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1547802.0, ans=0.0 2023-06-23 18:21:56,239 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 5.065e+02 8.726e+02 1.240e+03 2.803e+03, threshold=1.745e+03, percent-clipped=8.0 2023-06-23 18:22:21,360 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:22:37,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1547922.0, ans=0.2 2023-06-23 18:22:41,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-23 18:23:03,121 INFO [train.py:996] (1/4) Epoch 9, batch 14050, loss[loss=0.1849, simple_loss=0.2592, pruned_loss=0.05525, over 16503.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3136, pruned_loss=0.07883, over 4267464.41 frames. ], batch size: 63, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:23:11,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1548042.0, ans=0.125 2023-06-23 18:23:24,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1548102.0, ans=0.125 2023-06-23 18:23:42,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1548162.0, ans=0.2 2023-06-23 18:24:20,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1548282.0, ans=0.025 2023-06-23 18:24:42,573 INFO [train.py:996] (1/4) Epoch 9, batch 14100, loss[loss=0.2063, simple_loss=0.2724, pruned_loss=0.07011, over 15755.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3068, pruned_loss=0.07819, over 4260814.93 frames. 
], batch size: 60, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:25:10,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 6.298e+02 9.143e+02 1.408e+03 2.663e+03, threshold=1.829e+03, percent-clipped=10.0 2023-06-23 18:25:30,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1548462.0, ans=0.0 2023-06-23 18:26:09,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1548642.0, ans=0.2 2023-06-23 18:26:10,613 INFO [train.py:996] (1/4) Epoch 9, batch 14150, loss[loss=0.2374, simple_loss=0.321, pruned_loss=0.07691, over 21364.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3118, pruned_loss=0.07957, over 4263906.00 frames. ], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:26:12,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1548642.0, ans=0.0 2023-06-23 18:26:38,469 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=12.0 2023-06-23 18:26:39,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1548702.0, ans=0.125 2023-06-23 18:27:36,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1548882.0, ans=0.125 2023-06-23 18:27:44,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1548882.0, ans=0.2 2023-06-23 18:27:48,778 INFO [train.py:996] (1/4) Epoch 9, batch 14200, loss[loss=0.2288, simple_loss=0.2951, pruned_loss=0.08122, over 21591.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3113, pruned_loss=0.0788, over 4265093.65 frames. ], batch size: 391, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:28:22,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.420e+02 7.650e+02 1.190e+03 2.098e+03, threshold=1.530e+03, percent-clipped=4.0 2023-06-23 18:29:20,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1549182.0, ans=0.2 2023-06-23 18:29:25,138 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:29:27,865 INFO [train.py:996] (1/4) Epoch 9, batch 14250, loss[loss=0.2003, simple_loss=0.2634, pruned_loss=0.06867, over 21202.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3061, pruned_loss=0.07877, over 4248594.78 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:29:28,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1549242.0, ans=0.0 2023-06-23 18:30:21,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1549362.0, ans=10.0 2023-06-23 18:30:28,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1549362.0, ans=0.0 2023-06-23 18:30:33,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-23 18:30:45,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1549422.0, ans=0.125 2023-06-23 18:30:50,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1549422.0, ans=0.125 2023-06-23 18:31:09,756 INFO [train.py:996] (1/4) Epoch 9, batch 14300, loss[loss=0.3406, simple_loss=0.4338, pruned_loss=0.1237, over 21606.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.308, pruned_loss=0.07855, over 4259939.01 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:31:10,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1549542.0, ans=0.125 2023-06-23 18:31:14,988 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:31:36,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549602.0, ans=0.1 2023-06-23 18:31:49,043 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 4.681e+02 6.476e+02 1.240e+03 3.295e+03, threshold=1.295e+03, percent-clipped=18.0 2023-06-23 18:32:12,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1549722.0, ans=0.0 2023-06-23 18:32:49,941 INFO [train.py:996] (1/4) Epoch 9, batch 14350, loss[loss=0.2254, simple_loss=0.3116, pruned_loss=0.06964, over 21453.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3111, pruned_loss=0.078, over 4257217.11 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:33:27,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1549902.0, ans=0.125 2023-06-23 18:33:36,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1549962.0, ans=0.04949747468305833 2023-06-23 18:34:27,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1550082.0, ans=0.1 2023-06-23 18:34:34,672 INFO [train.py:996] (1/4) Epoch 9, batch 14400, loss[loss=0.2311, simple_loss=0.3023, pruned_loss=0.07991, over 21820.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3083, pruned_loss=0.07896, over 4251560.83 frames. ], batch size: 351, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:35:01,973 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:35:09,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.844e+02 6.439e+02 1.111e+03 2.671e+03, threshold=1.288e+03, percent-clipped=19.0 2023-06-23 18:35:16,873 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-06-23 18:35:24,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.47 vs. 
limit=22.5 2023-06-23 18:35:28,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1550262.0, ans=0.1 2023-06-23 18:35:55,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1550382.0, ans=0.125 2023-06-23 18:36:07,900 INFO [train.py:996] (1/4) Epoch 9, batch 14450, loss[loss=0.1932, simple_loss=0.2598, pruned_loss=0.06332, over 21801.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3027, pruned_loss=0.079, over 4253859.30 frames. ], batch size: 283, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:36:19,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1550442.0, ans=0.125 2023-06-23 18:36:25,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1550442.0, ans=0.2 2023-06-23 18:36:35,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-23 18:37:05,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1550622.0, ans=0.125 2023-06-23 18:37:43,337 INFO [train.py:996] (1/4) Epoch 9, batch 14500, loss[loss=0.208, simple_loss=0.2809, pruned_loss=0.06758, over 22018.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3005, pruned_loss=0.07931, over 4259391.35 frames. ], batch size: 103, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:38:12,982 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:38:23,482 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 5.208e+02 6.817e+02 8.713e+02 1.535e+03, threshold=1.363e+03, percent-clipped=1.0 2023-06-23 18:38:49,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1550922.0, ans=0.0 2023-06-23 18:39:27,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 18:39:29,524 INFO [train.py:996] (1/4) Epoch 9, batch 14550, loss[loss=0.2686, simple_loss=0.3435, pruned_loss=0.09681, over 21297.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3046, pruned_loss=0.08091, over 4259390.34 frames. 
], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:39:39,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1551042.0, ans=0.125 2023-06-23 18:39:46,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1551042.0, ans=0.125 2023-06-23 18:39:48,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1551042.0, ans=0.125 2023-06-23 18:39:56,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1551102.0, ans=0.0 2023-06-23 18:40:36,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1551222.0, ans=0.0 2023-06-23 18:40:50,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1551282.0, ans=0.125 2023-06-23 18:40:50,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1551282.0, ans=0.0 2023-06-23 18:41:10,903 INFO [train.py:996] (1/4) Epoch 9, batch 14600, loss[loss=0.259, simple_loss=0.3297, pruned_loss=0.09419, over 21662.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3127, pruned_loss=0.0842, over 4261681.16 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:41:26,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1551342.0, ans=0.0 2023-06-23 18:41:42,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.404e+02 6.083e+02 8.730e+02 1.243e+03 2.471e+03, threshold=1.746e+03, percent-clipped=17.0 2023-06-23 18:42:25,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1551582.0, ans=0.5 2023-06-23 18:42:38,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=12.0 2023-06-23 18:42:45,959 INFO [train.py:996] (1/4) Epoch 9, batch 14650, loss[loss=0.1701, simple_loss=0.257, pruned_loss=0.04156, over 21622.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3167, pruned_loss=0.08377, over 4259496.84 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:42:52,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1551642.0, ans=0.04949747468305833 2023-06-23 18:43:09,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1551702.0, ans=0.2 2023-06-23 18:43:47,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1551822.0, ans=0.2 2023-06-23 18:44:04,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1551882.0, ans=0.0 2023-06-23 18:44:21,216 INFO [train.py:996] (1/4) Epoch 9, batch 14700, loss[loss=0.2031, simple_loss=0.2876, pruned_loss=0.05924, over 21233.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3103, pruned_loss=0.0781, over 4257131.90 frames. 
], batch size: 159, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:44:31,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1551942.0, ans=0.125 2023-06-23 18:44:33,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1551942.0, ans=0.05 2023-06-23 18:44:39,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1551942.0, ans=0.0 2023-06-23 18:44:54,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-23 18:44:58,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1552002.0, ans=0.0 2023-06-23 18:44:59,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 5.196e+02 7.542e+02 1.109e+03 2.941e+03, threshold=1.508e+03, percent-clipped=7.0 2023-06-23 18:45:06,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1552062.0, ans=0.04949747468305833 2023-06-23 18:45:20,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1552062.0, ans=0.2 2023-06-23 18:45:29,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1552122.0, ans=0.0 2023-06-23 18:46:08,870 INFO [train.py:996] (1/4) Epoch 9, batch 14750, loss[loss=0.2119, simple_loss=0.2987, pruned_loss=0.06255, over 19936.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3155, pruned_loss=0.08103, over 4261597.46 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:46:14,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1552242.0, ans=0.09899494936611666 2023-06-23 18:46:51,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.15 vs. limit=10.0 2023-06-23 18:46:52,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1552362.0, ans=0.125 2023-06-23 18:47:25,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1552482.0, ans=0.1 2023-06-23 18:47:26,069 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-23 18:47:45,576 INFO [train.py:996] (1/4) Epoch 9, batch 14800, loss[loss=0.2539, simple_loss=0.3066, pruned_loss=0.1006, over 21299.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3278, pruned_loss=0.08634, over 4267294.50 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:48:16,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.500e+02 8.733e+02 1.311e+03 2.731e+03, threshold=1.747e+03, percent-clipped=18.0 2023-06-23 18:48:48,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1552722.0, ans=0.125 2023-06-23 18:49:32,209 INFO [train.py:996] (1/4) Epoch 9, batch 14850, loss[loss=0.218, simple_loss=0.2782, pruned_loss=0.07895, over 21258.00 frames. 
], tot_loss[loss=0.2469, simple_loss=0.3218, pruned_loss=0.08604, over 4266900.23 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:49:44,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-23 18:50:09,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1552962.0, ans=0.125 2023-06-23 18:50:42,531 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-23 18:50:54,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1553082.0, ans=0.125 2023-06-23 18:51:15,506 INFO [train.py:996] (1/4) Epoch 9, batch 14900, loss[loss=0.2652, simple_loss=0.3301, pruned_loss=0.1001, over 21434.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3256, pruned_loss=0.08827, over 4262093.23 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:51:37,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1553202.0, ans=0.125 2023-06-23 18:51:39,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1553202.0, ans=0.125 2023-06-23 18:51:46,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-23 18:51:54,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.568e+02 9.380e+02 1.428e+03 3.360e+03, threshold=1.876e+03, percent-clipped=13.0 2023-06-23 18:52:06,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1553262.0, ans=0.125 2023-06-23 18:52:53,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-23 18:52:55,919 INFO [train.py:996] (1/4) Epoch 9, batch 14950, loss[loss=0.2216, simple_loss=0.3018, pruned_loss=0.07067, over 21295.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3253, pruned_loss=0.08667, over 4261785.48 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:53:24,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1553502.0, ans=0.0 2023-06-23 18:53:49,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1553562.0, ans=0.0 2023-06-23 18:54:23,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1553682.0, ans=0.125 2023-06-23 18:54:37,628 INFO [train.py:996] (1/4) Epoch 9, batch 15000, loss[loss=0.258, simple_loss=0.3305, pruned_loss=0.09279, over 21382.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3287, pruned_loss=0.08905, over 4266820.84 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:54:37,629 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 18:54:58,176 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2574, simple_loss=0.352, pruned_loss=0.08137, over 1796401.00 frames. 
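The loss and gradient-clipping figures logged in this section follow two fixed relationships that can be checked directly from the entries themselves. Each tot_loss value matches pruned_loss plus half of simple_loss (a simple_loss weight of 0.5, inferred here from the logged numbers rather than taken from the training code), and each threshold reported by optim.py matches the logged Clipping_scale=2.0 times the third of the five grad-norm summary values (which read as min/25%/median/75%/max). The sketch below is illustrative only; combine_losses and clip_threshold are hypothetical helper names, not functions from this codebase.

    from typing import Sequence

    # Assumption: tot_loss = simple_loss_scale * simple_loss + pruned_loss,
    # with simple_loss_scale = 0.5, as suggested by the logged values.
    def combine_losses(simple_loss: float, pruned_loss: float,
                       simple_loss_scale: float = 0.5) -> float:
        """Weighted sum that reproduces the tot_loss entries above."""
        return simple_loss_scale * simple_loss + pruned_loss

    # Assumption: threshold = clipping_scale * median grad-norm, where the
    # five logged values are read as (min, 25%, median, 75%, max).
    def clip_threshold(quartiles: Sequence[float],
                       clipping_scale: float = 2.0) -> float:
        """Clipping threshold that reproduces the optim.py entries above."""
        return clipping_scale * quartiles[2]

    # Values taken from the Epoch 9, batch 15000 validation entry above:
    assert abs(combine_losses(0.352, 0.08137) - 0.2574) < 1e-3
    # Values from the 18:55:32 optim.py entry above:
    assert abs(clip_threshold([3.760e2, 5.829e2, 9.207e2, 1.364e3, 3.991e3])
               - 1.841e3) < 1.0

The same pattern holds for the other entries in this section, e.g. the threshold of 1.504e+03 logged at 17:23:45 is twice the 7.519e+02 median reported in that line.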
2023-06-23 18:54:58,177 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 18:55:07,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1553742.0, ans=0.125 2023-06-23 18:55:17,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1553742.0, ans=0.1 2023-06-23 18:55:32,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.829e+02 9.207e+02 1.364e+03 3.991e+03, threshold=1.841e+03, percent-clipped=17.0 2023-06-23 18:55:37,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-23 18:55:42,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1553862.0, ans=0.0 2023-06-23 18:55:42,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1553862.0, ans=0.125 2023-06-23 18:56:06,471 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2023-06-23 18:56:28,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1553982.0, ans=0.2 2023-06-23 18:56:39,797 INFO [train.py:996] (1/4) Epoch 9, batch 15050, loss[loss=0.3022, simple_loss=0.3866, pruned_loss=0.1089, over 21526.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3292, pruned_loss=0.08945, over 4269893.19 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:57:21,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1554162.0, ans=0.125 2023-06-23 18:57:59,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1554282.0, ans=0.125 2023-06-23 18:58:09,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1554282.0, ans=0.1 2023-06-23 18:58:16,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1554282.0, ans=0.2 2023-06-23 18:58:20,519 INFO [train.py:996] (1/4) Epoch 9, batch 15100, loss[loss=0.275, simple_loss=0.3445, pruned_loss=0.1027, over 21242.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3312, pruned_loss=0.0888, over 4268861.69 frames. 
], batch size: 143, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:58:37,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1554342.0, ans=10.0 2023-06-23 18:58:38,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1554342.0, ans=0.1 2023-06-23 18:58:59,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.498e+02 7.540e+02 1.313e+03 2.793e+03, threshold=1.508e+03, percent-clipped=8.0 2023-06-23 18:59:18,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1554522.0, ans=0.2 2023-06-23 18:59:23,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1554522.0, ans=0.1 2023-06-23 19:00:04,653 INFO [train.py:996] (1/4) Epoch 9, batch 15150, loss[loss=0.2733, simple_loss=0.3543, pruned_loss=0.09614, over 19928.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3273, pruned_loss=0.08939, over 4267946.88 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 19:00:17,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-23 19:00:39,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1554702.0, ans=0.125 2023-06-23 19:01:45,749 INFO [train.py:996] (1/4) Epoch 9, batch 15200, loss[loss=0.1821, simple_loss=0.2543, pruned_loss=0.05496, over 21565.00 frames. ], tot_loss[loss=0.244, simple_loss=0.318, pruned_loss=0.085, over 4262652.96 frames. ], batch size: 195, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:02:19,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 6.669e+02 9.281e+02 1.408e+03 4.015e+03, threshold=1.856e+03, percent-clipped=19.0 2023-06-23 19:02:21,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1555062.0, ans=0.125 2023-06-23 19:02:32,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-23 19:03:04,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-23 19:03:13,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1555182.0, ans=0.125 2023-06-23 19:03:27,235 INFO [train.py:996] (1/4) Epoch 9, batch 15250, loss[loss=0.1961, simple_loss=0.2665, pruned_loss=0.06289, over 21634.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3118, pruned_loss=0.08303, over 4264649.76 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:03:49,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1555302.0, ans=0.1 2023-06-23 19:03:53,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=10.0 2023-06-23 19:04:05,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1555362.0, ans=0.0 2023-06-23 19:04:23,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1555422.0, ans=0.05 2023-06-23 19:04:32,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=12.0 2023-06-23 19:05:06,546 INFO [train.py:996] (1/4) Epoch 9, batch 15300, loss[loss=0.2384, simple_loss=0.3169, pruned_loss=0.07997, over 21143.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3128, pruned_loss=0.08429, over 4266794.28 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:05:41,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 5.913e+02 8.241e+02 1.222e+03 2.288e+03, threshold=1.648e+03, percent-clipped=6.0 2023-06-23 19:05:56,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1555662.0, ans=0.1 2023-06-23 19:06:52,451 INFO [train.py:996] (1/4) Epoch 9, batch 15350, loss[loss=0.2453, simple_loss=0.3359, pruned_loss=0.07734, over 21699.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3179, pruned_loss=0.08741, over 4267329.21 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:07:16,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1555902.0, ans=10.0 2023-06-23 19:07:40,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1555962.0, ans=0.2 2023-06-23 19:08:26,435 INFO [train.py:996] (1/4) Epoch 9, batch 15400, loss[loss=0.2033, simple_loss=0.2806, pruned_loss=0.06301, over 21527.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3186, pruned_loss=0.08589, over 4260604.21 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:08:44,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-23 19:08:58,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 6.075e+02 7.840e+02 1.049e+03 1.941e+03, threshold=1.568e+03, percent-clipped=4.0 2023-06-23 19:09:29,794 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:09:36,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1556322.0, ans=0.1 2023-06-23 19:10:04,884 INFO [train.py:996] (1/4) Epoch 9, batch 15450, loss[loss=0.3018, simple_loss=0.3746, pruned_loss=0.1145, over 21590.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.317, pruned_loss=0.08574, over 4269263.61 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:10:05,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1556442.0, ans=0.2 2023-06-23 19:10:12,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1556442.0, ans=0.2 2023-06-23 19:10:20,930 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. 
limit=10.0 2023-06-23 19:10:21,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-23 19:10:44,748 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:11:46,673 INFO [train.py:996] (1/4) Epoch 9, batch 15500, loss[loss=0.2792, simple_loss=0.3521, pruned_loss=0.1032, over 21300.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3202, pruned_loss=0.08476, over 4259110.31 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:11:59,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1556742.0, ans=0.125 2023-06-23 19:12:03,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1556742.0, ans=0.0 2023-06-23 19:12:17,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1556802.0, ans=0.0 2023-06-23 19:12:24,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1556802.0, ans=0.125 2023-06-23 19:12:26,945 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.144e+02 7.122e+02 1.016e+03 2.468e+03, threshold=1.424e+03, percent-clipped=4.0 2023-06-23 19:13:21,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1556982.0, ans=0.1 2023-06-23 19:13:34,010 INFO [train.py:996] (1/4) Epoch 9, batch 15550, loss[loss=0.1925, simple_loss=0.2602, pruned_loss=0.06239, over 21202.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3175, pruned_loss=0.08269, over 4258161.99 frames. ], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:13:34,469 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:13:48,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1557102.0, ans=0.125 2023-06-23 19:13:52,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1557102.0, ans=0.0 2023-06-23 19:13:54,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-23 19:14:37,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1557222.0, ans=0.0 2023-06-23 19:15:10,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1557282.0, ans=0.0 2023-06-23 19:15:14,773 INFO [train.py:996] (1/4) Epoch 9, batch 15600, loss[loss=0.2107, simple_loss=0.2757, pruned_loss=0.07286, over 21758.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3107, pruned_loss=0.08073, over 4268988.89 frames. 
], batch size: 112, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:15:18,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1557342.0, ans=0.0 2023-06-23 19:15:30,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1557402.0, ans=0.0 2023-06-23 19:15:49,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 5.285e+02 6.916e+02 1.084e+03 2.169e+03, threshold=1.383e+03, percent-clipped=9.0 2023-06-23 19:16:12,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1557462.0, ans=0.125 2023-06-23 19:16:55,698 INFO [train.py:996] (1/4) Epoch 9, batch 15650, loss[loss=0.2746, simple_loss=0.3161, pruned_loss=0.1165, over 21350.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3099, pruned_loss=0.08002, over 4271435.97 frames. ], batch size: 508, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:17:45,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1557762.0, ans=0.1 2023-06-23 19:17:51,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1557762.0, ans=0.125 2023-06-23 19:18:15,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1557882.0, ans=0.125 2023-06-23 19:18:31,508 INFO [train.py:996] (1/4) Epoch 9, batch 15700, loss[loss=0.2069, simple_loss=0.2817, pruned_loss=0.06601, over 21612.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3067, pruned_loss=0.07963, over 4274737.67 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:18:32,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-23 19:18:48,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1557942.0, ans=10.0 2023-06-23 19:18:53,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1558002.0, ans=0.125 2023-06-23 19:19:04,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1558002.0, ans=0.0 2023-06-23 19:19:07,042 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.566e+02 5.534e+02 7.672e+02 1.120e+03 2.103e+03, threshold=1.534e+03, percent-clipped=13.0 2023-06-23 19:19:32,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1558062.0, ans=0.0 2023-06-23 19:19:54,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1558182.0, ans=0.125 2023-06-23 19:19:54,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.61 vs. limit=10.0 2023-06-23 19:20:09,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1558242.0, ans=0.5 2023-06-23 19:20:11,357 INFO [train.py:996] (1/4) Epoch 9, batch 15750, loss[loss=0.2938, simple_loss=0.3418, pruned_loss=0.1229, over 21373.00 frames. 
], tot_loss[loss=0.2308, simple_loss=0.3028, pruned_loss=0.07936, over 4269781.04 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:21:41,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1558482.0, ans=0.0 2023-06-23 19:21:45,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-23 19:21:50,641 INFO [train.py:996] (1/4) Epoch 9, batch 15800, loss[loss=0.2691, simple_loss=0.3064, pruned_loss=0.1159, over 21514.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2997, pruned_loss=0.07975, over 4264252.38 frames. ], batch size: 512, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:21:54,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1558542.0, ans=0.125 2023-06-23 19:22:26,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.286e+02 6.867e+02 8.896e+02 1.872e+03, threshold=1.373e+03, percent-clipped=1.0 2023-06-23 19:22:46,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1558662.0, ans=0.1 2023-06-23 19:22:59,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1558722.0, ans=0.1 2023-06-23 19:23:18,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1558782.0, ans=0.0 2023-06-23 19:23:30,824 INFO [train.py:996] (1/4) Epoch 9, batch 15850, loss[loss=0.3192, simple_loss=0.3653, pruned_loss=0.1365, over 21411.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3017, pruned_loss=0.08192, over 4263037.27 frames. ], batch size: 510, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:25:10,614 INFO [train.py:996] (1/4) Epoch 9, batch 15900, loss[loss=0.2237, simple_loss=0.3098, pruned_loss=0.06885, over 21505.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3001, pruned_loss=0.08236, over 4252637.29 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:25:11,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1559142.0, ans=0.0 2023-06-23 19:25:15,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1559142.0, ans=0.125 2023-06-23 19:25:46,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-23 19:25:46,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.755e+02 5.068e+02 6.374e+02 9.133e+02 1.940e+03, threshold=1.275e+03, percent-clipped=6.0 2023-06-23 19:26:27,755 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-23 19:26:31,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-23 19:26:51,993 INFO [train.py:996] (1/4) Epoch 9, batch 15950, loss[loss=0.1797, simple_loss=0.2788, pruned_loss=0.04024, over 21796.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3014, pruned_loss=0.07919, over 4254819.85 frames. 
], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:27:13,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1559502.0, ans=0.125 2023-06-23 19:27:24,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1559502.0, ans=0.125 2023-06-23 19:27:59,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1559622.0, ans=0.0 2023-06-23 19:28:31,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1559742.0, ans=0.125 2023-06-23 19:28:32,490 INFO [train.py:996] (1/4) Epoch 9, batch 16000, loss[loss=0.2132, simple_loss=0.311, pruned_loss=0.05765, over 21740.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3022, pruned_loss=0.07718, over 4259606.56 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:28:37,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1559742.0, ans=0.125 2023-06-23 19:28:58,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1559802.0, ans=0.125 2023-06-23 19:29:07,947 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.766e+02 5.425e+02 7.645e+02 1.261e+03 2.910e+03, threshold=1.529e+03, percent-clipped=25.0 2023-06-23 19:29:10,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1559862.0, ans=0.0 2023-06-23 19:29:18,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1559862.0, ans=0.035 2023-06-23 19:29:34,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1559922.0, ans=0.2 2023-06-23 19:30:09,383 INFO [train.py:996] (1/4) Epoch 9, batch 16050, loss[loss=0.2352, simple_loss=0.3415, pruned_loss=0.06449, over 21809.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3065, pruned_loss=0.07546, over 4267162.08 frames. ], batch size: 282, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:30:13,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1560042.0, ans=0.2 2023-06-23 19:30:19,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1560042.0, ans=0.07 2023-06-23 19:30:36,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 19:31:00,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-23 19:31:48,823 INFO [train.py:996] (1/4) Epoch 9, batch 16100, loss[loss=0.2436, simple_loss=0.3099, pruned_loss=0.08869, over 21837.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3131, pruned_loss=0.0774, over 4273334.95 frames. ], batch size: 282, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:32:22,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. 
limit=15.0 2023-06-23 19:32:24,537 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.839e+02 1.039e+03 1.501e+03 2.959e+03, threshold=2.078e+03, percent-clipped=23.0 2023-06-23 19:33:29,459 INFO [train.py:996] (1/4) Epoch 9, batch 16150, loss[loss=0.244, simple_loss=0.3687, pruned_loss=0.05963, over 20840.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3126, pruned_loss=0.08, over 4283434.00 frames. ], batch size: 608, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:33:29,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1560642.0, ans=0.125 2023-06-23 19:33:29,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1560642.0, ans=0.0 2023-06-23 19:33:38,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1560642.0, ans=0.2 2023-06-23 19:33:57,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1560702.0, ans=0.2 2023-06-23 19:34:02,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560702.0, ans=0.1 2023-06-23 19:34:02,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1560702.0, ans=0.2 2023-06-23 19:34:02,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1560702.0, ans=0.0 2023-06-23 19:34:59,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1560882.0, ans=0.1 2023-06-23 19:35:08,343 INFO [train.py:996] (1/4) Epoch 9, batch 16200, loss[loss=0.3014, simple_loss=0.3591, pruned_loss=0.1218, over 21243.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3143, pruned_loss=0.08098, over 4281672.16 frames. ], batch size: 143, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:35:12,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-06-23 19:35:23,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1560942.0, ans=0.0 2023-06-23 19:35:27,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1560942.0, ans=0.125 2023-06-23 19:35:34,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1561002.0, ans=0.0 2023-06-23 19:35:34,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1561002.0, ans=0.0 2023-06-23 19:35:45,989 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.932e+02 6.489e+02 9.592e+02 1.250e+03 2.736e+03, threshold=1.918e+03, percent-clipped=7.0 2023-06-23 19:36:07,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1561062.0, ans=0.04949747468305833 2023-06-23 19:36:23,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1561122.0, ans=0.125 2023-06-23 19:36:36,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-23 19:36:55,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561242.0, ans=0.1 2023-06-23 19:36:56,515 INFO [train.py:996] (1/4) Epoch 9, batch 16250, loss[loss=0.268, simple_loss=0.3264, pruned_loss=0.1049, over 21460.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3149, pruned_loss=0.08154, over 4285064.92 frames. ], batch size: 509, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:37:40,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1561362.0, ans=0.125 2023-06-23 19:38:36,189 INFO [train.py:996] (1/4) Epoch 9, batch 16300, loss[loss=0.1918, simple_loss=0.2863, pruned_loss=0.04861, over 21704.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3076, pruned_loss=0.07752, over 4283698.80 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:38:46,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1561542.0, ans=0.0 2023-06-23 19:39:03,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1561602.0, ans=0.125 2023-06-23 19:39:18,115 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.893e+02 6.873e+02 9.809e+02 2.054e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 19:39:38,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1561722.0, ans=0.0 2023-06-23 19:40:09,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1561782.0, ans=0.0 2023-06-23 19:40:15,725 INFO [train.py:996] (1/4) Epoch 9, batch 16350, loss[loss=0.2343, simple_loss=0.3028, pruned_loss=0.08297, over 21419.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3074, pruned_loss=0.0774, over 4287328.23 frames. 
], batch size: 194, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:40:16,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1561842.0, ans=0.125 2023-06-23 19:41:00,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1561962.0, ans=0.125 2023-06-23 19:41:24,969 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-23 19:41:26,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1562022.0, ans=0.125 2023-06-23 19:41:54,457 INFO [train.py:996] (1/4) Epoch 9, batch 16400, loss[loss=0.2102, simple_loss=0.2795, pruned_loss=0.07041, over 21461.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.0782, over 4282231.76 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:41:59,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1562142.0, ans=0.2 2023-06-23 19:42:37,807 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 4.704e+02 7.326e+02 1.027e+03 2.811e+03, threshold=1.465e+03, percent-clipped=10.0 2023-06-23 19:43:34,533 INFO [train.py:996] (1/4) Epoch 9, batch 16450, loss[loss=0.2279, simple_loss=0.2965, pruned_loss=0.07966, over 17510.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3108, pruned_loss=0.07886, over 4284148.66 frames. ], batch size: 63, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:43:37,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1562442.0, ans=0.125 2023-06-23 19:43:43,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1562442.0, ans=0.0 2023-06-23 19:43:48,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1562442.0, ans=0.125 2023-06-23 19:44:03,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.30 vs. limit=12.0 2023-06-23 19:44:19,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.01 vs. limit=10.0 2023-06-23 19:44:25,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1562562.0, ans=0.125 2023-06-23 19:44:42,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-23 19:45:15,108 INFO [train.py:996] (1/4) Epoch 9, batch 16500, loss[loss=0.2264, simple_loss=0.3076, pruned_loss=0.07263, over 21845.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3098, pruned_loss=0.07946, over 4284918.80 frames. 
], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:45:15,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1562742.0, ans=0.0 2023-06-23 19:46:03,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.770e+02 9.881e+02 1.345e+03 3.319e+03, threshold=1.976e+03, percent-clipped=17.0 2023-06-23 19:46:18,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1562922.0, ans=0.125 2023-06-23 19:46:25,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1562922.0, ans=0.125 2023-06-23 19:46:40,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1562982.0, ans=0.0 2023-06-23 19:46:56,226 INFO [train.py:996] (1/4) Epoch 9, batch 16550, loss[loss=0.2472, simple_loss=0.329, pruned_loss=0.08271, over 21851.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3103, pruned_loss=0.07806, over 4286261.77 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:47:04,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1563042.0, ans=0.2 2023-06-23 19:47:12,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1563042.0, ans=0.0 2023-06-23 19:47:13,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=12.0 2023-06-23 19:47:33,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-23 19:47:54,393 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:48:06,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1563222.0, ans=0.125 2023-06-23 19:48:07,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1563222.0, ans=0.2 2023-06-23 19:48:17,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1563222.0, ans=0.5 2023-06-23 19:48:47,692 INFO [train.py:996] (1/4) Epoch 9, batch 16600, loss[loss=0.3023, simple_loss=0.4012, pruned_loss=0.1017, over 21890.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3197, pruned_loss=0.08158, over 4288177.05 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:48:50,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.91 vs. 
limit=10.0 2023-06-23 19:49:26,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.870e+02 6.782e+02 8.603e+02 1.169e+03 2.865e+03, threshold=1.721e+03, percent-clipped=6.0 2023-06-23 19:49:46,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1563522.0, ans=0.0 2023-06-23 19:49:49,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1563522.0, ans=0.125 2023-06-23 19:50:28,641 INFO [train.py:996] (1/4) Epoch 9, batch 16650, loss[loss=0.2381, simple_loss=0.3178, pruned_loss=0.07919, over 21814.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3275, pruned_loss=0.08338, over 4280749.85 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:50:50,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1563702.0, ans=0.125 2023-06-23 19:52:05,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-23 19:52:08,239 INFO [train.py:996] (1/4) Epoch 9, batch 16700, loss[loss=0.2614, simple_loss=0.3456, pruned_loss=0.08864, over 21670.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3294, pruned_loss=0.08403, over 4282116.67 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:52:23,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1563942.0, ans=0.0 2023-06-23 19:52:25,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1563942.0, ans=0.125 2023-06-23 19:52:31,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1564002.0, ans=0.1 2023-06-23 19:52:57,954 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.483e+02 5.925e+02 7.840e+02 1.052e+03 2.518e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 19:53:05,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1564062.0, ans=0.2 2023-06-23 19:53:07,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1564062.0, ans=0.125 2023-06-23 19:53:27,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-23 19:53:47,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=22.5 2023-06-23 19:53:58,335 INFO [train.py:996] (1/4) Epoch 9, batch 16750, loss[loss=0.3254, simple_loss=0.3864, pruned_loss=0.1322, over 21788.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3313, pruned_loss=0.08705, over 4282829.97 frames. 
], batch size: 441, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:54:02,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1564242.0, ans=0.125 2023-06-23 19:54:03,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1564242.0, ans=15.0 2023-06-23 19:55:01,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1564422.0, ans=0.025 2023-06-23 19:55:20,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1564422.0, ans=0.125 2023-06-23 19:55:43,495 INFO [train.py:996] (1/4) Epoch 9, batch 16800, loss[loss=0.2355, simple_loss=0.3114, pruned_loss=0.07981, over 21855.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3361, pruned_loss=0.08736, over 4276195.32 frames. ], batch size: 298, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:56:26,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.449e+02 1.121e+03 2.457e+03, threshold=1.690e+03, percent-clipped=14.0 2023-06-23 19:57:23,023 INFO [train.py:996] (1/4) Epoch 9, batch 16850, loss[loss=0.2473, simple_loss=0.3867, pruned_loss=0.05389, over 20760.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3315, pruned_loss=0.08705, over 4279073.51 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:57:25,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1564842.0, ans=0.0 2023-06-23 19:57:36,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1564842.0, ans=0.125 2023-06-23 19:57:50,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1564902.0, ans=0.125 2023-06-23 19:58:52,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1565082.0, ans=0.04949747468305833 2023-06-23 19:59:05,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-23 19:59:07,908 INFO [train.py:996] (1/4) Epoch 9, batch 16900, loss[loss=0.2123, simple_loss=0.2812, pruned_loss=0.07171, over 21753.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3243, pruned_loss=0.08433, over 4275467.06 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 19:59:12,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1565142.0, ans=0.07 2023-06-23 19:59:17,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565142.0, ans=0.1 2023-06-23 19:59:45,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1565262.0, ans=0.0 2023-06-23 19:59:46,596 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 4.767e+02 6.726e+02 1.260e+03 2.714e+03, threshold=1.345e+03, percent-clipped=10.0 2023-06-23 20:00:11,404 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.24 vs. 
limit=15.0 2023-06-23 20:00:45,065 INFO [train.py:996] (1/4) Epoch 9, batch 16950, loss[loss=0.2158, simple_loss=0.2867, pruned_loss=0.07248, over 21770.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3166, pruned_loss=0.08257, over 4280068.17 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:01:49,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0 2023-06-23 20:02:23,844 INFO [train.py:996] (1/4) Epoch 9, batch 17000, loss[loss=0.2539, simple_loss=0.3198, pruned_loss=0.09402, over 21856.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3136, pruned_loss=0.08327, over 4286114.13 frames. ], batch size: 391, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:03:04,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.656e+02 4.823e+02 5.834e+02 7.335e+02 1.533e+03, threshold=1.167e+03, percent-clipped=2.0 2023-06-23 20:04:02,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565982.0, ans=0.1 2023-06-23 20:04:05,386 INFO [train.py:996] (1/4) Epoch 9, batch 17050, loss[loss=0.2349, simple_loss=0.3177, pruned_loss=0.07602, over 21293.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3206, pruned_loss=0.08561, over 4289384.90 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:05:20,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-23 20:05:31,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1566282.0, ans=0.125 2023-06-23 20:05:44,752 INFO [train.py:996] (1/4) Epoch 9, batch 17100, loss[loss=0.2504, simple_loss=0.3201, pruned_loss=0.09032, over 21882.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3187, pruned_loss=0.08633, over 4288183.89 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:05:56,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1566342.0, ans=0.125 2023-06-23 20:06:24,479 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 5.247e+02 7.699e+02 1.064e+03 2.324e+03, threshold=1.540e+03, percent-clipped=17.0 2023-06-23 20:06:46,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1566522.0, ans=0.125 2023-06-23 20:06:51,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1566522.0, ans=0.0 2023-06-23 20:07:15,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1566582.0, ans=0.2 2023-06-23 20:07:24,730 INFO [train.py:996] (1/4) Epoch 9, batch 17150, loss[loss=0.2212, simple_loss=0.2884, pruned_loss=0.07703, over 21264.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3152, pruned_loss=0.0859, over 4296156.05 frames. ], batch size: 143, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:08:16,445 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. 
limit=10.0 2023-06-23 20:09:05,729 INFO [train.py:996] (1/4) Epoch 9, batch 17200, loss[loss=0.2597, simple_loss=0.3222, pruned_loss=0.09862, over 21817.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3166, pruned_loss=0.08634, over 4289766.08 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:09:08,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1566942.0, ans=15.0 2023-06-23 20:09:19,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1566942.0, ans=0.07 2023-06-23 20:09:29,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1567002.0, ans=0.125 2023-06-23 20:09:44,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1567002.0, ans=0.125 2023-06-23 20:09:53,662 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 8.105e+02 1.085e+03 1.487e+03 3.292e+03, threshold=2.169e+03, percent-clipped=22.0 2023-06-23 20:10:28,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1567122.0, ans=0.0 2023-06-23 20:10:52,891 INFO [train.py:996] (1/4) Epoch 9, batch 17250, loss[loss=0.2454, simple_loss=0.324, pruned_loss=0.08346, over 21634.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3189, pruned_loss=0.0878, over 4287449.45 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:11:18,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1567302.0, ans=0.125 2023-06-23 20:11:45,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-23 20:11:57,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1567422.0, ans=0.125 2023-06-23 20:11:58,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1567422.0, ans=0.125 2023-06-23 20:12:02,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1567422.0, ans=0.125 2023-06-23 20:12:26,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1567482.0, ans=0.125 2023-06-23 20:12:35,707 INFO [train.py:996] (1/4) Epoch 9, batch 17300, loss[loss=0.299, simple_loss=0.3667, pruned_loss=0.1156, over 21323.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3285, pruned_loss=0.09191, over 4290666.77 frames. 
], batch size: 143, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:13:15,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1567602.0, ans=0.2 2023-06-23 20:13:28,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 5.597e+02 7.560e+02 1.039e+03 2.489e+03, threshold=1.512e+03, percent-clipped=1.0 2023-06-23 20:13:39,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1567722.0, ans=0.1 2023-06-23 20:13:43,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1567722.0, ans=0.07 2023-06-23 20:14:21,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1567842.0, ans=0.0 2023-06-23 20:14:22,082 INFO [train.py:996] (1/4) Epoch 9, batch 17350, loss[loss=0.2862, simple_loss=0.3883, pruned_loss=0.09199, over 20794.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3276, pruned_loss=0.09044, over 4288327.46 frames. ], batch size: 607, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:14:24,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-23 20:14:42,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1567902.0, ans=0.125 2023-06-23 20:14:44,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1567902.0, ans=0.125 2023-06-23 20:15:01,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1567902.0, ans=0.015 2023-06-23 20:15:28,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568022.0, ans=0.1 2023-06-23 20:15:28,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568022.0, ans=0.1 2023-06-23 20:15:34,815 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-23 20:16:08,003 INFO [train.py:996] (1/4) Epoch 9, batch 17400, loss[loss=0.2758, simple_loss=0.3592, pruned_loss=0.09617, over 21591.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3245, pruned_loss=0.08714, over 4282796.55 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:16:22,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.84 vs. 
limit=15.0 2023-06-23 20:16:36,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568202.0, ans=0.0 2023-06-23 20:16:48,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1568262.0, ans=0.125 2023-06-23 20:16:55,295 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 5.667e+02 9.192e+02 1.513e+03 3.310e+03, threshold=1.838e+03, percent-clipped=24.0 2023-06-23 20:16:59,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1568262.0, ans=0.125 2023-06-23 20:17:42,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568382.0, ans=0.1 2023-06-23 20:17:42,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568382.0, ans=0.0 2023-06-23 20:17:49,326 INFO [train.py:996] (1/4) Epoch 9, batch 17450, loss[loss=0.1786, simple_loss=0.2685, pruned_loss=0.04437, over 21772.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3207, pruned_loss=0.08362, over 4264526.92 frames. ], batch size: 282, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:19:10,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568682.0, ans=0.125 2023-06-23 20:19:25,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1568682.0, ans=0.125 2023-06-23 20:19:25,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568682.0, ans=0.1 2023-06-23 20:19:27,730 INFO [train.py:996] (1/4) Epoch 9, batch 17500, loss[loss=0.2577, simple_loss=0.3694, pruned_loss=0.07297, over 19827.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3179, pruned_loss=0.08141, over 4264040.75 frames. ], batch size: 703, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:19:42,210 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:20:19,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.530e+02 5.781e+02 8.305e+02 1.162e+03 2.249e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 20:20:25,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1568862.0, ans=0.2 2023-06-23 20:20:26,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-23 20:20:35,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1568922.0, ans=0.0 2023-06-23 20:21:04,910 INFO [train.py:996] (1/4) Epoch 9, batch 17550, loss[loss=0.2433, simple_loss=0.3261, pruned_loss=0.0802, over 21797.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3183, pruned_loss=0.08008, over 4271792.67 frames. 
], batch size: 124, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:22:18,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1569222.0, ans=0.125 2023-06-23 20:22:28,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1569282.0, ans=0.125 2023-06-23 20:22:43,274 INFO [train.py:996] (1/4) Epoch 9, batch 17600, loss[loss=0.295, simple_loss=0.3616, pruned_loss=0.1142, over 21529.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3204, pruned_loss=0.08076, over 4270684.07 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:22:43,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569342.0, ans=0.1 2023-06-23 20:23:38,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.641e+02 7.220e+02 1.088e+03 2.051e+03, threshold=1.444e+03, percent-clipped=1.0 2023-06-23 20:24:01,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1569522.0, ans=0.2 2023-06-23 20:24:11,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-23 20:24:16,753 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-23 20:24:20,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1569582.0, ans=0.2 2023-06-23 20:24:25,259 INFO [train.py:996] (1/4) Epoch 9, batch 17650, loss[loss=0.1962, simple_loss=0.2509, pruned_loss=0.07072, over 21275.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3158, pruned_loss=0.0805, over 4263425.45 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:25:17,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1569762.0, ans=0.0 2023-06-23 20:25:17,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569762.0, ans=0.1 2023-06-23 20:25:17,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1569762.0, ans=0.0 2023-06-23 20:25:31,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1569822.0, ans=0.125 2023-06-23 20:26:05,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1569882.0, ans=0.125 2023-06-23 20:26:14,548 INFO [train.py:996] (1/4) Epoch 9, batch 17700, loss[loss=0.2427, simple_loss=0.32, pruned_loss=0.08267, over 21377.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3111, pruned_loss=0.07779, over 4261425.78 frames. 
], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:27:02,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 6.405e+02 1.162e+03 1.769e+03 3.070e+03, threshold=2.325e+03, percent-clipped=36.0 2023-06-23 20:27:41,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1570182.0, ans=0.125 2023-06-23 20:27:53,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1570242.0, ans=0.0 2023-06-23 20:27:54,412 INFO [train.py:996] (1/4) Epoch 9, batch 17750, loss[loss=0.2635, simple_loss=0.3396, pruned_loss=0.09371, over 21750.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3165, pruned_loss=0.08028, over 4268645.89 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:28:00,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-23 20:28:16,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1570302.0, ans=0.1 2023-06-23 20:28:16,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=22.5 2023-06-23 20:29:26,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1570482.0, ans=0.1 2023-06-23 20:29:40,676 INFO [train.py:996] (1/4) Epoch 9, batch 17800, loss[loss=0.2039, simple_loss=0.3072, pruned_loss=0.05027, over 19814.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3177, pruned_loss=0.08072, over 4272944.41 frames. ], batch size: 702, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:29:46,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1570542.0, ans=0.125 2023-06-23 20:30:01,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-23 20:30:15,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1570602.0, ans=0.0 2023-06-23 20:30:24,624 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 7.073e+02 9.847e+02 1.535e+03 2.589e+03, threshold=1.969e+03, percent-clipped=1.0 2023-06-23 20:30:50,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1570722.0, ans=0.0 2023-06-23 20:31:17,182 INFO [train.py:996] (1/4) Epoch 9, batch 17850, loss[loss=0.2867, simple_loss=0.3563, pruned_loss=0.1086, over 21764.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3168, pruned_loss=0.0811, over 4259712.76 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:32:26,121 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:32:56,166 INFO [train.py:996] (1/4) Epoch 9, batch 17900, loss[loss=0.2774, simple_loss=0.3858, pruned_loss=0.08444, over 20819.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3212, pruned_loss=0.0821, over 4264373.35 frames. 
], batch size: 608, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:33:08,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1571142.0, ans=0.0 2023-06-23 20:33:11,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1571142.0, ans=0.0 2023-06-23 20:33:15,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1571142.0, ans=0.125 2023-06-23 20:33:23,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.84 vs. limit=22.5 2023-06-23 20:33:31,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1571202.0, ans=0.0 2023-06-23 20:33:49,667 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 5.757e+02 7.357e+02 9.993e+02 2.226e+03, threshold=1.471e+03, percent-clipped=2.0 2023-06-23 20:34:13,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1571322.0, ans=0.125 2023-06-23 20:34:14,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1571322.0, ans=0.125 2023-06-23 20:34:41,354 INFO [train.py:996] (1/4) Epoch 9, batch 17950, loss[loss=0.2359, simple_loss=0.3276, pruned_loss=0.0721, over 21622.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3216, pruned_loss=0.07933, over 4267539.12 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:35:11,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1571502.0, ans=0.0 2023-06-23 20:35:11,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1571502.0, ans=0.1 2023-06-23 20:36:10,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1571682.0, ans=0.0 2023-06-23 20:36:19,117 INFO [train.py:996] (1/4) Epoch 9, batch 18000, loss[loss=0.1917, simple_loss=0.2735, pruned_loss=0.0549, over 20759.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3152, pruned_loss=0.07772, over 4262994.72 frames. ], batch size: 607, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:36:19,118 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 20:36:35,999 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2626, simple_loss=0.3575, pruned_loss=0.08385, over 1796401.00 frames. 2023-06-23 20:36:36,000 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 20:36:38,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. 
limit=10.0 2023-06-23 20:36:55,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1571802.0, ans=0.0 2023-06-23 20:37:17,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1571862.0, ans=0.0 2023-06-23 20:37:29,992 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.324e+02 4.978e+02 7.262e+02 1.032e+03 1.973e+03, threshold=1.452e+03, percent-clipped=7.0 2023-06-23 20:38:19,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1572042.0, ans=0.125 2023-06-23 20:38:20,233 INFO [train.py:996] (1/4) Epoch 9, batch 18050, loss[loss=0.2543, simple_loss=0.3254, pruned_loss=0.09166, over 21715.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3097, pruned_loss=0.07704, over 4263373.92 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:38:40,496 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:38:44,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1572102.0, ans=0.1 2023-06-23 20:39:11,522 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 20:40:00,664 INFO [train.py:996] (1/4) Epoch 9, batch 18100, loss[loss=0.2372, simple_loss=0.3187, pruned_loss=0.07788, over 21253.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3136, pruned_loss=0.07851, over 4263825.40 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:40:25,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1572402.0, ans=0.07 2023-06-23 20:40:55,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.547e+02 7.703e+02 1.178e+03 2.128e+03, threshold=1.541e+03, percent-clipped=13.0 2023-06-23 20:41:38,962 INFO [train.py:996] (1/4) Epoch 9, batch 18150, loss[loss=0.2329, simple_loss=0.3042, pruned_loss=0.08079, over 21660.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.315, pruned_loss=0.07817, over 4264833.94 frames. ], batch size: 415, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:42:32,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.67 vs. limit=15.0 2023-06-23 20:42:37,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1572762.0, ans=0.125 2023-06-23 20:42:39,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1572822.0, ans=0.125 2023-06-23 20:43:00,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-23 20:43:12,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1572882.0, ans=0.125 2023-06-23 20:43:13,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. 
limit=6.0 2023-06-23 20:43:15,634 INFO [train.py:996] (1/4) Epoch 9, batch 18200, loss[loss=0.2391, simple_loss=0.3021, pruned_loss=0.08804, over 21610.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3091, pruned_loss=0.07844, over 4257978.19 frames. ], batch size: 415, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:44:03,318 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.079e+02 5.748e+02 7.422e+02 1.064e+03 2.381e+03, threshold=1.484e+03, percent-clipped=9.0 2023-06-23 20:44:10,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-23 20:44:12,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1573122.0, ans=0.0 2023-06-23 20:44:26,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-23 20:44:28,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1573122.0, ans=0.125 2023-06-23 20:44:42,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573182.0, ans=0.1 2023-06-23 20:44:51,700 INFO [train.py:996] (1/4) Epoch 9, batch 18250, loss[loss=0.2247, simple_loss=0.2901, pruned_loss=0.07968, over 21638.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.302, pruned_loss=0.07657, over 4261940.24 frames. ], batch size: 195, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:45:14,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=22.5 2023-06-23 20:45:17,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573302.0, ans=0.125 2023-06-23 20:45:51,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1573422.0, ans=0.1 2023-06-23 20:45:58,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1573422.0, ans=0.125 2023-06-23 20:46:30,607 INFO [train.py:996] (1/4) Epoch 9, batch 18300, loss[loss=0.2582, simple_loss=0.3657, pruned_loss=0.07532, over 21823.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3027, pruned_loss=0.07696, over 4269729.71 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:46:45,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1573602.0, ans=0.2 2023-06-23 20:47:14,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 5.362e+02 7.234e+02 9.653e+02 2.224e+03, threshold=1.447e+03, percent-clipped=7.0 2023-06-23 20:47:34,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-23 20:47:47,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1573782.0, ans=0.2 2023-06-23 20:47:49,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1573782.0, ans=0.125 2023-06-23 20:48:08,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1573842.0, ans=0.0 2023-06-23 20:48:09,467 INFO [train.py:996] (1/4) Epoch 9, batch 18350, loss[loss=0.2657, simple_loss=0.3279, pruned_loss=0.1018, over 21285.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3093, pruned_loss=0.07708, over 4267850.59 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:48:19,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1573842.0, ans=0.125 2023-06-23 20:48:23,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1573902.0, ans=0.05 2023-06-23 20:48:42,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1573902.0, ans=0.125 2023-06-23 20:48:44,649 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-23 20:49:09,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.44 vs. limit=10.0 2023-06-23 20:49:20,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1574022.0, ans=0.0 2023-06-23 20:49:21,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-23 20:49:34,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-23 20:49:48,149 INFO [train.py:996] (1/4) Epoch 9, batch 18400, loss[loss=0.1789, simple_loss=0.2657, pruned_loss=0.04603, over 21477.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3045, pruned_loss=0.07584, over 4265728.48 frames. 
], batch size: 212, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:49:50,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1574142.0, ans=0.2 2023-06-23 20:50:21,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1574202.0, ans=0.125 2023-06-23 20:50:29,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1574262.0, ans=0.04949747468305833 2023-06-23 20:50:38,669 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.618e+02 5.691e+02 8.490e+02 1.304e+03 3.377e+03, threshold=1.698e+03, percent-clipped=15.0 2023-06-23 20:50:53,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1574322.0, ans=0.2 2023-06-23 20:50:57,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1574322.0, ans=0.04949747468305833 2023-06-23 20:51:11,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1574382.0, ans=0.0 2023-06-23 20:51:19,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574382.0, ans=0.1 2023-06-23 20:51:24,279 INFO [train.py:996] (1/4) Epoch 9, batch 18450, loss[loss=0.1817, simple_loss=0.2743, pruned_loss=0.04452, over 21826.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3003, pruned_loss=0.07257, over 4257793.34 frames. ], batch size: 317, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:52:03,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1574562.0, ans=0.0 2023-06-23 20:52:11,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1574562.0, ans=0.125 2023-06-23 20:52:33,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1574622.0, ans=0.125 2023-06-23 20:52:50,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1574682.0, ans=0.125 2023-06-23 20:53:03,441 INFO [train.py:996] (1/4) Epoch 9, batch 18500, loss[loss=0.2258, simple_loss=0.2902, pruned_loss=0.08066, over 21743.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.296, pruned_loss=0.07126, over 4255858.64 frames. ], batch size: 112, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:53:57,760 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.113e+02 7.633e+02 1.101e+03 4.944e+03, threshold=1.527e+03, percent-clipped=5.0 2023-06-23 20:54:33,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1574982.0, ans=0.0 2023-06-23 20:54:41,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-23 20:54:42,164 INFO [train.py:996] (1/4) Epoch 9, batch 18550, loss[loss=0.2438, simple_loss=0.3002, pruned_loss=0.09372, over 21778.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2938, pruned_loss=0.07078, over 4248516.29 frames. 
], batch size: 371, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:56:21,051 INFO [train.py:996] (1/4) Epoch 9, batch 18600, loss[loss=0.2475, simple_loss=0.3216, pruned_loss=0.08669, over 21556.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2914, pruned_loss=0.07125, over 4233425.30 frames. ], batch size: 389, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:56:24,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1575342.0, ans=0.0 2023-06-23 20:57:16,747 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.217e+02 6.797e+02 9.080e+02 2.355e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 20:57:25,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1575522.0, ans=0.125 2023-06-23 20:57:39,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1575582.0, ans=0.125 2023-06-23 20:57:44,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1575582.0, ans=0.1 2023-06-23 20:57:44,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1575582.0, ans=0.1 2023-06-23 20:57:59,925 INFO [train.py:996] (1/4) Epoch 9, batch 18650, loss[loss=0.1952, simple_loss=0.253, pruned_loss=0.06872, over 16999.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2913, pruned_loss=0.07154, over 4220686.29 frames. ], batch size: 66, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:58:25,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1575702.0, ans=0.0 2023-06-23 20:58:27,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-23 20:59:15,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1575822.0, ans=0.0 2023-06-23 20:59:18,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.33 vs. limit=15.0 2023-06-23 20:59:27,548 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:59:33,526 INFO [train.py:996] (1/4) Epoch 9, batch 18700, loss[loss=0.2188, simple_loss=0.2829, pruned_loss=0.07731, over 21707.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2896, pruned_loss=0.07324, over 4235177.96 frames. ], batch size: 416, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:59:41,595 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:00:27,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.925e+02 8.021e+02 1.131e+03 2.066e+03, threshold=1.604e+03, percent-clipped=15.0 2023-06-23 21:00:47,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1576122.0, ans=0.125 2023-06-23 21:01:10,844 INFO [train.py:996] (1/4) Epoch 9, batch 18750, loss[loss=0.2336, simple_loss=0.3078, pruned_loss=0.07967, over 21628.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.291, pruned_loss=0.07542, over 4255762.04 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:01:47,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1576362.0, ans=0.0 2023-06-23 21:02:50,414 INFO [train.py:996] (1/4) Epoch 9, batch 18800, loss[loss=0.1997, simple_loss=0.2845, pruned_loss=0.05747, over 21381.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2991, pruned_loss=0.07746, over 4264780.53 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:02:50,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1576542.0, ans=0.02 2023-06-23 21:03:03,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1576542.0, ans=0.125 2023-06-23 21:03:05,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1576542.0, ans=0.2 2023-06-23 21:03:42,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576662.0, ans=0.1 2023-06-23 21:03:44,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-23 21:03:48,145 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.728e+02 5.591e+02 7.847e+02 1.340e+03 4.014e+03, threshold=1.569e+03, percent-clipped=18.0 2023-06-23 21:04:01,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1576722.0, ans=0.015 2023-06-23 21:04:23,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576842.0, ans=0.1 2023-06-23 21:04:24,344 INFO [train.py:996] (1/4) Epoch 9, batch 18850, loss[loss=0.1645, simple_loss=0.2487, pruned_loss=0.04012, over 21431.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2994, pruned_loss=0.0739, over 4259439.91 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:04:34,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1576842.0, ans=0.0 2023-06-23 21:04:40,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1576842.0, ans=0.1 2023-06-23 21:04:43,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1576902.0, ans=0.0 2023-06-23 21:05:02,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1576902.0, ans=0.125 2023-06-23 21:06:01,651 INFO [train.py:996] (1/4) Epoch 9, batch 18900, loss[loss=0.2197, simple_loss=0.2894, pruned_loss=0.07498, over 21419.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2945, pruned_loss=0.073, over 4258739.06 frames. 
], batch size: 131, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:06:17,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1577142.0, ans=0.035 2023-06-23 21:06:35,882 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:07:02,792 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.570e+02 8.138e+02 1.105e+03 2.529e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-23 21:07:05,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-23 21:07:40,594 INFO [train.py:996] (1/4) Epoch 9, batch 18950, loss[loss=0.2683, simple_loss=0.3561, pruned_loss=0.09025, over 21790.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2943, pruned_loss=0.07455, over 4264805.65 frames. ], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:07:51,341 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-23 21:08:17,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1577502.0, ans=10.0 2023-06-23 21:09:03,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577682.0, ans=0.1 2023-06-23 21:09:13,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1577682.0, ans=0.05 2023-06-23 21:09:16,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1577682.0, ans=0.0 2023-06-23 21:09:25,300 INFO [train.py:996] (1/4) Epoch 9, batch 19000, loss[loss=0.2421, simple_loss=0.3259, pruned_loss=0.07912, over 21691.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3037, pruned_loss=0.07634, over 4258850.06 frames. ], batch size: 332, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:09:35,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1577742.0, ans=0.1 2023-06-23 21:10:22,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.803e+02 7.303e+02 9.676e+02 2.097e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 21:10:28,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1577922.0, ans=22.5 2023-06-23 21:11:04,908 INFO [train.py:996] (1/4) Epoch 9, batch 19050, loss[loss=0.2485, simple_loss=0.3155, pruned_loss=0.09073, over 21943.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3087, pruned_loss=0.07977, over 4257636.37 frames. ], batch size: 113, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:12:07,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1578222.0, ans=0.09899494936611666 2023-06-23 21:12:15,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1578222.0, ans=0.125 2023-06-23 21:12:44,210 INFO [train.py:996] (1/4) Epoch 9, batch 19100, loss[loss=0.2124, simple_loss=0.2784, pruned_loss=0.07319, over 21397.00 frames. 
], tot_loss[loss=0.2359, simple_loss=0.3082, pruned_loss=0.08184, over 4255635.52 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:12:46,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1578342.0, ans=0.125 2023-06-23 21:12:56,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1578342.0, ans=0.0 2023-06-23 21:13:38,238 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.144e+02 5.907e+02 9.492e+02 1.356e+03 2.303e+03, threshold=1.898e+03, percent-clipped=18.0 2023-06-23 21:14:26,149 INFO [train.py:996] (1/4) Epoch 9, batch 19150, loss[loss=0.3404, simple_loss=0.4175, pruned_loss=0.1316, over 21581.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.311, pruned_loss=0.08261, over 4257717.47 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:15:58,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1578882.0, ans=0.125 2023-06-23 21:16:07,501 INFO [train.py:996] (1/4) Epoch 9, batch 19200, loss[loss=0.2507, simple_loss=0.3515, pruned_loss=0.07498, over 21400.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3194, pruned_loss=0.0824, over 4265915.43 frames. ], batch size: 194, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:17:01,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 6.099e+02 9.205e+02 1.363e+03 2.424e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-23 21:17:47,716 INFO [train.py:996] (1/4) Epoch 9, batch 19250, loss[loss=0.188, simple_loss=0.2864, pruned_loss=0.04478, over 21690.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3196, pruned_loss=0.07788, over 4275743.28 frames. ], batch size: 298, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:18:02,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1579242.0, ans=0.1 2023-06-23 21:18:39,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1579362.0, ans=0.2 2023-06-23 21:18:45,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1579422.0, ans=0.2 2023-06-23 21:19:13,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1579482.0, ans=0.125 2023-06-23 21:19:22,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1579482.0, ans=0.0 2023-06-23 21:19:27,375 INFO [train.py:996] (1/4) Epoch 9, batch 19300, loss[loss=0.2293, simple_loss=0.297, pruned_loss=0.0808, over 21645.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3173, pruned_loss=0.07817, over 4285495.25 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:19:56,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1579602.0, ans=0.2 2023-06-23 21:20:20,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1579662.0, ans=0.125 2023-06-23 21:20:23,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.858e+02 7.673e+02 1.130e+03 2.664e+03, threshold=1.535e+03, percent-clipped=6.0 2023-06-23 21:20:36,911 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-23 21:21:18,272 INFO [train.py:996] (1/4) Epoch 9, batch 19350, loss[loss=0.2289, simple_loss=0.32, pruned_loss=0.06893, over 21699.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.313, pruned_loss=0.07477, over 4276278.91 frames. ], batch size: 391, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:21:41,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-23 21:21:46,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579902.0, ans=0.1 2023-06-23 21:22:36,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1580082.0, ans=0.125 2023-06-23 21:22:46,743 INFO [train.py:996] (1/4) Epoch 9, batch 19400, loss[loss=0.215, simple_loss=0.2827, pruned_loss=0.07365, over 21617.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3098, pruned_loss=0.07378, over 4274742.59 frames. ], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:23:03,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1580142.0, ans=0.2 2023-06-23 21:23:22,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1580202.0, ans=0.07 2023-06-23 21:23:30,892 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:23:41,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.598e+02 7.737e+02 1.076e+03 2.272e+03, threshold=1.547e+03, percent-clipped=6.0 2023-06-23 21:23:41,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1580262.0, ans=0.125 2023-06-23 21:23:56,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1580322.0, ans=0.0 2023-06-23 21:24:02,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1580322.0, ans=0.0 2023-06-23 21:24:03,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1580382.0, ans=0.2 2023-06-23 21:24:19,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1580382.0, ans=10.0 2023-06-23 21:24:36,310 INFO [train.py:996] (1/4) Epoch 9, batch 19450, loss[loss=0.2106, simple_loss=0.2786, pruned_loss=0.07127, over 21792.00 frames. ], tot_loss[loss=0.229, simple_loss=0.307, pruned_loss=0.07548, over 4281040.59 frames. 
], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:24:40,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1580442.0, ans=0.125 2023-06-23 21:24:57,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1580502.0, ans=0.125 2023-06-23 21:25:45,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1580622.0, ans=0.0 2023-06-23 21:25:46,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-23 21:25:53,094 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:25:56,161 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1580682.0, ans=0.125 2023-06-23 21:25:59,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1580682.0, ans=0.125 2023-06-23 21:26:17,334 INFO [train.py:996] (1/4) Epoch 9, batch 19500, loss[loss=0.2702, simple_loss=0.353, pruned_loss=0.0937, over 21159.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3044, pruned_loss=0.07686, over 4277693.74 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:27:07,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.161e+02 5.798e+02 8.156e+02 1.303e+03 2.400e+03, threshold=1.631e+03, percent-clipped=12.0 2023-06-23 21:27:08,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 21:27:10,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-23 21:27:13,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1580922.0, ans=0.125 2023-06-23 21:27:22,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1580922.0, ans=0.05 2023-06-23 21:27:57,890 INFO [train.py:996] (1/4) Epoch 9, batch 19550, loss[loss=0.1622, simple_loss=0.2152, pruned_loss=0.05465, over 21847.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2984, pruned_loss=0.07491, over 4278342.72 frames. ], batch size: 98, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:28:10,484 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-23 21:28:44,258 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:29:08,358 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. 
limit=15.0 2023-06-23 21:29:20,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1581282.0, ans=0.0 2023-06-23 21:29:22,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581282.0, ans=0.1 2023-06-23 21:29:37,028 INFO [train.py:996] (1/4) Epoch 9, batch 19600, loss[loss=0.3048, simple_loss=0.3664, pruned_loss=0.1216, over 21464.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3002, pruned_loss=0.07587, over 4276659.13 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:29:39,786 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-06-23 21:30:20,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-23 21:30:25,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.042e+02 7.862e+02 1.200e+03 2.695e+03, threshold=1.572e+03, percent-clipped=11.0 2023-06-23 21:30:27,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1581522.0, ans=0.125 2023-06-23 21:30:39,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-23 21:30:48,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1581582.0, ans=0.1 2023-06-23 21:30:58,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0 2023-06-23 21:31:15,790 INFO [train.py:996] (1/4) Epoch 9, batch 19650, loss[loss=0.2279, simple_loss=0.2974, pruned_loss=0.0792, over 21425.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3052, pruned_loss=0.07901, over 4280448.87 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:31:28,019 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:31:33,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-23 21:31:36,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581702.0, ans=0.1 2023-06-23 21:32:56,825 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-23 21:32:59,045 INFO [train.py:996] (1/4) Epoch 9, batch 19700, loss[loss=0.2277, simple_loss=0.3216, pruned_loss=0.06693, over 21713.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3075, pruned_loss=0.07983, over 4271474.05 frames. 
], batch size: 351, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:33:15,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1581942.0, ans=0.125 2023-06-23 21:34:05,979 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 5.626e+02 7.962e+02 1.128e+03 2.480e+03, threshold=1.592e+03, percent-clipped=10.0 2023-06-23 21:34:34,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-23 21:34:43,681 INFO [train.py:996] (1/4) Epoch 9, batch 19750, loss[loss=0.2625, simple_loss=0.3673, pruned_loss=0.07884, over 21753.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3175, pruned_loss=0.08111, over 4263088.86 frames. ], batch size: 332, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:34:53,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1582242.0, ans=0.125 2023-06-23 21:35:08,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1582302.0, ans=0.125 2023-06-23 21:35:51,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1582422.0, ans=0.0 2023-06-23 21:35:53,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1582422.0, ans=0.125 2023-06-23 21:35:58,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1582422.0, ans=0.2 2023-06-23 21:36:23,320 INFO [train.py:996] (1/4) Epoch 9, batch 19800, loss[loss=0.1805, simple_loss=0.2576, pruned_loss=0.05169, over 21405.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3156, pruned_loss=0.08192, over 4274470.95 frames. ], batch size: 211, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:37:03,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1582602.0, ans=0.125 2023-06-23 21:37:26,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.106e+02 6.339e+02 1.006e+03 1.413e+03 2.674e+03, threshold=2.011e+03, percent-clipped=18.0 2023-06-23 21:37:44,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1582782.0, ans=0.125 2023-06-23 21:38:04,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-23 21:38:05,355 INFO [train.py:996] (1/4) Epoch 9, batch 19850, loss[loss=0.1963, simple_loss=0.2692, pruned_loss=0.06166, over 21221.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3099, pruned_loss=0.07741, over 4270555.19 frames. 
], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:38:10,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1582842.0, ans=0.125 2023-06-23 21:38:36,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1582902.0, ans=0.125 2023-06-23 21:38:39,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1582902.0, ans=0.05 2023-06-23 21:39:28,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1583082.0, ans=0.0 2023-06-23 21:39:28,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1583082.0, ans=0.0 2023-06-23 21:39:32,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1583082.0, ans=0.0 2023-06-23 21:39:43,960 INFO [train.py:996] (1/4) Epoch 9, batch 19900, loss[loss=0.1897, simple_loss=0.2717, pruned_loss=0.0539, over 21832.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3085, pruned_loss=0.07427, over 4264638.47 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:40:17,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1583202.0, ans=0.125 2023-06-23 21:40:45,972 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.026e+02 6.181e+02 9.309e+02 2.570e+03, threshold=1.236e+03, percent-clipped=2.0 2023-06-23 21:41:19,618 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1583382.0, ans=0.0 2023-06-23 21:41:28,259 INFO [train.py:996] (1/4) Epoch 9, batch 19950, loss[loss=0.2155, simple_loss=0.2844, pruned_loss=0.07328, over 21770.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3035, pruned_loss=0.07422, over 4265542.77 frames. ], batch size: 118, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:41:33,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1583442.0, ans=0.04949747468305833 2023-06-23 21:42:31,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1583622.0, ans=0.125 2023-06-23 21:42:33,836 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:42:39,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1583622.0, ans=0.0 2023-06-23 21:43:07,072 INFO [train.py:996] (1/4) Epoch 9, batch 20000, loss[loss=0.2403, simple_loss=0.3046, pruned_loss=0.08801, over 21345.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3025, pruned_loss=0.07452, over 4252389.92 frames. 
], batch size: 159, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:43:13,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1583742.0, ans=0.125 2023-06-23 21:44:00,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583862.0, ans=0.1 2023-06-23 21:44:03,269 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.657e+02 5.210e+02 7.571e+02 1.068e+03 2.474e+03, threshold=1.514e+03, percent-clipped=20.0 2023-06-23 21:44:46,986 INFO [train.py:996] (1/4) Epoch 9, batch 20050, loss[loss=0.2438, simple_loss=0.3127, pruned_loss=0.08743, over 21797.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3062, pruned_loss=0.07752, over 4269338.78 frames. ], batch size: 247, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:44:53,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1584042.0, ans=0.0 2023-06-23 21:45:22,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1584102.0, ans=0.125 2023-06-23 21:45:36,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1584162.0, ans=0.125 2023-06-23 21:45:45,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1584222.0, ans=0.0 2023-06-23 21:45:49,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584222.0, ans=0.1 2023-06-23 21:46:09,123 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:46:28,207 INFO [train.py:996] (1/4) Epoch 9, batch 20100, loss[loss=0.2217, simple_loss=0.2991, pruned_loss=0.07218, over 21338.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08006, over 4279973.35 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:46:57,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1584402.0, ans=0.2 2023-06-23 21:47:11,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1584462.0, ans=0.1 2023-06-23 21:47:31,043 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:47:32,237 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.946e+02 5.376e+02 7.013e+02 1.127e+03 1.999e+03, threshold=1.403e+03, percent-clipped=12.0 2023-06-23 21:47:37,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1584522.0, ans=0.125 2023-06-23 21:47:50,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1584522.0, ans=0.1 2023-06-23 21:47:53,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-23 21:48:18,568 INFO [train.py:996] (1/4) Epoch 9, batch 20150, loss[loss=0.2494, simple_loss=0.325, pruned_loss=0.08687, over 21786.00 frames. 
], tot_loss[loss=0.2413, simple_loss=0.3165, pruned_loss=0.08302, over 4274109.18 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:48:19,117 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:48:30,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1584642.0, ans=0.1 2023-06-23 21:48:45,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1584702.0, ans=0.125 2023-06-23 21:48:57,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584762.0, ans=0.125 2023-06-23 21:49:15,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584762.0, ans=0.125 2023-06-23 21:49:35,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1584822.0, ans=0.2 2023-06-23 21:49:54,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1584882.0, ans=0.125 2023-06-23 21:50:01,819 INFO [train.py:996] (1/4) Epoch 9, batch 20200, loss[loss=0.2442, simple_loss=0.3549, pruned_loss=0.06672, over 19869.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3241, pruned_loss=0.08615, over 4270731.65 frames. ], batch size: 702, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:50:53,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585062.0, ans=0.125 2023-06-23 21:50:59,415 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 7.487e+02 1.027e+03 1.466e+03 2.661e+03, threshold=2.055e+03, percent-clipped=25.0 2023-06-23 21:51:09,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.71 vs. limit=6.0 2023-06-23 21:51:27,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1585182.0, ans=0.125 2023-06-23 21:51:41,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1585242.0, ans=0.125 2023-06-23 21:51:42,590 INFO [train.py:996] (1/4) Epoch 9, batch 20250, loss[loss=0.238, simple_loss=0.3011, pruned_loss=0.08744, over 21189.00 frames. ], tot_loss[loss=0.247, simple_loss=0.325, pruned_loss=0.08454, over 4274289.91 frames. ], batch size: 143, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:52:26,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1585362.0, ans=0.125 2023-06-23 21:52:40,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1585362.0, ans=0.0 2023-06-23 21:53:14,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1585482.0, ans=0.125 2023-06-23 21:53:22,029 INFO [train.py:996] (1/4) Epoch 9, batch 20300, loss[loss=0.218, simple_loss=0.2947, pruned_loss=0.07062, over 21163.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3231, pruned_loss=0.08143, over 4273991.37 frames. 
], batch size: 159, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:53:50,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1585602.0, ans=0.125 2023-06-23 21:54:28,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 5.917e+02 9.257e+02 1.332e+03 3.110e+03, threshold=1.851e+03, percent-clipped=6.0 2023-06-23 21:55:00,633 INFO [train.py:996] (1/4) Epoch 9, batch 20350, loss[loss=0.2526, simple_loss=0.3281, pruned_loss=0.08849, over 21900.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3217, pruned_loss=0.08056, over 4264700.55 frames. ], batch size: 118, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:55:02,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1585842.0, ans=0.0 2023-06-23 21:55:27,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1585902.0, ans=0.125 2023-06-23 21:55:57,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1585962.0, ans=0.125 2023-06-23 21:55:58,399 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 21:56:11,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1586022.0, ans=0.0 2023-06-23 21:56:29,244 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-23 21:56:37,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1586082.0, ans=0.95 2023-06-23 21:56:40,532 INFO [train.py:996] (1/4) Epoch 9, batch 20400, loss[loss=0.2864, simple_loss=0.3522, pruned_loss=0.1103, over 21640.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3256, pruned_loss=0.08378, over 4258853.91 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:57:42,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 5.797e+02 8.307e+02 1.162e+03 2.401e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 21:57:46,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586322.0, ans=0.1 2023-06-23 21:57:56,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-23 21:58:10,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1586382.0, ans=15.0 2023-06-23 21:58:15,250 INFO [train.py:996] (1/4) Epoch 9, batch 20450, loss[loss=0.1914, simple_loss=0.2645, pruned_loss=0.05913, over 21089.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3253, pruned_loss=0.08611, over 4254319.51 frames. 
], batch size: 608, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:58:19,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1586442.0, ans=0.125 2023-06-23 21:58:41,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586502.0, ans=0.1 2023-06-23 21:59:25,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1586622.0, ans=0.0 2023-06-23 21:59:49,247 INFO [train.py:996] (1/4) Epoch 9, batch 20500, loss[loss=0.243, simple_loss=0.3078, pruned_loss=0.08917, over 21809.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3212, pruned_loss=0.08669, over 4253182.24 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:00:05,695 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-23 22:00:45,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1586862.0, ans=0.125 2023-06-23 22:00:52,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. limit=6.0 2023-06-23 22:00:57,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1586922.0, ans=0.1 2023-06-23 22:00:58,943 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 5.832e+02 8.906e+02 1.259e+03 2.508e+03, threshold=1.781e+03, percent-clipped=16.0 2023-06-23 22:01:02,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1586922.0, ans=0.2 2023-06-23 22:01:29,787 INFO [train.py:996] (1/4) Epoch 9, batch 20550, loss[loss=0.205, simple_loss=0.2906, pruned_loss=0.05965, over 21644.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3141, pruned_loss=0.08482, over 4245941.79 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:01:59,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587102.0, ans=0.1 2023-06-23 22:02:04,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1587102.0, ans=0.125 2023-06-23 22:03:10,672 INFO [train.py:996] (1/4) Epoch 9, batch 20600, loss[loss=0.2874, simple_loss=0.3552, pruned_loss=0.1098, over 21746.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3163, pruned_loss=0.08291, over 4236083.49 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:03:12,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587342.0, ans=0.1 2023-06-23 22:03:29,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.00 vs. 
limit=15.0 2023-06-23 22:03:50,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587402.0, ans=0.1 2023-06-23 22:04:20,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.756e+02 4.749e+02 5.700e+02 8.605e+02 1.495e+03, threshold=1.140e+03, percent-clipped=0.0 2023-06-23 22:04:51,888 INFO [train.py:996] (1/4) Epoch 9, batch 20650, loss[loss=0.2304, simple_loss=0.301, pruned_loss=0.0799, over 17607.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3125, pruned_loss=0.08317, over 4223070.39 frames. ], batch size: 60, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:05:16,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-23 22:05:54,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1587762.0, ans=0.125 2023-06-23 22:06:21,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1587882.0, ans=0.2 2023-06-23 22:06:32,157 INFO [train.py:996] (1/4) Epoch 9, batch 20700, loss[loss=0.2288, simple_loss=0.3344, pruned_loss=0.06164, over 20078.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3052, pruned_loss=0.08013, over 4235122.74 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:06:47,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1587942.0, ans=0.125 2023-06-23 22:07:21,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1588062.0, ans=0.0 2023-06-23 22:07:31,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1588062.0, ans=0.125 2023-06-23 22:07:42,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1588122.0, ans=0.125 2023-06-23 22:07:44,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.397e+02 7.207e+02 1.144e+03 2.870e+03, threshold=1.441e+03, percent-clipped=25.0 2023-06-23 22:08:20,579 INFO [train.py:996] (1/4) Epoch 9, batch 20750, loss[loss=0.3155, simple_loss=0.4104, pruned_loss=0.1103, over 21655.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3082, pruned_loss=0.07923, over 4235848.48 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:09:03,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1588302.0, ans=0.125 2023-06-23 22:09:20,752 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=15.0 2023-06-23 22:09:27,392 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=15.0 2023-06-23 22:09:38,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1588422.0, ans=0.2 2023-06-23 22:09:44,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1588482.0, ans=0.0 2023-06-23 22:10:00,944 INFO [train.py:996] (1/4) Epoch 9, batch 20800, loss[loss=0.2149, simple_loss=0.28, pruned_loss=0.07488, over 21185.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.313, pruned_loss=0.08074, over 4244513.51 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:10:01,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1588542.0, ans=0.0 2023-06-23 22:10:01,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588542.0, ans=0.125 2023-06-23 22:10:11,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1588542.0, ans=0.125 2023-06-23 22:10:17,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1588542.0, ans=0.0 2023-06-23 22:10:48,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1588662.0, ans=0.0 2023-06-23 22:11:06,864 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.156e+02 7.961e+02 1.123e+03 3.663e+03, threshold=1.592e+03, percent-clipped=17.0 2023-06-23 22:11:07,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1588722.0, ans=0.04949747468305833 2023-06-23 22:11:23,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1588782.0, ans=0.1 2023-06-23 22:11:40,270 INFO [train.py:996] (1/4) Epoch 9, batch 20850, loss[loss=0.1835, simple_loss=0.2578, pruned_loss=0.05462, over 21621.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3039, pruned_loss=0.07827, over 4245830.27 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:11:55,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1588842.0, ans=0.125 2023-06-23 22:12:07,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1588902.0, ans=0.125 2023-06-23 22:13:09,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589082.0, ans=0.1 2023-06-23 22:13:19,732 INFO [train.py:996] (1/4) Epoch 9, batch 20900, loss[loss=0.231, simple_loss=0.3041, pruned_loss=0.07895, over 21851.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3052, pruned_loss=0.07915, over 4257161.86 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:13:40,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1589142.0, ans=0.125 2023-06-23 22:13:44,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-06-23 22:13:52,036 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-23 22:13:52,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1589202.0, ans=0.0 2023-06-23 22:14:23,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.693e+02 5.793e+02 9.224e+02 1.777e+03 3.715e+03, threshold=1.845e+03, percent-clipped=30.0 2023-06-23 22:14:28,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1589322.0, ans=0.125 2023-06-23 22:14:38,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1589382.0, ans=0.125 2023-06-23 22:14:49,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589382.0, ans=0.125 2023-06-23 22:14:49,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1589382.0, ans=0.125 2023-06-23 22:14:51,905 INFO [train.py:996] (1/4) Epoch 9, batch 20950, loss[loss=0.191, simple_loss=0.2684, pruned_loss=0.05675, over 21450.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3016, pruned_loss=0.07599, over 4247063.22 frames. ], batch size: 211, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:15:06,362 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:15:48,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=22.5 2023-06-23 22:16:00,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1589622.0, ans=0.0 2023-06-23 22:16:00,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1589622.0, ans=0.125 2023-06-23 22:16:00,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-23 22:16:05,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1589622.0, ans=0.125 2023-06-23 22:16:14,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1589682.0, ans=0.125 2023-06-23 22:16:29,645 INFO [train.py:996] (1/4) Epoch 9, batch 21000, loss[loss=0.2738, simple_loss=0.3448, pruned_loss=0.1014, over 21877.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3018, pruned_loss=0.07631, over 4254919.24 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:16:29,646 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 22:16:50,149 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2633, simple_loss=0.3613, pruned_loss=0.0826, over 1796401.00 frames. 
2023-06-23 22:16:50,150 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 22:16:55,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1589742.0, ans=0.0 2023-06-23 22:17:29,773 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-23 22:17:39,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1589862.0, ans=0.125 2023-06-23 22:17:51,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 5.898e+02 8.113e+02 1.195e+03 2.501e+03, threshold=1.623e+03, percent-clipped=8.0 2023-06-23 22:18:21,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1589982.0, ans=0.0 2023-06-23 22:18:30,044 INFO [train.py:996] (1/4) Epoch 9, batch 21050, loss[loss=0.1794, simple_loss=0.2407, pruned_loss=0.05903, over 17484.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2989, pruned_loss=0.07633, over 4234196.52 frames. ], batch size: 67, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:18:59,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590102.0, ans=0.1 2023-06-23 22:19:01,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1590102.0, ans=0.125 2023-06-23 22:19:43,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1590282.0, ans=0.0 2023-06-23 22:20:08,641 INFO [train.py:996] (1/4) Epoch 9, batch 21100, loss[loss=0.2272, simple_loss=0.2903, pruned_loss=0.08203, over 21982.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2952, pruned_loss=0.07569, over 4242264.96 frames. ], batch size: 103, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:20:35,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1590402.0, ans=0.1 2023-06-23 22:21:11,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.136e+02 6.651e+02 8.328e+02 1.901e+03, threshold=1.330e+03, percent-clipped=2.0 2023-06-23 22:21:24,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1590522.0, ans=0.0 2023-06-23 22:21:48,082 INFO [train.py:996] (1/4) Epoch 9, batch 21150, loss[loss=0.1915, simple_loss=0.2426, pruned_loss=0.07017, over 20786.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2916, pruned_loss=0.07684, over 4247682.26 frames. ], batch size: 609, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:22:19,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1590702.0, ans=0.125 2023-06-23 22:22:26,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1590762.0, ans=0.125 2023-06-23 22:22:55,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590822.0, ans=0.1 2023-06-23 22:23:26,854 INFO [train.py:996] (1/4) Epoch 9, batch 21200, loss[loss=0.2154, simple_loss=0.2677, pruned_loss=0.08156, over 20299.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2881, pruned_loss=0.07611, over 4252589.31 frames. 
], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:23:49,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1591002.0, ans=0.0 2023-06-23 22:23:59,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1591002.0, ans=0.125 2023-06-23 22:24:00,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1591002.0, ans=0.125 2023-06-23 22:24:09,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1591062.0, ans=0.125 2023-06-23 22:24:22,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-23 22:24:29,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.749e+02 4.861e+02 6.796e+02 9.543e+02 2.010e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 22:24:39,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1591182.0, ans=0.125 2023-06-23 22:25:05,783 INFO [train.py:996] (1/4) Epoch 9, batch 21250, loss[loss=0.3022, simple_loss=0.3893, pruned_loss=0.1075, over 20760.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2868, pruned_loss=0.0766, over 4252462.93 frames. ], batch size: 609, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:25:15,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1591242.0, ans=0.0 2023-06-23 22:25:24,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591242.0, ans=0.1 2023-06-23 22:25:35,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1591302.0, ans=0.125 2023-06-23 22:26:06,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1591422.0, ans=0.1 2023-06-23 22:26:41,385 INFO [train.py:996] (1/4) Epoch 9, batch 21300, loss[loss=0.2118, simple_loss=0.2655, pruned_loss=0.07906, over 16408.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2928, pruned_loss=0.07817, over 4248163.00 frames. ], batch size: 62, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:27:04,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-23 22:27:49,245 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 6.806e+02 9.811e+02 1.401e+03 3.569e+03, threshold=1.962e+03, percent-clipped=29.0 2023-06-23 22:28:02,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1591782.0, ans=0.0 2023-06-23 22:28:24,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1591842.0, ans=0.05 2023-06-23 22:28:25,314 INFO [train.py:996] (1/4) Epoch 9, batch 21350, loss[loss=0.2104, simple_loss=0.3035, pruned_loss=0.05864, over 21738.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2974, pruned_loss=0.07882, over 4259845.25 frames. 
], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:29:07,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=22.5 2023-06-23 22:30:07,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1592082.0, ans=0.125 2023-06-23 22:30:10,719 INFO [train.py:996] (1/4) Epoch 9, batch 21400, loss[loss=0.2427, simple_loss=0.3279, pruned_loss=0.0787, over 21636.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3012, pruned_loss=0.07894, over 4267528.34 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:30:13,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1592142.0, ans=0.125 2023-06-23 22:30:16,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=22.5 2023-06-23 22:30:27,378 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-23 22:30:44,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1592202.0, ans=0.0 2023-06-23 22:30:59,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1592262.0, ans=0.0 2023-06-23 22:30:59,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592262.0, ans=0.1 2023-06-23 22:31:08,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.270e+02 6.886e+02 1.009e+03 2.109e+03, threshold=1.377e+03, percent-clipped=2.0 2023-06-23 22:31:13,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1592322.0, ans=0.2 2023-06-23 22:31:42,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592382.0, ans=0.1 2023-06-23 22:31:49,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1592442.0, ans=0.0 2023-06-23 22:31:50,207 INFO [train.py:996] (1/4) Epoch 9, batch 21450, loss[loss=0.2399, simple_loss=0.302, pruned_loss=0.08889, over 20081.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3046, pruned_loss=0.08016, over 4273715.76 frames. 
], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:32:22,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1592502.0, ans=0.07 2023-06-23 22:32:22,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1592502.0, ans=0.0 2023-06-23 22:32:27,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1592562.0, ans=0.0 2023-06-23 22:32:28,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1592562.0, ans=0.125 2023-06-23 22:33:09,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592622.0, ans=0.1 2023-06-23 22:33:11,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-23 22:33:15,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1592682.0, ans=0.05 2023-06-23 22:33:27,859 INFO [train.py:996] (1/4) Epoch 9, batch 21500, loss[loss=0.2011, simple_loss=0.268, pruned_loss=0.06709, over 21712.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.303, pruned_loss=0.08183, over 4277686.49 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:33:39,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592742.0, ans=0.1 2023-06-23 22:34:16,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1592862.0, ans=0.125 2023-06-23 22:34:29,427 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.721e+02 7.470e+02 9.927e+02 1.833e+03, threshold=1.494e+03, percent-clipped=12.0 2023-06-23 22:34:29,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1592922.0, ans=0.04949747468305833 2023-06-23 22:34:33,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1592922.0, ans=0.2 2023-06-23 22:35:06,824 INFO [train.py:996] (1/4) Epoch 9, batch 21550, loss[loss=0.1732, simple_loss=0.242, pruned_loss=0.05219, over 21360.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2969, pruned_loss=0.07897, over 4273053.46 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:35:07,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1593042.0, ans=0.2 2023-06-23 22:35:12,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1593042.0, ans=0.125 2023-06-23 22:35:12,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1593042.0, ans=0.125 2023-06-23 22:36:12,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-23 22:36:14,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-23 22:36:47,564 INFO [train.py:996] (1/4) Epoch 9, batch 21600, loss[loss=0.2348, simple_loss=0.2928, pruned_loss=0.08844, over 21991.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2925, pruned_loss=0.07766, over 4264933.11 frames. ], batch size: 103, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:37:10,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-23 22:37:19,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1593402.0, ans=0.0 2023-06-23 22:38:04,084 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.659e+02 6.338e+02 9.920e+02 1.459e+03 3.157e+03, threshold=1.984e+03, percent-clipped=22.0 2023-06-23 22:38:28,840 INFO [train.py:996] (1/4) Epoch 9, batch 21650, loss[loss=0.205, simple_loss=0.2876, pruned_loss=0.06124, over 21757.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2978, pruned_loss=0.07658, over 4266050.14 frames. ], batch size: 112, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:38:38,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1593642.0, ans=0.1 2023-06-23 22:40:07,979 INFO [train.py:996] (1/4) Epoch 9, batch 21700, loss[loss=0.2087, simple_loss=0.278, pruned_loss=0.06974, over 21350.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.298, pruned_loss=0.07483, over 4267898.21 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:40:12,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-23 22:41:10,625 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.189e+02 8.394e+02 1.254e+03 2.013e+03, threshold=1.679e+03, percent-clipped=1.0 2023-06-23 22:41:11,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1594122.0, ans=0.2 2023-06-23 22:41:30,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1594182.0, ans=0.125 2023-06-23 22:41:30,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1594182.0, ans=0.0 2023-06-23 22:41:45,186 INFO [train.py:996] (1/4) Epoch 9, batch 21750, loss[loss=0.1983, simple_loss=0.2667, pruned_loss=0.06493, over 21442.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.294, pruned_loss=0.07478, over 4262420.56 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:42:38,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1594362.0, ans=0.95 2023-06-23 22:43:19,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1594482.0, ans=0.0 2023-06-23 22:43:25,531 INFO [train.py:996] (1/4) Epoch 9, batch 21800, loss[loss=0.2403, simple_loss=0.3283, pruned_loss=0.0762, over 21729.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2904, pruned_loss=0.07531, over 4269116.42 frames. ], batch size: 333, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:43:34,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. 
limit=6.0 2023-06-23 22:44:15,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1594662.0, ans=0.1 2023-06-23 22:44:34,102 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.869e+02 5.122e+02 6.776e+02 1.046e+03 2.535e+03, threshold=1.355e+03, percent-clipped=3.0 2023-06-23 22:45:05,016 INFO [train.py:996] (1/4) Epoch 9, batch 21850, loss[loss=0.2468, simple_loss=0.3338, pruned_loss=0.0799, over 21577.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2965, pruned_loss=0.07568, over 4258395.28 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:45:12,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 22:45:40,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1594962.0, ans=0.125 2023-06-23 22:45:42,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-23 22:46:08,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1595022.0, ans=0.0 2023-06-23 22:46:42,709 INFO [train.py:996] (1/4) Epoch 9, batch 21900, loss[loss=0.2255, simple_loss=0.2922, pruned_loss=0.07936, over 21744.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2974, pruned_loss=0.07672, over 4269573.61 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:47:03,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-23 22:47:06,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1595202.0, ans=0.125 2023-06-23 22:47:12,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1595202.0, ans=0.125 2023-06-23 22:47:49,999 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.776e+02 5.560e+02 7.973e+02 1.226e+03 2.341e+03, threshold=1.595e+03, percent-clipped=19.0 2023-06-23 22:47:53,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1595322.0, ans=0.125 2023-06-23 22:48:02,425 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-23 22:48:20,439 INFO [train.py:996] (1/4) Epoch 9, batch 21950, loss[loss=0.1667, simple_loss=0.2568, pruned_loss=0.03826, over 21751.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.292, pruned_loss=0.07585, over 4272582.78 frames. ], batch size: 316, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:48:55,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-23 22:49:16,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1595562.0, ans=0.1 2023-06-23 22:49:49,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. 
limit=22.5 2023-06-23 22:49:59,998 INFO [train.py:996] (1/4) Epoch 9, batch 22000, loss[loss=0.1583, simple_loss=0.2574, pruned_loss=0.02959, over 20814.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2851, pruned_loss=0.0713, over 4262548.14 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:50:14,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-23 22:50:18,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1595802.0, ans=0.0 2023-06-23 22:50:44,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=12.0 2023-06-23 22:51:14,078 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 5.153e+02 7.605e+02 1.162e+03 2.837e+03, threshold=1.521e+03, percent-clipped=11.0 2023-06-23 22:51:25,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1595982.0, ans=0.125 2023-06-23 22:51:40,206 INFO [train.py:996] (1/4) Epoch 9, batch 22050, loss[loss=0.32, simple_loss=0.3963, pruned_loss=0.1219, over 21472.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2895, pruned_loss=0.07322, over 4264019.71 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:52:01,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1596102.0, ans=0.125 2023-06-23 22:52:02,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1596102.0, ans=0.125 2023-06-23 22:52:09,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1596102.0, ans=0.0 2023-06-23 22:52:53,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1596222.0, ans=0.09899494936611666 2023-06-23 22:53:19,878 INFO [train.py:996] (1/4) Epoch 9, batch 22100, loss[loss=0.1861, simple_loss=0.2631, pruned_loss=0.05459, over 16326.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3011, pruned_loss=0.07867, over 4264729.43 frames. ], batch size: 61, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:53:21,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1596342.0, ans=0.125 2023-06-23 22:53:58,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1596402.0, ans=0.125 2023-06-23 22:54:07,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1596462.0, ans=0.125 2023-06-23 22:54:22,629 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-23 22:54:31,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1596522.0, ans=0.0 2023-06-23 22:54:34,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 6.584e+02 8.540e+02 1.234e+03 2.755e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-23 22:54:45,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1596582.0, ans=0.2 2023-06-23 22:54:57,992 INFO [train.py:996] (1/4) Epoch 9, batch 22150, loss[loss=0.2207, simple_loss=0.2979, pruned_loss=0.07172, over 21849.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3046, pruned_loss=0.08084, over 4275788.04 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:55:36,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1596702.0, ans=0.125 2023-06-23 22:55:51,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1596762.0, ans=0.0 2023-06-23 22:56:06,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-23 22:56:19,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-23 22:56:37,901 INFO [train.py:996] (1/4) Epoch 9, batch 22200, loss[loss=0.2829, simple_loss=0.389, pruned_loss=0.08841, over 19936.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3071, pruned_loss=0.08217, over 4274662.78 frames. ], batch size: 702, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:56:54,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1596942.0, ans=0.0 2023-06-23 22:57:08,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1597002.0, ans=0.0 2023-06-23 22:57:24,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597062.0, ans=0.1 2023-06-23 22:57:49,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-23 22:57:54,242 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.796e+02 5.424e+02 7.068e+02 9.828e+02 2.083e+03, threshold=1.414e+03, percent-clipped=7.0 2023-06-23 22:58:02,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1597182.0, ans=0.2 2023-06-23 22:58:16,277 INFO [train.py:996] (1/4) Epoch 9, batch 22250, loss[loss=0.2637, simple_loss=0.3376, pruned_loss=0.09495, over 21356.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3136, pruned_loss=0.0834, over 4269308.85 frames. 
], batch size: 176, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:58:18,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1597242.0, ans=0.2 2023-06-23 22:58:35,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1597242.0, ans=0.0 2023-06-23 22:59:54,583 INFO [train.py:996] (1/4) Epoch 9, batch 22300, loss[loss=0.2399, simple_loss=0.3, pruned_loss=0.08985, over 21677.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3162, pruned_loss=0.08523, over 4267109.05 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:00:52,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.63 vs. limit=10.0 2023-06-23 23:00:53,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1597662.0, ans=0.125 2023-06-23 23:01:01,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1597722.0, ans=0.125 2023-06-23 23:01:10,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.946e+02 8.093e+02 1.234e+03 3.372e+03, threshold=1.619e+03, percent-clipped=19.0 2023-06-23 23:01:20,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1597782.0, ans=0.0 2023-06-23 23:01:33,332 INFO [train.py:996] (1/4) Epoch 9, batch 22350, loss[loss=0.2156, simple_loss=0.2825, pruned_loss=0.07437, over 21672.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3134, pruned_loss=0.08517, over 4275162.78 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:02:06,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1597902.0, ans=0.0 2023-06-23 23:02:37,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1597962.0, ans=0.2 2023-06-23 23:02:59,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1598082.0, ans=0.125 2023-06-23 23:03:06,664 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-23 23:03:22,854 INFO [train.py:996] (1/4) Epoch 9, batch 22400, loss[loss=0.1912, simple_loss=0.2569, pruned_loss=0.0628, over 21776.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3096, pruned_loss=0.08201, over 4278532.17 frames. ], batch size: 102, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:04:03,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598262.0, ans=0.1 2023-06-23 23:04:07,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1598262.0, ans=0.125 2023-06-23 23:04:16,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. 
limit=22.5 2023-06-23 23:04:26,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598322.0, ans=0.125 2023-06-23 23:04:29,516 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.562e+02 4.983e+02 6.853e+02 9.645e+02 2.077e+03, threshold=1.371e+03, percent-clipped=3.0 2023-06-23 23:04:41,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598382.0, ans=0.1 2023-06-23 23:05:00,763 INFO [train.py:996] (1/4) Epoch 9, batch 22450, loss[loss=0.2014, simple_loss=0.2567, pruned_loss=0.07302, over 21350.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3047, pruned_loss=0.08166, over 4271936.19 frames. ], batch size: 177, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:05:11,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-23 23:06:35,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1598682.0, ans=0.0 2023-06-23 23:06:38,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1598742.0, ans=0.2 2023-06-23 23:06:39,881 INFO [train.py:996] (1/4) Epoch 9, batch 22500, loss[loss=0.1963, simple_loss=0.2718, pruned_loss=0.06041, over 21413.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3, pruned_loss=0.08142, over 4263950.11 frames. ], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:07:02,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1598742.0, ans=0.125 2023-06-23 23:07:07,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-23 23:07:25,126 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-06-23 23:07:47,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.895e+02 7.835e+02 1.248e+03 2.629e+03, threshold=1.567e+03, percent-clipped=21.0 2023-06-23 23:07:57,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1598982.0, ans=0.0 2023-06-23 23:08:08,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-23 23:08:18,966 INFO [train.py:996] (1/4) Epoch 9, batch 22550, loss[loss=0.2411, simple_loss=0.3171, pruned_loss=0.08258, over 21509.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3039, pruned_loss=0.08183, over 4272378.87 frames. 
], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:08:51,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1599102.0, ans=0.2 2023-06-23 23:08:56,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1599102.0, ans=0.0 2023-06-23 23:09:43,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1599222.0, ans=0.125 2023-06-23 23:10:06,623 INFO [train.py:996] (1/4) Epoch 9, batch 22600, loss[loss=0.2066, simple_loss=0.2605, pruned_loss=0.0764, over 21196.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3074, pruned_loss=0.08148, over 4274816.47 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:10:26,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1599402.0, ans=0.04949747468305833 2023-06-23 23:10:37,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1599462.0, ans=0.125 2023-06-23 23:10:53,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-23 23:11:02,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1599522.0, ans=0.0 2023-06-23 23:11:13,385 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.856e+02 1.098e+03 1.547e+03 4.006e+03, threshold=2.196e+03, percent-clipped=25.0 2023-06-23 23:11:45,441 INFO [train.py:996] (1/4) Epoch 9, batch 22650, loss[loss=0.2204, simple_loss=0.2802, pruned_loss=0.08032, over 21767.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3051, pruned_loss=0.08145, over 4283707.63 frames. ], batch size: 300, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:11:57,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1599642.0, ans=0.0 2023-06-23 23:12:48,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1599822.0, ans=0.0 2023-06-23 23:13:03,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-23 23:13:12,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1599882.0, ans=0.07 2023-06-23 23:13:18,779 INFO [train.py:996] (1/4) Epoch 9, batch 22700, loss[loss=0.2143, simple_loss=0.2743, pruned_loss=0.07715, over 14846.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2983, pruned_loss=0.08074, over 4281131.05 frames. 
], batch size: 60, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:14:02,333 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:14:26,587 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 5.778e+02 8.238e+02 1.243e+03 2.659e+03, threshold=1.648e+03, percent-clipped=2.0 2023-06-23 23:14:38,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1600182.0, ans=0.125 2023-06-23 23:14:49,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1600182.0, ans=0.125 2023-06-23 23:14:50,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1600182.0, ans=0.125 2023-06-23 23:14:58,275 INFO [train.py:996] (1/4) Epoch 9, batch 22750, loss[loss=0.2249, simple_loss=0.3005, pruned_loss=0.07462, over 21506.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.299, pruned_loss=0.08169, over 4276117.06 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:15:16,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-23 23:15:27,018 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-23 23:15:28,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1600302.0, ans=0.0 2023-06-23 23:16:27,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-23 23:16:28,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1600482.0, ans=0.0 2023-06-23 23:16:37,212 INFO [train.py:996] (1/4) Epoch 9, batch 22800, loss[loss=0.211, simple_loss=0.2802, pruned_loss=0.07086, over 21765.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3046, pruned_loss=0.08445, over 4269997.35 frames. ], batch size: 282, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:17:45,288 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 5.839e+02 8.746e+02 1.348e+03 2.535e+03, threshold=1.749e+03, percent-clipped=13.0 2023-06-23 23:18:03,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1600782.0, ans=0.2 2023-06-23 23:18:14,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1600842.0, ans=0.125 2023-06-23 23:18:15,354 INFO [train.py:996] (1/4) Epoch 9, batch 22850, loss[loss=0.2118, simple_loss=0.274, pruned_loss=0.07478, over 21659.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3022, pruned_loss=0.08385, over 4277671.35 frames. ], batch size: 282, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:18:55,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1600962.0, ans=0.2 2023-06-23 23:19:42,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. 
limit=15.0 2023-06-23 23:19:49,685 INFO [train.py:996] (1/4) Epoch 9, batch 22900, loss[loss=0.2184, simple_loss=0.2977, pruned_loss=0.06952, over 21285.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.303, pruned_loss=0.08298, over 4277874.70 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:20:40,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1601262.0, ans=0.0 2023-06-23 23:20:45,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1601322.0, ans=0.0 2023-06-23 23:20:46,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-23 23:21:02,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 7.597e+02 1.119e+03 1.552e+03 2.740e+03, threshold=2.237e+03, percent-clipped=15.0 2023-06-23 23:21:21,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1601382.0, ans=0.2 2023-06-23 23:21:23,746 INFO [train.py:996] (1/4) Epoch 9, batch 22950, loss[loss=0.251, simple_loss=0.3845, pruned_loss=0.05872, over 21648.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3149, pruned_loss=0.08106, over 4283366.22 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:21:26,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-23 23:21:40,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1601502.0, ans=0.125 2023-06-23 23:21:51,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1601502.0, ans=0.0 2023-06-23 23:23:02,630 INFO [train.py:996] (1/4) Epoch 9, batch 23000, loss[loss=0.2408, simple_loss=0.3098, pruned_loss=0.08592, over 21912.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3161, pruned_loss=0.07903, over 4281655.28 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:23:12,483 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:23:39,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1601802.0, ans=0.125 2023-06-23 23:23:45,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1601862.0, ans=0.125 2023-06-23 23:24:17,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.434e+02 6.875e+02 9.682e+02 1.732e+03, threshold=1.375e+03, percent-clipped=0.0 2023-06-23 23:24:22,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1601982.0, ans=0.125 2023-06-23 23:24:38,065 INFO [train.py:996] (1/4) Epoch 9, batch 23050, loss[loss=0.2312, simple_loss=0.3054, pruned_loss=0.07845, over 21617.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3165, pruned_loss=0.08122, over 4279195.79 frames. 
], batch size: 263, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:25:41,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1602222.0, ans=0.0 2023-06-23 23:25:53,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1602222.0, ans=0.1 2023-06-23 23:26:01,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1602282.0, ans=0.125 2023-06-23 23:26:12,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1602342.0, ans=0.2 2023-06-23 23:26:13,072 INFO [train.py:996] (1/4) Epoch 9, batch 23100, loss[loss=0.218, simple_loss=0.2807, pruned_loss=0.07766, over 21805.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3131, pruned_loss=0.08243, over 4281506.17 frames. ], batch size: 317, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:27:17,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-23 23:27:20,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1602522.0, ans=0.125 2023-06-23 23:27:23,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602522.0, ans=0.1 2023-06-23 23:27:30,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.933e+02 6.265e+02 7.988e+02 9.890e+02 1.959e+03, threshold=1.598e+03, percent-clipped=10.0 2023-06-23 23:27:43,320 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-23 23:27:51,471 INFO [train.py:996] (1/4) Epoch 9, batch 23150, loss[loss=0.2331, simple_loss=0.3107, pruned_loss=0.07774, over 21504.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3091, pruned_loss=0.08226, over 4278461.93 frames. ], batch size: 131, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:28:27,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1602702.0, ans=0.125 2023-06-23 23:28:28,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602702.0, ans=0.1 2023-06-23 23:28:45,219 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-23 23:29:29,538 INFO [train.py:996] (1/4) Epoch 9, batch 23200, loss[loss=0.2116, simple_loss=0.2802, pruned_loss=0.07149, over 21712.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3083, pruned_loss=0.08333, over 4281659.86 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:30:46,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.576e+02 6.977e+02 1.069e+03 2.508e+03, threshold=1.395e+03, percent-clipped=7.0 2023-06-23 23:31:07,222 INFO [train.py:996] (1/4) Epoch 9, batch 23250, loss[loss=0.2615, simple_loss=0.3256, pruned_loss=0.09868, over 21950.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3073, pruned_loss=0.0838, over 4292086.78 frames. 
], batch size: 316, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:31:17,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1603242.0, ans=0.125 2023-06-23 23:31:17,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1603242.0, ans=0.125 2023-06-23 23:31:20,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1603242.0, ans=0.125 2023-06-23 23:31:38,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1603302.0, ans=0.1 2023-06-23 23:32:14,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1603422.0, ans=0.0 2023-06-23 23:32:24,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1603422.0, ans=0.125 2023-06-23 23:32:25,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603422.0, ans=0.1 2023-06-23 23:32:52,348 INFO [train.py:996] (1/4) Epoch 9, batch 23300, loss[loss=0.2358, simple_loss=0.336, pruned_loss=0.06786, over 21298.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3127, pruned_loss=0.08487, over 4298024.18 frames. ], batch size: 176, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:33:15,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1603602.0, ans=0.0 2023-06-23 23:33:44,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1603662.0, ans=0.0 2023-06-23 23:34:08,555 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.193e+02 5.673e+02 7.444e+02 1.083e+03 2.210e+03, threshold=1.489e+03, percent-clipped=13.0 2023-06-23 23:34:37,338 INFO [train.py:996] (1/4) Epoch 9, batch 23350, loss[loss=0.2446, simple_loss=0.3271, pruned_loss=0.08107, over 21476.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3177, pruned_loss=0.08421, over 4296330.66 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:34:55,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1603902.0, ans=0.0 2023-06-23 23:35:12,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1603902.0, ans=0.025 2023-06-23 23:35:22,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1603962.0, ans=0.2 2023-06-23 23:36:11,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-23 23:36:15,593 INFO [train.py:996] (1/4) Epoch 9, batch 23400, loss[loss=0.21, simple_loss=0.2849, pruned_loss=0.0676, over 21918.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3108, pruned_loss=0.08009, over 4283303.23 frames. 
], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:37:00,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1604262.0, ans=0.125 2023-06-23 23:37:32,090 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.577e+02 6.077e+02 8.548e+02 1.177e+03 1.985e+03, threshold=1.710e+03, percent-clipped=13.0 2023-06-23 23:37:50,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1604382.0, ans=0.125 2023-06-23 23:37:53,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-23 23:37:55,559 INFO [train.py:996] (1/4) Epoch 9, batch 23450, loss[loss=0.2359, simple_loss=0.3069, pruned_loss=0.08246, over 21447.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3108, pruned_loss=0.08233, over 4287283.97 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:38:37,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1604562.0, ans=0.125 2023-06-23 23:38:54,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604622.0, ans=0.1 2023-06-23 23:38:59,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1604622.0, ans=0.125 2023-06-23 23:39:31,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0 2023-06-23 23:39:33,086 INFO [train.py:996] (1/4) Epoch 9, batch 23500, loss[loss=0.2429, simple_loss=0.3077, pruned_loss=0.08909, over 21889.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3114, pruned_loss=0.08397, over 4296125.94 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:39:38,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1604742.0, ans=0.2 2023-06-23 23:40:00,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1604802.0, ans=0.125 2023-06-23 23:40:10,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-23 23:40:46,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1604922.0, ans=0.2 2023-06-23 23:40:48,973 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 5.607e+02 7.001e+02 9.691e+02 1.810e+03, threshold=1.400e+03, percent-clipped=1.0 2023-06-23 23:41:00,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1604982.0, ans=0.0 2023-06-23 23:41:11,082 INFO [train.py:996] (1/4) Epoch 9, batch 23550, loss[loss=0.1949, simple_loss=0.2627, pruned_loss=0.06357, over 21690.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3057, pruned_loss=0.0831, over 4300686.30 frames. 
], batch size: 333, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:41:41,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:42:17,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1605222.0, ans=0.125 2023-06-23 23:42:30,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1605222.0, ans=0.125 2023-06-23 23:42:43,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1605282.0, ans=0.1 2023-06-23 23:42:54,902 INFO [train.py:996] (1/4) Epoch 9, batch 23600, loss[loss=0.2755, simple_loss=0.3499, pruned_loss=0.1005, over 21835.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3077, pruned_loss=0.0835, over 4287541.04 frames. ], batch size: 118, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:42:55,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1605342.0, ans=0.125 2023-06-23 23:43:35,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1605462.0, ans=0.0 2023-06-23 23:44:12,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1605522.0, ans=0.05 2023-06-23 23:44:15,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.548e+02 8.580e+02 1.181e+03 2.336e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 23:44:36,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1605582.0, ans=0.2 2023-06-23 23:44:43,207 INFO [train.py:996] (1/4) Epoch 9, batch 23650, loss[loss=0.2549, simple_loss=0.3378, pruned_loss=0.08596, over 21419.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3085, pruned_loss=0.08192, over 4288712.26 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:45:06,939 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-23 23:45:26,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. limit=10.0 2023-06-23 23:45:37,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1605762.0, ans=0.125 2023-06-23 23:45:40,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605822.0, ans=0.125 2023-06-23 23:46:01,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1605882.0, ans=0.125 2023-06-23 23:46:23,158 INFO [train.py:996] (1/4) Epoch 9, batch 23700, loss[loss=0.2068, simple_loss=0.2825, pruned_loss=0.0655, over 21365.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3116, pruned_loss=0.08246, over 4285622.69 frames. 
], batch size: 211, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:46:28,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1605942.0, ans=0.1 2023-06-23 23:47:34,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1606122.0, ans=0.035 2023-06-23 23:47:42,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.64 vs. limit=15.0 2023-06-23 23:47:44,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606122.0, ans=0.0 2023-06-23 23:47:46,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1606122.0, ans=0.0 2023-06-23 23:47:49,316 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 6.365e+02 8.254e+02 1.222e+03 2.661e+03, threshold=1.651e+03, percent-clipped=9.0 2023-06-23 23:47:59,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1606182.0, ans=0.125 2023-06-23 23:48:05,093 INFO [train.py:996] (1/4) Epoch 9, batch 23750, loss[loss=0.2135, simple_loss=0.3134, pruned_loss=0.05683, over 21695.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3137, pruned_loss=0.08264, over 4282135.27 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:48:07,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1606242.0, ans=0.2 2023-06-23 23:48:48,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1606362.0, ans=0.2 2023-06-23 23:49:02,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1606362.0, ans=0.0 2023-06-23 23:49:20,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1606422.0, ans=0.125 2023-06-23 23:49:45,924 INFO [train.py:996] (1/4) Epoch 9, batch 23800, loss[loss=0.217, simple_loss=0.2877, pruned_loss=0.07312, over 21762.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3104, pruned_loss=0.07981, over 4279519.07 frames. ], batch size: 112, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:49:49,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606542.0, ans=0.1 2023-06-23 23:51:11,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.511e+02 9.622e+02 1.496e+03 3.900e+03, threshold=1.924e+03, percent-clipped=16.0 2023-06-23 23:51:33,198 INFO [train.py:996] (1/4) Epoch 9, batch 23850, loss[loss=0.2422, simple_loss=0.3155, pruned_loss=0.08441, over 21335.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3206, pruned_loss=0.08287, over 4277439.60 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:51:44,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1606842.0, ans=0.5 2023-06-23 23:52:50,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.56 vs. 
limit=15.0 2023-06-23 23:53:17,375 INFO [train.py:996] (1/4) Epoch 9, batch 23900, loss[loss=0.227, simple_loss=0.3089, pruned_loss=0.07249, over 21750.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3278, pruned_loss=0.08499, over 4276300.20 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:54:30,542 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.004e+02 6.120e+02 8.437e+02 1.170e+03 2.663e+03, threshold=1.687e+03, percent-clipped=5.0 2023-06-23 23:54:38,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1607382.0, ans=0.125 2023-06-23 23:54:53,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1607382.0, ans=0.125 2023-06-23 23:54:56,281 INFO [train.py:996] (1/4) Epoch 9, batch 23950, loss[loss=0.235, simple_loss=0.2857, pruned_loss=0.09213, over 20062.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3211, pruned_loss=0.08437, over 4272991.64 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:54:58,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607442.0, ans=0.1 2023-06-23 23:55:07,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1607442.0, ans=0.125 2023-06-23 23:55:29,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1607502.0, ans=22.5 2023-06-23 23:55:52,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1607622.0, ans=0.0 2023-06-23 23:56:15,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1607682.0, ans=0.0 2023-06-23 23:56:26,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1607682.0, ans=0.0 2023-06-23 23:56:40,015 INFO [train.py:996] (1/4) Epoch 9, batch 24000, loss[loss=0.2725, simple_loss=0.341, pruned_loss=0.102, over 21362.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3224, pruned_loss=0.08738, over 4272572.55 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:56:40,015 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-23 23:57:00,111 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2698, simple_loss=0.3635, pruned_loss=0.08806, over 1796401.00 frames. 2023-06-23 23:57:00,112 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-23 23:58:19,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 5.711e+02 7.408e+02 1.023e+03 1.952e+03, threshold=1.482e+03, percent-clipped=3.0 2023-06-23 23:58:23,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1607982.0, ans=0.1 2023-06-23 23:58:41,991 INFO [train.py:996] (1/4) Epoch 9, batch 24050, loss[loss=0.2036, simple_loss=0.2867, pruned_loss=0.06025, over 21284.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3225, pruned_loss=0.087, over 4272503.21 frames. 
], batch size: 176, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:58:47,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1608042.0, ans=0.125 2023-06-23 23:58:58,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1608102.0, ans=0.0 2023-06-23 23:59:16,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1608162.0, ans=0.125 2023-06-24 00:00:21,954 INFO [train.py:996] (1/4) Epoch 9, batch 24100, loss[loss=0.275, simple_loss=0.3486, pruned_loss=0.1007, over 21576.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3226, pruned_loss=0.08515, over 4268087.68 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:00:38,919 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=6.105e-03 2023-06-24 00:00:40,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608402.0, ans=0.1 2023-06-24 00:01:34,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1608522.0, ans=0.0 2023-06-24 00:01:39,805 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 6.221e+02 8.690e+02 1.208e+03 2.210e+03, threshold=1.738e+03, percent-clipped=15.0 2023-06-24 00:02:00,852 INFO [train.py:996] (1/4) Epoch 9, batch 24150, loss[loss=0.2881, simple_loss=0.3415, pruned_loss=0.1173, over 21307.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3226, pruned_loss=0.08716, over 4277378.29 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:02:27,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1608702.0, ans=0.0 2023-06-24 00:03:41,759 INFO [train.py:996] (1/4) Epoch 9, batch 24200, loss[loss=0.2587, simple_loss=0.345, pruned_loss=0.08617, over 21738.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3251, pruned_loss=0.08925, over 4277133.09 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:03:42,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1608942.0, ans=0.125 2023-06-24 00:03:46,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-24 00:04:14,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1609002.0, ans=0.125 2023-06-24 00:04:16,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1609002.0, ans=0.125 2023-06-24 00:04:37,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1609062.0, ans=0.025 2023-06-24 00:05:07,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 7.380e+02 9.978e+02 1.387e+03 2.651e+03, threshold=1.996e+03, percent-clipped=13.0 2023-06-24 00:05:22,633 INFO [train.py:996] (1/4) Epoch 9, batch 24250, loss[loss=0.1801, simple_loss=0.2856, pruned_loss=0.03736, over 21651.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3218, pruned_loss=0.08227, over 4277632.42 frames. 
], batch size: 263, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:05:35,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1609242.0, ans=15.0 2023-06-24 00:05:44,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1609242.0, ans=0.0 2023-06-24 00:05:45,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1609242.0, ans=0.125 2023-06-24 00:07:02,886 INFO [train.py:996] (1/4) Epoch 9, batch 24300, loss[loss=0.146, simple_loss=0.232, pruned_loss=0.03001, over 21686.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3155, pruned_loss=0.07689, over 4273487.47 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:08:09,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-24 00:08:28,257 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.714e+02 6.136e+02 8.324e+02 1.263e+03 3.161e+03, threshold=1.665e+03, percent-clipped=12.0 2023-06-24 00:08:37,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-24 00:08:52,380 INFO [train.py:996] (1/4) Epoch 9, batch 24350, loss[loss=0.241, simple_loss=0.3175, pruned_loss=0.08224, over 21511.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3105, pruned_loss=0.0764, over 4278894.50 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:09:01,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1609842.0, ans=0.125 2023-06-24 00:10:28,065 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-24 00:10:34,931 INFO [train.py:996] (1/4) Epoch 9, batch 24400, loss[loss=0.2308, simple_loss=0.3085, pruned_loss=0.07654, over 21889.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3163, pruned_loss=0.07939, over 4276275.33 frames. ], batch size: 373, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:10:35,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-24 00:11:00,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-24 00:11:01,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1610202.0, ans=0.125 2023-06-24 00:11:56,505 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 5.410e+02 6.876e+02 9.194e+02 2.686e+03, threshold=1.375e+03, percent-clipped=10.0 2023-06-24 00:12:11,158 INFO [train.py:996] (1/4) Epoch 9, batch 24450, loss[loss=0.3556, simple_loss=0.4264, pruned_loss=0.1424, over 21438.00 frames. ], tot_loss[loss=0.238, simple_loss=0.316, pruned_loss=0.08, over 4274767.26 frames. 
], batch size: 507, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:12:14,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.73 vs. limit=6.0 2023-06-24 00:12:21,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1610442.0, ans=0.2 2023-06-24 00:12:56,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610562.0, ans=0.1 2023-06-24 00:12:58,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1610562.0, ans=0.5 2023-06-24 00:13:04,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1610562.0, ans=0.0 2023-06-24 00:13:44,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1610682.0, ans=0.5 2023-06-24 00:13:44,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1610682.0, ans=0.2 2023-06-24 00:13:48,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1610682.0, ans=0.0 2023-06-24 00:13:48,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1610682.0, ans=0.0 2023-06-24 00:13:51,193 INFO [train.py:996] (1/4) Epoch 9, batch 24500, loss[loss=0.2641, simple_loss=0.3309, pruned_loss=0.09862, over 21742.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3178, pruned_loss=0.08101, over 4282363.14 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:14:42,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1610862.0, ans=0.09899494936611666 2023-06-24 00:15:13,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1610982.0, ans=0.125 2023-06-24 00:15:14,910 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:15:15,971 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.883e+02 4.941e+02 6.307e+02 8.711e+02 3.165e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-24 00:15:21,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1610982.0, ans=0.0 2023-06-24 00:15:35,234 INFO [train.py:996] (1/4) Epoch 9, batch 24550, loss[loss=0.2536, simple_loss=0.3272, pruned_loss=0.08995, over 21900.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3206, pruned_loss=0.08383, over 4283667.49 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:15:41,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1611042.0, ans=0.125 2023-06-24 00:16:16,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. 
limit=15.0 2023-06-24 00:16:30,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1611162.0, ans=0.0 2023-06-24 00:16:58,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-24 00:17:01,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1611282.0, ans=0.125 2023-06-24 00:17:05,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1611282.0, ans=0.125 2023-06-24 00:17:13,033 INFO [train.py:996] (1/4) Epoch 9, batch 24600, loss[loss=0.2306, simple_loss=0.3021, pruned_loss=0.07952, over 21763.00 frames. ], tot_loss[loss=0.242, simple_loss=0.316, pruned_loss=0.08401, over 4288441.85 frames. ], batch size: 333, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:17:26,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 00:17:39,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1611402.0, ans=0.0 2023-06-24 00:17:45,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1611402.0, ans=0.95 2023-06-24 00:17:54,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611462.0, ans=0.1 2023-06-24 00:18:30,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1611582.0, ans=0.015 2023-06-24 00:18:33,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.043e+02 5.417e+02 8.330e+02 1.065e+03 1.781e+03, threshold=1.666e+03, percent-clipped=13.0 2023-06-24 00:18:40,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611582.0, ans=0.1 2023-06-24 00:18:52,541 INFO [train.py:996] (1/4) Epoch 9, batch 24650, loss[loss=0.2155, simple_loss=0.275, pruned_loss=0.07804, over 21512.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3074, pruned_loss=0.08281, over 4277499.53 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:19:57,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 00:20:03,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1611822.0, ans=0.0 2023-06-24 00:20:32,034 INFO [train.py:996] (1/4) Epoch 9, batch 24700, loss[loss=0.2254, simple_loss=0.2891, pruned_loss=0.08081, over 21229.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3073, pruned_loss=0.08181, over 4261993.25 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:20:46,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1611942.0, ans=0.1 2023-06-24 00:20:54,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. 
limit=15.0 2023-06-24 00:20:58,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1612002.0, ans=0.125 2023-06-24 00:21:50,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1612182.0, ans=0.125 2023-06-24 00:21:50,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1612182.0, ans=0.125 2023-06-24 00:21:53,239 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.774e+02 7.890e+02 1.274e+03 2.911e+03, threshold=1.578e+03, percent-clipped=10.0 2023-06-24 00:22:10,844 INFO [train.py:996] (1/4) Epoch 9, batch 24750, loss[loss=0.244, simple_loss=0.298, pruned_loss=0.09503, over 21435.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2995, pruned_loss=0.07866, over 4258634.86 frames. ], batch size: 509, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:22:25,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1612242.0, ans=0.0 2023-06-24 00:22:40,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1612302.0, ans=0.0 2023-06-24 00:22:44,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1612302.0, ans=0.125 2023-06-24 00:22:46,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1612362.0, ans=0.2 2023-06-24 00:23:11,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1612422.0, ans=0.2 2023-06-24 00:23:38,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1612482.0, ans=0.95 2023-06-24 00:23:44,092 INFO [train.py:996] (1/4) Epoch 9, batch 24800, loss[loss=0.2619, simple_loss=0.3171, pruned_loss=0.1034, over 21474.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2945, pruned_loss=0.07789, over 4262221.78 frames. ], batch size: 212, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:24:01,700 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0 2023-06-24 00:24:19,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-24 00:24:47,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1612722.0, ans=0.125 2023-06-24 00:25:07,477 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.000e+02 9.294e+02 1.511e+03 3.142e+03, threshold=1.859e+03, percent-clipped=19.0 2023-06-24 00:25:22,867 INFO [train.py:996] (1/4) Epoch 9, batch 24850, loss[loss=0.1772, simple_loss=0.2475, pruned_loss=0.05343, over 21863.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2955, pruned_loss=0.07943, over 4267192.20 frames. 
], batch size: 107, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:25:26,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-24 00:25:26,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-24 00:25:53,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612902.0, ans=0.1 2023-06-24 00:27:06,907 INFO [train.py:996] (1/4) Epoch 9, batch 24900, loss[loss=0.2587, simple_loss=0.333, pruned_loss=0.09218, over 21362.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2991, pruned_loss=0.0802, over 4272421.54 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:27:20,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1613142.0, ans=0.0 2023-06-24 00:27:34,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-24 00:28:06,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1613322.0, ans=0.125 2023-06-24 00:28:30,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613382.0, ans=0.1 2023-06-24 00:28:33,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1613382.0, ans=0.0 2023-06-24 00:28:36,561 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.969e+02 8.739e+02 1.291e+03 2.372e+03, threshold=1.748e+03, percent-clipped=6.0 2023-06-24 00:28:48,019 INFO [train.py:996] (1/4) Epoch 9, batch 24950, loss[loss=0.2529, simple_loss=0.3207, pruned_loss=0.09252, over 21464.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.306, pruned_loss=0.08386, over 4267000.75 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:28:57,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1613442.0, ans=0.0 2023-06-24 00:28:57,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-24 00:29:30,864 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:29:51,338 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:30:18,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1613682.0, ans=0.0 2023-06-24 00:30:29,806 INFO [train.py:996] (1/4) Epoch 9, batch 25000, loss[loss=0.2534, simple_loss=0.3222, pruned_loss=0.09226, over 21494.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3116, pruned_loss=0.08518, over 4269822.89 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:30:32,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=22.5 2023-06-24 00:30:37,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=22.5 2023-06-24 00:30:42,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1613742.0, ans=0.125 2023-06-24 00:31:03,324 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:31:57,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 6.297e+02 8.541e+02 1.164e+03 2.225e+03, threshold=1.708e+03, percent-clipped=6.0 2023-06-24 00:32:08,797 INFO [train.py:996] (1/4) Epoch 9, batch 25050, loss[loss=0.1956, simple_loss=0.2645, pruned_loss=0.06331, over 21655.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3049, pruned_loss=0.08356, over 4271355.13 frames. ], batch size: 333, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:32:23,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1614042.0, ans=0.125 2023-06-24 00:32:56,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1614162.0, ans=0.0 2023-06-24 00:33:15,506 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-24 00:33:24,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1614222.0, ans=0.0 2023-06-24 00:33:46,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-24 00:33:50,138 INFO [train.py:996] (1/4) Epoch 9, batch 25100, loss[loss=0.2272, simple_loss=0.2913, pruned_loss=0.08156, over 21599.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3006, pruned_loss=0.0819, over 4264081.73 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:33:53,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1614342.0, ans=0.0 2023-06-24 00:34:07,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1614342.0, ans=10.0 2023-06-24 00:34:07,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1614342.0, ans=0.09899494936611666 2023-06-24 00:35:03,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1614522.0, ans=0.2 2023-06-24 00:35:07,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1614522.0, ans=0.2 2023-06-24 00:35:14,652 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-06-24 00:35:16,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.962e+02 8.437e+02 1.206e+03 2.426e+03, threshold=1.687e+03, percent-clipped=3.0 2023-06-24 00:35:18,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1614582.0, ans=0.125 2023-06-24 00:35:27,818 INFO [train.py:996] (1/4) Epoch 9, batch 25150, loss[loss=0.2386, simple_loss=0.3206, pruned_loss=0.0783, over 21766.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3026, pruned_loss=0.07998, over 4251387.88 frames. ], batch size: 112, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:35:31,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-24 00:35:39,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1614642.0, ans=0.2 2023-06-24 00:36:03,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1614702.0, ans=0.0 2023-06-24 00:36:09,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1614762.0, ans=0.125 2023-06-24 00:36:53,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1614882.0, ans=0.125 2023-06-24 00:37:08,167 INFO [train.py:996] (1/4) Epoch 9, batch 25200, loss[loss=0.2177, simple_loss=0.3087, pruned_loss=0.06328, over 21678.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3029, pruned_loss=0.07775, over 4242427.52 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:38:24,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1615122.0, ans=0.0 2023-06-24 00:38:30,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1615182.0, ans=0.125 2023-06-24 00:38:35,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 5.246e+02 7.409e+02 1.372e+03 3.913e+03, threshold=1.482e+03, percent-clipped=20.0 2023-06-24 00:38:46,420 INFO [train.py:996] (1/4) Epoch 9, batch 25250, loss[loss=0.2226, simple_loss=0.2866, pruned_loss=0.07931, over 21779.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3019, pruned_loss=0.07709, over 4251296.89 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:38:51,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1615242.0, ans=0.2 2023-06-24 00:38:59,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1615242.0, ans=0.1 2023-06-24 00:39:04,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1615302.0, ans=0.125 2023-06-24 00:39:20,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615362.0, ans=0.125 2023-06-24 00:40:24,976 INFO [train.py:996] (1/4) Epoch 9, batch 25300, loss[loss=0.2439, simple_loss=0.301, pruned_loss=0.09345, over 20110.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2994, pruned_loss=0.07715, over 4247788.13 frames. 
], batch size: 703, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:40:40,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1615602.0, ans=0.125 2023-06-24 00:40:46,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1615602.0, ans=10.0 2023-06-24 00:41:31,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1615722.0, ans=0.125 2023-06-24 00:41:55,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.408e+02 6.402e+02 8.209e+02 1.215e+03 2.497e+03, threshold=1.642e+03, percent-clipped=20.0 2023-06-24 00:41:55,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1615782.0, ans=0.125 2023-06-24 00:42:04,771 INFO [train.py:996] (1/4) Epoch 9, batch 25350, loss[loss=0.1848, simple_loss=0.2835, pruned_loss=0.04303, over 21748.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3018, pruned_loss=0.07659, over 4235033.54 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:42:54,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1615962.0, ans=0.0 2023-06-24 00:42:56,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-24 00:43:42,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1616082.0, ans=0.125 2023-06-24 00:43:44,793 INFO [train.py:996] (1/4) Epoch 9, batch 25400, loss[loss=0.2282, simple_loss=0.2881, pruned_loss=0.08413, over 21604.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2992, pruned_loss=0.07612, over 4238487.98 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:45:11,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1616382.0, ans=0.125 2023-06-24 00:45:17,980 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 5.867e+02 9.461e+02 1.414e+03 2.497e+03, threshold=1.892e+03, percent-clipped=14.0 2023-06-24 00:45:27,907 INFO [train.py:996] (1/4) Epoch 9, batch 25450, loss[loss=0.2434, simple_loss=0.3408, pruned_loss=0.07297, over 21774.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2991, pruned_loss=0.07736, over 4244159.84 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:45:30,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1616442.0, ans=0.125 2023-06-24 00:45:33,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1616442.0, ans=0.0 2023-06-24 00:46:22,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1616562.0, ans=0.125 2023-06-24 00:46:45,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1616622.0, ans=10.0 2023-06-24 00:47:04,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. 
limit=10.0 2023-06-24 00:47:04,832 INFO [train.py:996] (1/4) Epoch 9, batch 25500, loss[loss=0.2231, simple_loss=0.2995, pruned_loss=0.07338, over 21306.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2998, pruned_loss=0.07469, over 4249007.67 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:47:16,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1616742.0, ans=0.0 2023-06-24 00:47:43,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1616802.0, ans=0.125 2023-06-24 00:48:35,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1616982.0, ans=0.2 2023-06-24 00:48:39,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.898e+02 7.192e+02 1.024e+03 1.607e+03, threshold=1.438e+03, percent-clipped=0.0 2023-06-24 00:48:48,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1617042.0, ans=0.125 2023-06-24 00:48:49,473 INFO [train.py:996] (1/4) Epoch 9, batch 25550, loss[loss=0.2592, simple_loss=0.3621, pruned_loss=0.07815, over 21528.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.307, pruned_loss=0.07473, over 4251359.37 frames. ], batch size: 471, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:49:46,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1617162.0, ans=0.125 2023-06-24 00:50:15,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-24 00:50:34,409 INFO [train.py:996] (1/4) Epoch 9, batch 25600, loss[loss=0.3304, simple_loss=0.3824, pruned_loss=0.1392, over 21445.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3127, pruned_loss=0.07623, over 4264930.58 frames. ], batch size: 471, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:50:36,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1617342.0, ans=0.125 2023-06-24 00:50:57,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1617402.0, ans=0.07 2023-06-24 00:51:05,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1617402.0, ans=0.125 2023-06-24 00:51:25,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-24 00:51:59,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 7.503e+02 1.087e+03 1.475e+03 2.223e+03, threshold=2.175e+03, percent-clipped=27.0 2023-06-24 00:52:13,792 INFO [train.py:996] (1/4) Epoch 9, batch 25650, loss[loss=0.2204, simple_loss=0.2794, pruned_loss=0.08066, over 21674.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3126, pruned_loss=0.0786, over 4257587.89 frames. 
], batch size: 247, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:52:54,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1617702.0, ans=0.1 2023-06-24 00:53:01,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1617762.0, ans=0.125 2023-06-24 00:53:51,683 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 00:53:54,160 INFO [train.py:996] (1/4) Epoch 9, batch 25700, loss[loss=0.2331, simple_loss=0.3081, pruned_loss=0.07901, over 21489.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3103, pruned_loss=0.08013, over 4259592.11 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:54:01,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1617942.0, ans=0.125 2023-06-24 00:54:07,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1617942.0, ans=0.125 2023-06-24 00:54:37,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1618002.0, ans=0.0 2023-06-24 00:54:40,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1618062.0, ans=0.1 2023-06-24 00:54:43,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618062.0, ans=0.1 2023-06-24 00:55:04,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-24 00:55:24,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.568e+02 8.876e+02 1.245e+03 3.057e+03, threshold=1.775e+03, percent-clipped=5.0 2023-06-24 00:55:35,288 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:55:39,600 INFO [train.py:996] (1/4) Epoch 9, batch 25750, loss[loss=0.2989, simple_loss=0.3734, pruned_loss=0.1122, over 21602.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3132, pruned_loss=0.0819, over 4268449.15 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:56:00,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1618242.0, ans=0.125 2023-06-24 00:56:05,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618302.0, ans=0.1 2023-06-24 00:56:06,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-24 00:56:25,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1618362.0, ans=0.125 2023-06-24 00:57:33,108 INFO [train.py:996] (1/4) Epoch 9, batch 25800, loss[loss=0.2662, simple_loss=0.3389, pruned_loss=0.09674, over 21931.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3273, pruned_loss=0.08719, over 4263423.91 frames. 
], batch size: 316, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:58:05,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618662.0, ans=0.1 2023-06-24 00:58:31,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1618722.0, ans=0.125 2023-06-24 00:59:05,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 6.884e+02 9.421e+02 1.458e+03 3.090e+03, threshold=1.884e+03, percent-clipped=11.0 2023-06-24 00:59:14,589 INFO [train.py:996] (1/4) Epoch 9, batch 25850, loss[loss=0.2713, simple_loss=0.3416, pruned_loss=0.1005, over 21763.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3282, pruned_loss=0.08697, over 4272497.90 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:59:21,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.82 vs. limit=22.5 2023-06-24 01:00:15,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1618962.0, ans=0.0 2023-06-24 01:00:56,729 INFO [train.py:996] (1/4) Epoch 9, batch 25900, loss[loss=0.2339, simple_loss=0.3265, pruned_loss=0.07071, over 21563.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3296, pruned_loss=0.08797, over 4281839.54 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:01:31,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-24 01:02:14,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1619322.0, ans=0.035 2023-06-24 01:02:27,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.264e+02 6.970e+02 9.529e+02 1.427e+03 2.797e+03, threshold=1.906e+03, percent-clipped=4.0 2023-06-24 01:02:37,422 INFO [train.py:996] (1/4) Epoch 9, batch 25950, loss[loss=0.2725, simple_loss=0.346, pruned_loss=0.09949, over 21579.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.337, pruned_loss=0.09116, over 4285202.86 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:03:07,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1619502.0, ans=0.1 2023-06-24 01:03:07,587 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:03:26,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1619562.0, ans=0.2 2023-06-24 01:03:39,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1619562.0, ans=0.125 2023-06-24 01:03:44,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1619622.0, ans=0.2 2023-06-24 01:04:13,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-06-24 01:04:21,701 INFO [train.py:996] (1/4) Epoch 9, batch 26000, loss[loss=0.2648, simple_loss=0.3438, pruned_loss=0.09295, over 21685.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3336, pruned_loss=0.08856, over 4280998.29 frames. 
], batch size: 351, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:04:29,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1619742.0, ans=0.2 2023-06-24 01:04:29,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-24 01:04:58,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1619802.0, ans=0.1 2023-06-24 01:05:42,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1619982.0, ans=0.125 2023-06-24 01:05:49,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.394e+02 5.949e+02 7.869e+02 1.155e+03 1.920e+03, threshold=1.574e+03, percent-clipped=1.0 2023-06-24 01:06:01,910 INFO [train.py:996] (1/4) Epoch 9, batch 26050, loss[loss=0.2399, simple_loss=0.3097, pruned_loss=0.085, over 21839.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3339, pruned_loss=0.08949, over 4274409.87 frames. ], batch size: 107, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:06:40,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1620102.0, ans=0.125 2023-06-24 01:06:48,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-24 01:07:28,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620282.0, ans=0.1 2023-06-24 01:07:42,030 INFO [train.py:996] (1/4) Epoch 9, batch 26100, loss[loss=0.2169, simple_loss=0.2789, pruned_loss=0.07745, over 21240.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3283, pruned_loss=0.08804, over 4273991.41 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:07:45,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1620342.0, ans=0.125 2023-06-24 01:08:41,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1620462.0, ans=0.0 2023-06-24 01:08:57,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1620522.0, ans=0.035 2023-06-24 01:09:14,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.600e+02 7.578e+02 1.132e+03 2.519e+03, threshold=1.516e+03, percent-clipped=12.0 2023-06-24 01:09:18,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1620582.0, ans=0.0 2023-06-24 01:09:27,695 INFO [train.py:996] (1/4) Epoch 9, batch 26150, loss[loss=0.2538, simple_loss=0.3287, pruned_loss=0.0894, over 21563.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3259, pruned_loss=0.08782, over 4276622.08 frames. 
], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:10:36,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620822.0, ans=0.1 2023-06-24 01:10:45,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1620882.0, ans=0.125 2023-06-24 01:10:50,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1620882.0, ans=0.125 2023-06-24 01:11:08,848 INFO [train.py:996] (1/4) Epoch 9, batch 26200, loss[loss=0.2302, simple_loss=0.3127, pruned_loss=0.07385, over 20644.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3266, pruned_loss=0.08605, over 4277767.73 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:11:23,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1620942.0, ans=0.0 2023-06-24 01:11:26,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1620942.0, ans=0.125 2023-06-24 01:11:32,382 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-24 01:12:04,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-24 01:12:40,881 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.187e+02 6.061e+02 7.931e+02 1.081e+03 1.881e+03, threshold=1.586e+03, percent-clipped=8.0 2023-06-24 01:12:48,764 INFO [train.py:996] (1/4) Epoch 9, batch 26250, loss[loss=0.2754, simple_loss=0.355, pruned_loss=0.09788, over 21807.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3301, pruned_loss=0.08449, over 4279091.72 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:12:57,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1621242.0, ans=0.5 2023-06-24 01:13:01,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1621242.0, ans=0.1 2023-06-24 01:13:04,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1621242.0, ans=0.125 2023-06-24 01:13:32,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1621362.0, ans=0.1 2023-06-24 01:13:34,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.37 vs. limit=22.5 2023-06-24 01:13:35,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1621362.0, ans=0.0 2023-06-24 01:14:27,397 INFO [train.py:996] (1/4) Epoch 9, batch 26300, loss[loss=0.2539, simple_loss=0.3187, pruned_loss=0.09453, over 21505.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3273, pruned_loss=0.08614, over 4285769.71 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:14:46,044 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-06-24 01:16:00,715 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 5.501e+02 7.809e+02 1.118e+03 2.350e+03, threshold=1.562e+03, percent-clipped=10.0 2023-06-24 01:16:08,913 INFO [train.py:996] (1/4) Epoch 9, batch 26350, loss[loss=0.2103, simple_loss=0.296, pruned_loss=0.0623, over 20790.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3251, pruned_loss=0.08692, over 4288489.69 frames. ], batch size: 607, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:16:29,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1621902.0, ans=0.0 2023-06-24 01:16:33,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-24 01:17:17,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1622022.0, ans=0.0 2023-06-24 01:17:50,257 INFO [train.py:996] (1/4) Epoch 9, batch 26400, loss[loss=0.2632, simple_loss=0.3128, pruned_loss=0.1068, over 21997.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3196, pruned_loss=0.08703, over 4282549.25 frames. ], batch size: 103, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:18:13,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1622202.0, ans=0.125 2023-06-24 01:19:17,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1622382.0, ans=0.0 2023-06-24 01:19:27,020 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 6.613e+02 9.874e+02 1.376e+03 2.693e+03, threshold=1.975e+03, percent-clipped=17.0 2023-06-24 01:19:30,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1622382.0, ans=0.0 2023-06-24 01:19:32,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622442.0, ans=0.1 2023-06-24 01:19:33,664 INFO [train.py:996] (1/4) Epoch 9, batch 26450, loss[loss=0.277, simple_loss=0.3996, pruned_loss=0.0772, over 21165.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3197, pruned_loss=0.08678, over 4284251.70 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:19:39,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1622442.0, ans=0.1 2023-06-24 01:19:48,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1622442.0, ans=0.125 2023-06-24 01:19:50,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1622502.0, ans=0.1 2023-06-24 01:20:41,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1622622.0, ans=0.125 2023-06-24 01:20:47,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1622622.0, ans=0.2 2023-06-24 01:21:00,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1622682.0, ans=0.1 2023-06-24 01:21:01,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.12 vs. 
limit=10.0 2023-06-24 01:21:08,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1622682.0, ans=0.2 2023-06-24 01:21:16,745 INFO [train.py:996] (1/4) Epoch 9, batch 26500, loss[loss=0.2441, simple_loss=0.3314, pruned_loss=0.07838, over 21679.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3222, pruned_loss=0.08492, over 4281768.17 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:21:21,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-24 01:21:33,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1622742.0, ans=0.1 2023-06-24 01:22:15,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1622862.0, ans=0.125 2023-06-24 01:22:23,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1622862.0, ans=0.04949747468305833 2023-06-24 01:22:34,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=35.17 vs. limit=15.0 2023-06-24 01:22:34,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-24 01:22:59,836 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 8.000e+02 1.204e+03 2.180e+03 3.765e+03, threshold=2.409e+03, percent-clipped=29.0 2023-06-24 01:23:05,413 INFO [train.py:996] (1/4) Epoch 9, batch 26550, loss[loss=0.1784, simple_loss=0.2495, pruned_loss=0.05361, over 21282.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.319, pruned_loss=0.08235, over 4262904.07 frames. ], batch size: 176, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:24:05,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1623162.0, ans=0.0 2023-06-24 01:24:31,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1623282.0, ans=0.125 2023-06-24 01:24:55,381 INFO [train.py:996] (1/4) Epoch 9, batch 26600, loss[loss=0.2334, simple_loss=0.2996, pruned_loss=0.08363, over 20076.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3185, pruned_loss=0.07933, over 4255617.49 frames. 
], batch size: 703, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:25:19,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1623402.0, ans=0.05 2023-06-24 01:25:51,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1623522.0, ans=0.125 2023-06-24 01:25:59,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1623522.0, ans=0.125 2023-06-24 01:26:10,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1623582.0, ans=0.035 2023-06-24 01:26:10,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1623582.0, ans=0.0 2023-06-24 01:26:33,772 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.096e+02 6.459e+02 8.360e+02 2.532e+03, threshold=1.292e+03, percent-clipped=3.0 2023-06-24 01:26:38,315 INFO [train.py:996] (1/4) Epoch 9, batch 26650, loss[loss=0.1675, simple_loss=0.2633, pruned_loss=0.03584, over 21669.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.31, pruned_loss=0.07722, over 4252541.29 frames. ], batch size: 391, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:26:52,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1623642.0, ans=0.2 2023-06-24 01:27:00,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-24 01:27:01,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1623702.0, ans=0.125 2023-06-24 01:28:17,070 INFO [train.py:996] (1/4) Epoch 9, batch 26700, loss[loss=0.2372, simple_loss=0.3114, pruned_loss=0.08146, over 21883.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3021, pruned_loss=0.07381, over 4262805.59 frames. ], batch size: 107, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:29:25,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1624122.0, ans=0.2 2023-06-24 01:29:26,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1624122.0, ans=0.125 2023-06-24 01:29:55,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1624182.0, ans=0.125 2023-06-24 01:29:58,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.138e+02 5.000e+02 7.039e+02 1.019e+03 2.446e+03, threshold=1.408e+03, percent-clipped=9.0 2023-06-24 01:30:03,542 INFO [train.py:996] (1/4) Epoch 9, batch 26750, loss[loss=0.2562, simple_loss=0.3339, pruned_loss=0.08923, over 21566.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3026, pruned_loss=0.07329, over 4265886.56 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 8.0 2023-06-24 01:30:50,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. 
limit=22.5 2023-06-24 01:31:03,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1624422.0, ans=0.025 2023-06-24 01:31:35,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-24 01:31:41,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1624482.0, ans=0.2 2023-06-24 01:31:41,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1624482.0, ans=0.125 2023-06-24 01:31:44,748 INFO [train.py:996] (1/4) Epoch 9, batch 26800, loss[loss=0.2786, simple_loss=0.3504, pruned_loss=0.1034, over 21544.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.31, pruned_loss=0.07755, over 4262306.85 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:31:52,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1624542.0, ans=0.125 2023-06-24 01:31:55,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1624542.0, ans=0.04949747468305833 2023-06-24 01:32:18,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-24 01:33:09,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-24 01:33:19,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.341e+02 8.835e+02 1.201e+03 2.832e+03, threshold=1.767e+03, percent-clipped=16.0 2023-06-24 01:33:23,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1624842.0, ans=0.0 2023-06-24 01:33:24,524 INFO [train.py:996] (1/4) Epoch 9, batch 26850, loss[loss=0.233, simple_loss=0.2941, pruned_loss=0.08596, over 21798.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3108, pruned_loss=0.08039, over 4259022.68 frames. ], batch size: 124, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:34:03,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1624962.0, ans=0.125 2023-06-24 01:34:26,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1624962.0, ans=0.0 2023-06-24 01:34:26,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1624962.0, ans=0.0 2023-06-24 01:34:43,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1625022.0, ans=0.0 2023-06-24 01:34:45,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1625022.0, ans=0.0 2023-06-24 01:34:56,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1625082.0, ans=0.025 2023-06-24 01:35:07,324 INFO [train.py:996] (1/4) Epoch 9, batch 26900, loss[loss=0.1897, simple_loss=0.2477, pruned_loss=0.06583, over 21604.00 frames. 
], tot_loss[loss=0.2305, simple_loss=0.3025, pruned_loss=0.07923, over 4260986.32 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:35:49,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1625262.0, ans=0.125 2023-06-24 01:36:15,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1625322.0, ans=0.04949747468305833 2023-06-24 01:36:37,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.702e+02 7.755e+02 1.168e+03 3.900e+03, threshold=1.551e+03, percent-clipped=4.0 2023-06-24 01:36:41,782 INFO [train.py:996] (1/4) Epoch 9, batch 26950, loss[loss=0.3169, simple_loss=0.3921, pruned_loss=0.1208, over 21466.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3019, pruned_loss=0.07911, over 4261845.50 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:17,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1625682.0, ans=0.125 2023-06-24 01:38:23,735 INFO [train.py:996] (1/4) Epoch 9, batch 27000, loss[loss=0.1864, simple_loss=0.2667, pruned_loss=0.0531, over 21517.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3034, pruned_loss=0.07782, over 4264208.66 frames. ], batch size: 212, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:23,736 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 01:38:35,364 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.6728, 4.0517, 4.1881, 4.4352], device='cuda:1') 2023-06-24 01:38:43,002 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2397, simple_loss=0.3375, pruned_loss=0.07102, over 1796401.00 frames. 2023-06-24 01:38:43,003 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 01:38:47,616 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-24 01:38:48,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625742.0, ans=0.125 2023-06-24 01:39:05,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1625802.0, ans=0.125 2023-06-24 01:39:59,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625922.0, ans=0.1 2023-06-24 01:40:09,410 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-24 01:40:18,516 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.629e+02 9.070e+02 1.221e+03 2.568e+03, threshold=1.814e+03, percent-clipped=16.0 2023-06-24 01:40:21,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-24 01:40:23,210 INFO [train.py:996] (1/4) Epoch 9, batch 27050, loss[loss=0.2292, simple_loss=0.3111, pruned_loss=0.07365, over 21747.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3058, pruned_loss=0.07572, over 4266992.57 frames. 
], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:40:43,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1626102.0, ans=0.125 2023-06-24 01:40:57,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1626102.0, ans=0.2 2023-06-24 01:41:17,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1626162.0, ans=0.95 2023-06-24 01:41:26,226 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-24 01:41:55,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1626282.0, ans=0.0 2023-06-24 01:42:04,843 INFO [train.py:996] (1/4) Epoch 9, batch 27100, loss[loss=0.2112, simple_loss=0.3067, pruned_loss=0.0578, over 21370.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3072, pruned_loss=0.07669, over 4275256.89 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:43:42,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.346e+02 6.138e+02 8.870e+02 1.301e+03 2.299e+03, threshold=1.774e+03, percent-clipped=4.0 2023-06-24 01:43:47,602 INFO [train.py:996] (1/4) Epoch 9, batch 27150, loss[loss=0.2245, simple_loss=0.3076, pruned_loss=0.07074, over 21308.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3191, pruned_loss=0.07988, over 4277919.56 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:43:48,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1626642.0, ans=0.0 2023-06-24 01:44:23,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1626702.0, ans=0.1 2023-06-24 01:45:32,218 INFO [train.py:996] (1/4) Epoch 9, batch 27200, loss[loss=0.2558, simple_loss=0.3377, pruned_loss=0.08695, over 21283.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.327, pruned_loss=0.08299, over 4277188.33 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:45:46,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1626942.0, ans=0.07 2023-06-24 01:46:03,890 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-24 01:46:04,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1627002.0, ans=0.125 2023-06-24 01:46:09,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1627002.0, ans=0.125 2023-06-24 01:46:09,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1627002.0, ans=0.07 2023-06-24 01:46:32,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1627062.0, ans=0.125 2023-06-24 01:46:37,845 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. 
limit=12.0 2023-06-24 01:46:46,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 01:47:17,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.892e+02 9.906e+02 1.306e+03 3.118e+03, threshold=1.981e+03, percent-clipped=15.0 2023-06-24 01:47:22,725 INFO [train.py:996] (1/4) Epoch 9, batch 27250, loss[loss=0.2753, simple_loss=0.3399, pruned_loss=0.1054, over 21724.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.328, pruned_loss=0.08592, over 4272506.04 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:47:25,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1627242.0, ans=0.125 2023-06-24 01:47:41,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-24 01:47:49,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1627302.0, ans=0.125 2023-06-24 01:49:03,894 INFO [train.py:996] (1/4) Epoch 9, batch 27300, loss[loss=0.2543, simple_loss=0.3519, pruned_loss=0.07836, over 21711.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3316, pruned_loss=0.08756, over 4277399.15 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:49:31,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1627602.0, ans=0.1 2023-06-24 01:50:18,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1627722.0, ans=0.0 2023-06-24 01:50:23,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1627722.0, ans=0.125 2023-06-24 01:50:23,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1627722.0, ans=0.07 2023-06-24 01:50:40,414 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 5.255e+02 6.671e+02 8.856e+02 1.687e+03, threshold=1.334e+03, percent-clipped=0.0 2023-06-24 01:50:43,367 INFO [train.py:996] (1/4) Epoch 9, batch 27350, loss[loss=0.2447, simple_loss=0.3269, pruned_loss=0.08126, over 21255.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3337, pruned_loss=0.08835, over 4280244.80 frames. 
], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:51:02,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1627902.0, ans=0.2 2023-06-24 01:51:52,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1628022.0, ans=10.0 2023-06-24 01:52:00,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1628022.0, ans=0.2 2023-06-24 01:52:09,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1628082.0, ans=0.2 2023-06-24 01:52:20,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1628142.0, ans=0.125 2023-06-24 01:52:21,422 INFO [train.py:996] (1/4) Epoch 9, batch 27400, loss[loss=0.2171, simple_loss=0.2869, pruned_loss=0.07367, over 21652.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.328, pruned_loss=0.08744, over 4283268.31 frames. ], batch size: 391, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:52:21,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1628142.0, ans=0.125 2023-06-24 01:52:22,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1628142.0, ans=0.07 2023-06-24 01:52:34,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1628142.0, ans=0.07 2023-06-24 01:52:59,662 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=12.0 2023-06-24 01:53:05,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1628262.0, ans=0.125 2023-06-24 01:53:09,161 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.31 vs. limit=6.0 2023-06-24 01:53:58,836 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.992e+02 5.150e+02 6.408e+02 1.007e+03 1.892e+03, threshold=1.282e+03, percent-clipped=7.0 2023-06-24 01:54:01,983 INFO [train.py:996] (1/4) Epoch 9, batch 27450, loss[loss=0.2256, simple_loss=0.3083, pruned_loss=0.07152, over 20702.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3217, pruned_loss=0.08565, over 4281629.99 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:55:39,278 INFO [train.py:996] (1/4) Epoch 9, batch 27500, loss[loss=0.2351, simple_loss=0.306, pruned_loss=0.0821, over 21902.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3207, pruned_loss=0.08583, over 4286875.18 frames. ], batch size: 351, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:55:43,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. 
limit=22.5 2023-06-24 01:56:39,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1628862.0, ans=0.125 2023-06-24 01:56:58,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1628982.0, ans=0.0 2023-06-24 01:57:09,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.57 vs. limit=22.5 2023-06-24 01:57:11,088 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.736e+02 5.167e+02 6.606e+02 9.536e+02 1.970e+03, threshold=1.321e+03, percent-clipped=8.0 2023-06-24 01:57:11,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1628982.0, ans=0.0 2023-06-24 01:57:18,526 INFO [train.py:996] (1/4) Epoch 9, batch 27550, loss[loss=0.1989, simple_loss=0.2661, pruned_loss=0.06591, over 21456.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3145, pruned_loss=0.08208, over 4291302.04 frames. ], batch size: 194, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:58:35,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1629282.0, ans=0.2 2023-06-24 01:58:56,723 INFO [train.py:996] (1/4) Epoch 9, batch 27600, loss[loss=0.2275, simple_loss=0.2895, pruned_loss=0.08275, over 21293.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3098, pruned_loss=0.08184, over 4280439.59 frames. ], batch size: 144, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:59:15,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-24 02:00:09,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1629522.0, ans=0.2 2023-06-24 02:00:27,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 6.454e+02 1.076e+03 1.748e+03 4.788e+03, threshold=2.152e+03, percent-clipped=40.0 2023-06-24 02:00:31,102 INFO [train.py:996] (1/4) Epoch 9, batch 27650, loss[loss=0.181, simple_loss=0.2472, pruned_loss=0.05745, over 21041.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3041, pruned_loss=0.0813, over 4285407.26 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:00:34,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-24 02:00:54,794 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:01:41,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1629822.0, ans=0.125 2023-06-24 02:01:56,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=22.5 2023-06-24 02:02:14,158 INFO [train.py:996] (1/4) Epoch 9, batch 27700, loss[loss=0.2855, simple_loss=0.3676, pruned_loss=0.1017, over 21307.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3042, pruned_loss=0.07895, over 4283983.38 frames. 
], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:02:17,816 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:02:49,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1630002.0, ans=0.0 2023-06-24 02:03:06,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630062.0, ans=0.0 2023-06-24 02:03:19,643 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1630122.0, ans=0.125 2023-06-24 02:03:21,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1630122.0, ans=0.125 2023-06-24 02:03:22,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630122.0, ans=0.1 2023-06-24 02:03:53,634 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.018e+02 9.406e+02 1.396e+03 2.807e+03, threshold=1.881e+03, percent-clipped=3.0 2023-06-24 02:03:55,126 INFO [train.py:996] (1/4) Epoch 9, batch 27750, loss[loss=0.2299, simple_loss=0.3409, pruned_loss=0.0595, over 20860.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.309, pruned_loss=0.07909, over 4287266.48 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:04:11,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1630302.0, ans=0.0 2023-06-24 02:04:26,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1630302.0, ans=10.0 2023-06-24 02:04:32,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1630302.0, ans=0.1 2023-06-24 02:04:59,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1630422.0, ans=0.125 2023-06-24 02:05:02,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-24 02:05:12,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-24 02:05:28,041 INFO [train.py:996] (1/4) Epoch 9, batch 27800, loss[loss=0.215, simple_loss=0.2866, pruned_loss=0.0717, over 21893.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3082, pruned_loss=0.07987, over 4288888.87 frames. 
], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:05:37,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1630542.0, ans=0.1 2023-06-24 02:05:42,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1630542.0, ans=15.0 2023-06-24 02:06:07,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1630602.0, ans=0.125 2023-06-24 02:06:25,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1630662.0, ans=0.125 2023-06-24 02:06:55,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1630782.0, ans=0.125 2023-06-24 02:07:10,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.786e+02 6.303e+02 8.930e+02 1.307e+03 2.305e+03, threshold=1.786e+03, percent-clipped=6.0 2023-06-24 02:07:12,468 INFO [train.py:996] (1/4) Epoch 9, batch 27850, loss[loss=0.2612, simple_loss=0.323, pruned_loss=0.09972, over 21867.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3074, pruned_loss=0.08144, over 4287457.17 frames. ], batch size: 118, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:07:53,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-24 02:08:49,388 INFO [train.py:996] (1/4) Epoch 9, batch 27900, loss[loss=0.2516, simple_loss=0.3444, pruned_loss=0.0794, over 21793.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3138, pruned_loss=0.0817, over 4287565.40 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:09:24,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1631202.0, ans=0.5 2023-06-24 02:09:46,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1631262.0, ans=0.125 2023-06-24 02:09:47,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-24 02:10:10,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-24 02:10:44,057 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 5.823e+02 7.958e+02 1.181e+03 2.526e+03, threshold=1.592e+03, percent-clipped=8.0 2023-06-24 02:10:45,601 INFO [train.py:996] (1/4) Epoch 9, batch 27950, loss[loss=0.1753, simple_loss=0.2405, pruned_loss=0.05501, over 17080.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3133, pruned_loss=0.07836, over 4276380.03 frames. 
], batch size: 61, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:11:15,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1631502.0, ans=0.125 2023-06-24 02:11:21,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631562.0, ans=0.125 2023-06-24 02:11:42,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-24 02:11:52,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-24 02:12:20,046 INFO [train.py:996] (1/4) Epoch 9, batch 28000, loss[loss=0.309, simple_loss=0.3556, pruned_loss=0.1312, over 21764.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3121, pruned_loss=0.07646, over 4281550.39 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:12:22,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1631742.0, ans=0.07 2023-06-24 02:12:33,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1631742.0, ans=0.125 2023-06-24 02:12:43,968 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-24 02:13:17,254 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.60 vs. limit=22.5 2023-06-24 02:13:22,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1631922.0, ans=0.0 2023-06-24 02:14:06,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.113e+02 6.367e+02 8.563e+02 1.195e+03 2.843e+03, threshold=1.713e+03, percent-clipped=10.0 2023-06-24 02:14:06,415 INFO [train.py:996] (1/4) Epoch 9, batch 28050, loss[loss=0.2873, simple_loss=0.3595, pruned_loss=0.1075, over 21589.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3098, pruned_loss=0.07703, over 4284904.94 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:14:21,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1632102.0, ans=0.2 2023-06-24 02:14:36,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632102.0, ans=0.1 2023-06-24 02:15:26,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1632282.0, ans=0.09899494936611666 2023-06-24 02:15:45,598 INFO [train.py:996] (1/4) Epoch 9, batch 28100, loss[loss=0.2241, simple_loss=0.2808, pruned_loss=0.0837, over 21388.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3081, pruned_loss=0.07704, over 4277209.13 frames. 
], batch size: 131, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:16:10,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1632402.0, ans=0.125 2023-06-24 02:16:18,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1632402.0, ans=0.125 2023-06-24 02:16:19,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1632462.0, ans=0.125 2023-06-24 02:16:29,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1632462.0, ans=0.0 2023-06-24 02:16:39,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1632522.0, ans=0.1 2023-06-24 02:17:22,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.058e+02 5.561e+02 8.006e+02 1.245e+03 3.714e+03, threshold=1.601e+03, percent-clipped=14.0 2023-06-24 02:17:22,351 INFO [train.py:996] (1/4) Epoch 9, batch 28150, loss[loss=0.2526, simple_loss=0.3076, pruned_loss=0.09879, over 21894.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3043, pruned_loss=0.07776, over 4279783.92 frames. ], batch size: 373, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:17:35,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1632642.0, ans=0.0 2023-06-24 02:18:01,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1632762.0, ans=0.125 2023-06-24 02:18:56,325 INFO [train.py:996] (1/4) Epoch 9, batch 28200, loss[loss=0.272, simple_loss=0.3879, pruned_loss=0.07802, over 19754.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3038, pruned_loss=0.07939, over 4274176.74 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:19:11,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1633002.0, ans=0.125 2023-06-24 02:19:29,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-24 02:19:40,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1633062.0, ans=0.125 2023-06-24 02:19:41,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1633062.0, ans=0.125 2023-06-24 02:19:54,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1633122.0, ans=0.125 2023-06-24 02:20:35,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 7.070e+02 1.041e+03 1.545e+03 2.791e+03, threshold=2.082e+03, percent-clipped=22.0 2023-06-24 02:20:35,778 INFO [train.py:996] (1/4) Epoch 9, batch 28250, loss[loss=0.1757, simple_loss=0.2257, pruned_loss=0.06285, over 20740.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3044, pruned_loss=0.08139, over 4271473.69 frames. 
], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:20:43,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1633242.0, ans=0.125 2023-06-24 02:20:55,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1633302.0, ans=0.0 2023-06-24 02:21:18,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1633362.0, ans=0.2 2023-06-24 02:21:29,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1633362.0, ans=0.1 2023-06-24 02:22:17,419 INFO [train.py:996] (1/4) Epoch 9, batch 28300, loss[loss=0.2074, simple_loss=0.3033, pruned_loss=0.05577, over 21517.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3035, pruned_loss=0.0792, over 4262092.19 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:22:17,774 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:23:30,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1633722.0, ans=0.2 2023-06-24 02:23:45,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1633782.0, ans=0.0 2023-06-24 02:23:55,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1633842.0, ans=0.0 2023-06-24 02:23:55,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1633842.0, ans=0.125 2023-06-24 02:23:56,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.879e+02 9.596e+02 1.261e+03 3.601e+03, threshold=1.919e+03, percent-clipped=6.0 2023-06-24 02:23:56,823 INFO [train.py:996] (1/4) Epoch 9, batch 28350, loss[loss=0.204, simple_loss=0.2612, pruned_loss=0.07337, over 21247.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3025, pruned_loss=0.07405, over 4258110.96 frames. ], batch size: 160, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:24:05,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1633842.0, ans=0.05 2023-06-24 02:25:14,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1634022.0, ans=0.125 2023-06-24 02:25:18,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1634082.0, ans=0.04949747468305833 2023-06-24 02:25:40,951 INFO [train.py:996] (1/4) Epoch 9, batch 28400, loss[loss=0.194, simple_loss=0.2613, pruned_loss=0.06336, over 21627.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2981, pruned_loss=0.07408, over 4254433.44 frames. 
], batch size: 247, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:26:25,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1634262.0, ans=0.09899494936611666 2023-06-24 02:26:49,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1634322.0, ans=0.125 2023-06-24 02:27:15,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1634382.0, ans=0.0 2023-06-24 02:27:18,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 5.958e+02 8.832e+02 1.277e+03 2.222e+03, threshold=1.766e+03, percent-clipped=4.0 2023-06-24 02:27:18,277 INFO [train.py:996] (1/4) Epoch 9, batch 28450, loss[loss=0.2002, simple_loss=0.2728, pruned_loss=0.06382, over 15437.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3023, pruned_loss=0.07763, over 4256523.53 frames. ], batch size: 60, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:27:45,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634502.0, ans=0.1 2023-06-24 02:27:53,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634502.0, ans=0.1 2023-06-24 02:28:20,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1634622.0, ans=0.1 2023-06-24 02:28:33,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-24 02:28:55,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1634742.0, ans=0.2 2023-06-24 02:28:55,997 INFO [train.py:996] (1/4) Epoch 9, batch 28500, loss[loss=0.2798, simple_loss=0.3475, pruned_loss=0.106, over 21482.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3044, pruned_loss=0.08019, over 4260116.14 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:30:40,784 INFO [train.py:996] (1/4) Epoch 9, batch 28550, loss[loss=0.2807, simple_loss=0.3859, pruned_loss=0.08775, over 21272.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.313, pruned_loss=0.08344, over 4268030.09 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:30:41,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1635042.0, ans=0.125 2023-06-24 02:30:42,309 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.957e+02 5.998e+02 7.738e+02 1.217e+03 2.112e+03, threshold=1.548e+03, percent-clipped=6.0 2023-06-24 02:30:55,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=12.0 2023-06-24 02:31:09,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1635102.0, ans=0.0 2023-06-24 02:31:16,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1635102.0, ans=0.125 2023-06-24 02:31:18,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. 
limit=22.5 2023-06-24 02:32:24,259 INFO [train.py:996] (1/4) Epoch 9, batch 28600, loss[loss=0.2661, simple_loss=0.3374, pruned_loss=0.09739, over 21586.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3189, pruned_loss=0.08539, over 4274261.79 frames. ], batch size: 389, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:33:46,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635582.0, ans=0.125 2023-06-24 02:34:02,759 INFO [train.py:996] (1/4) Epoch 9, batch 28650, loss[loss=0.2246, simple_loss=0.275, pruned_loss=0.08709, over 21319.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.314, pruned_loss=0.08537, over 4269563.18 frames. ], batch size: 144, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:34:11,204 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.746e+02 6.069e+02 8.380e+02 1.162e+03 2.307e+03, threshold=1.676e+03, percent-clipped=7.0 2023-06-24 02:34:14,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635642.0, ans=0.125 2023-06-24 02:34:25,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1635702.0, ans=0.0 2023-06-24 02:34:45,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-24 02:34:59,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1635822.0, ans=0.125 2023-06-24 02:35:32,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1635882.0, ans=0.125 2023-06-24 02:35:46,699 INFO [train.py:996] (1/4) Epoch 9, batch 28700, loss[loss=0.2564, simple_loss=0.3191, pruned_loss=0.09684, over 21460.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3128, pruned_loss=0.0866, over 4268300.18 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:35:54,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1635942.0, ans=0.0 2023-06-24 02:35:54,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1635942.0, ans=0.2 2023-06-24 02:36:12,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1636002.0, ans=0.0 2023-06-24 02:37:25,546 INFO [train.py:996] (1/4) Epoch 9, batch 28750, loss[loss=0.2342, simple_loss=0.3164, pruned_loss=0.07604, over 21800.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3111, pruned_loss=0.0861, over 4276940.74 frames. 
], batch size: 282, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:37:26,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636242.0, ans=0.1 2023-06-24 02:37:27,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1636242.0, ans=0.5 2023-06-24 02:37:28,866 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.971e+02 6.417e+02 8.454e+02 1.129e+03 2.571e+03, threshold=1.691e+03, percent-clipped=6.0 2023-06-24 02:37:35,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1636242.0, ans=0.0 2023-06-24 02:37:55,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1636302.0, ans=0.125 2023-06-24 02:38:14,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636362.0, ans=0.1 2023-06-24 02:38:22,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1636422.0, ans=0.0 2023-06-24 02:39:00,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1636482.0, ans=0.125 2023-06-24 02:39:04,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=15.0 2023-06-24 02:39:04,499 INFO [train.py:996] (1/4) Epoch 9, batch 28800, loss[loss=0.3172, simple_loss=0.3882, pruned_loss=0.1231, over 21801.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3153, pruned_loss=0.08595, over 4271121.12 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:39:34,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1636602.0, ans=0.0 2023-06-24 02:39:57,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636662.0, ans=0.1 2023-06-24 02:40:43,045 INFO [train.py:996] (1/4) Epoch 9, batch 28850, loss[loss=0.2561, simple_loss=0.3101, pruned_loss=0.101, over 21454.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3169, pruned_loss=0.08787, over 4275706.77 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:40:46,360 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.030e+02 9.281e+02 1.224e+03 2.045e+03, threshold=1.856e+03, percent-clipped=4.0 2023-06-24 02:40:53,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1636842.0, ans=0.125 2023-06-24 02:40:53,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1636842.0, ans=0.1 2023-06-24 02:40:59,912 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-24 02:42:23,129 INFO [train.py:996] (1/4) Epoch 9, batch 28900, loss[loss=0.3002, simple_loss=0.3694, pruned_loss=0.1155, over 21747.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3223, pruned_loss=0.09028, over 4275834.73 frames. 
], batch size: 441, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:42:59,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1637202.0, ans=0.2 2023-06-24 02:43:23,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637262.0, ans=0.1 2023-06-24 02:43:45,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-24 02:43:48,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637382.0, ans=0.1 2023-06-24 02:43:54,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1637382.0, ans=0.125 2023-06-24 02:44:07,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1637442.0, ans=0.1 2023-06-24 02:44:08,024 INFO [train.py:996] (1/4) Epoch 9, batch 28950, loss[loss=0.2449, simple_loss=0.3256, pruned_loss=0.08213, over 19959.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3224, pruned_loss=0.08948, over 4273091.34 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:44:11,415 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.524e+02 1.128e+03 1.793e+03 3.083e+03, threshold=2.257e+03, percent-clipped=23.0 2023-06-24 02:45:05,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1637562.0, ans=0.0 2023-06-24 02:45:05,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1637562.0, ans=0.07 2023-06-24 02:45:40,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1637682.0, ans=0.2 2023-06-24 02:45:47,870 INFO [train.py:996] (1/4) Epoch 9, batch 29000, loss[loss=0.2302, simple_loss=0.3053, pruned_loss=0.07758, over 21780.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3239, pruned_loss=0.08836, over 4271031.60 frames. ], batch size: 247, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:46:54,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1637922.0, ans=0.125 2023-06-24 02:47:32,710 INFO [train.py:996] (1/4) Epoch 9, batch 29050, loss[loss=0.2303, simple_loss=0.2958, pruned_loss=0.08236, over 21587.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3231, pruned_loss=0.08826, over 4272279.26 frames. 
], batch size: 212, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:47:40,523 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.530e+02 1.098e+03 1.738e+03 3.592e+03, threshold=2.195e+03, percent-clipped=7.0 2023-06-24 02:47:48,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1638042.0, ans=0.0 2023-06-24 02:48:20,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1638162.0, ans=0.1 2023-06-24 02:48:39,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1638222.0, ans=0.125 2023-06-24 02:48:47,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1638222.0, ans=0.125 2023-06-24 02:48:51,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-24 02:48:52,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1638282.0, ans=0.5 2023-06-24 02:49:09,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-24 02:49:11,748 INFO [train.py:996] (1/4) Epoch 9, batch 29100, loss[loss=0.211, simple_loss=0.275, pruned_loss=0.07347, over 21787.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3152, pruned_loss=0.08619, over 4272203.61 frames. ], batch size: 371, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:49:14,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-24 02:50:09,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1638522.0, ans=0.07 2023-06-24 02:50:17,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1638522.0, ans=0.1 2023-06-24 02:50:54,252 INFO [train.py:996] (1/4) Epoch 9, batch 29150, loss[loss=0.2475, simple_loss=0.3109, pruned_loss=0.09201, over 21313.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3132, pruned_loss=0.08454, over 4271719.51 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:50:57,284 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.333e+02 5.735e+02 8.237e+02 1.411e+03 3.649e+03, threshold=1.647e+03, percent-clipped=7.0 2023-06-24 02:50:57,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1638642.0, ans=0.95 2023-06-24 02:51:31,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-24 02:51:56,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.06 vs. 
limit=15.0 2023-06-24 02:52:20,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1638882.0, ans=0.125 2023-06-24 02:52:32,385 INFO [train.py:996] (1/4) Epoch 9, batch 29200, loss[loss=0.2856, simple_loss=0.3434, pruned_loss=0.1139, over 21446.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3078, pruned_loss=0.08305, over 4258626.78 frames. ], batch size: 509, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:52:36,743 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5 2023-06-24 02:53:07,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1639062.0, ans=0.2 2023-06-24 02:53:12,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1639062.0, ans=0.125 2023-06-24 02:54:06,547 INFO [train.py:996] (1/4) Epoch 9, batch 29250, loss[loss=0.2093, simple_loss=0.2929, pruned_loss=0.06288, over 21546.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3061, pruned_loss=0.08069, over 4260485.72 frames. ], batch size: 230, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:54:09,745 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.813e+02 6.323e+02 1.080e+03 1.364e+03 2.361e+03, threshold=2.161e+03, percent-clipped=10.0 2023-06-24 02:54:33,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1639302.0, ans=0.125 2023-06-24 02:54:44,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1639362.0, ans=0.125 2023-06-24 02:55:10,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1639422.0, ans=0.125 2023-06-24 02:55:45,682 INFO [train.py:996] (1/4) Epoch 9, batch 29300, loss[loss=0.1958, simple_loss=0.2698, pruned_loss=0.0609, over 21631.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3074, pruned_loss=0.07977, over 4267108.92 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:56:18,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1639602.0, ans=0.125 2023-06-24 02:56:34,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639662.0, ans=0.1 2023-06-24 02:57:21,361 INFO [train.py:996] (1/4) Epoch 9, batch 29350, loss[loss=0.2447, simple_loss=0.2954, pruned_loss=0.09695, over 21235.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3037, pruned_loss=0.07934, over 4262503.59 frames. ], batch size: 144, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:57:22,508 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-24 02:57:29,484 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 5.877e+02 8.410e+02 1.271e+03 3.253e+03, threshold=1.682e+03, percent-clipped=5.0 2023-06-24 02:57:32,466 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. 
limit=6.0 2023-06-24 02:57:38,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1639842.0, ans=0.0 2023-06-24 02:58:16,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-24 02:58:20,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640022.0, ans=0.0 2023-06-24 02:59:02,526 INFO [train.py:996] (1/4) Epoch 9, batch 29400, loss[loss=0.2054, simple_loss=0.2799, pruned_loss=0.06546, over 21682.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3041, pruned_loss=0.07688, over 4256141.04 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:59:06,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640142.0, ans=0.1 2023-06-24 02:59:41,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=22.5 2023-06-24 03:00:03,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1640322.0, ans=0.125 2023-06-24 03:00:08,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-24 03:00:38,204 INFO [train.py:996] (1/4) Epoch 9, batch 29450, loss[loss=0.1987, simple_loss=0.2685, pruned_loss=0.06449, over 21629.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3024, pruned_loss=0.07568, over 4261049.75 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:00:43,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 8.018e+02 1.543e+03 2.395e+03 4.126e+03, threshold=3.085e+03, percent-clipped=41.0 2023-06-24 03:00:54,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640442.0, ans=0.1 2023-06-24 03:00:56,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1640442.0, ans=0.025 2023-06-24 03:01:09,381 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:01:18,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1640562.0, ans=0.125 2023-06-24 03:02:13,464 INFO [train.py:996] (1/4) Epoch 9, batch 29500, loss[loss=0.217, simple_loss=0.2912, pruned_loss=0.07136, over 21869.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3067, pruned_loss=0.0786, over 4270433.91 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:03:15,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1640922.0, ans=0.125 2023-06-24 03:03:22,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1640922.0, ans=0.125 2023-06-24 03:03:25,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=22.5 2023-06-24 03:03:52,446 INFO [train.py:996] (1/4) Epoch 9, batch 29550, loss[loss=0.2328, simple_loss=0.3015, pruned_loss=0.08203, over 21629.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3051, pruned_loss=0.0799, over 4279679.72 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:03:57,065 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.109e+02 5.276e+02 6.489e+02 8.099e+02 1.842e+03, threshold=1.298e+03, percent-clipped=0.0 2023-06-24 03:04:58,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1641222.0, ans=0.0 2023-06-24 03:05:33,549 INFO [train.py:996] (1/4) Epoch 9, batch 29600, loss[loss=0.2851, simple_loss=0.3646, pruned_loss=0.1027, over 21659.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3134, pruned_loss=0.08284, over 4276532.64 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:05:48,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-24 03:05:53,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1641402.0, ans=0.125 2023-06-24 03:05:59,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1641402.0, ans=0.5 2023-06-24 03:06:22,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-24 03:06:49,813 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-24 03:07:07,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641582.0, ans=0.1 2023-06-24 03:07:16,568 INFO [train.py:996] (1/4) Epoch 9, batch 29650, loss[loss=0.2091, simple_loss=0.29, pruned_loss=0.06412, over 21410.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.311, pruned_loss=0.07978, over 4275684.37 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:07:17,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1641642.0, ans=0.125 2023-06-24 03:07:21,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.535e+02 6.438e+02 9.787e+02 1.351e+03 2.800e+03, threshold=1.957e+03, percent-clipped=29.0 2023-06-24 03:08:10,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1641762.0, ans=0.0 2023-06-24 03:08:21,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641822.0, ans=0.1 2023-06-24 03:08:28,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-24 03:08:56,657 INFO [train.py:996] (1/4) Epoch 9, batch 29700, loss[loss=0.2204, simple_loss=0.3138, pruned_loss=0.06345, over 21823.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3112, pruned_loss=0.07986, over 4278971.13 frames. 
], batch size: 351, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:09:42,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642062.0, ans=0.1 2023-06-24 03:10:34,586 INFO [train.py:996] (1/4) Epoch 9, batch 29750, loss[loss=0.2898, simple_loss=0.3627, pruned_loss=0.1084, over 21588.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3145, pruned_loss=0.07945, over 4281328.32 frames. ], batch size: 507, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:10:35,291 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:10:40,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.742e+02 5.529e+02 6.914e+02 9.553e+02 2.350e+03, threshold=1.383e+03, percent-clipped=6.0 2023-06-24 03:10:49,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1642302.0, ans=0.0 2023-06-24 03:11:24,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1642362.0, ans=0.04949747468305833 2023-06-24 03:11:28,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1642362.0, ans=15.0 2023-06-24 03:11:46,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-24 03:12:13,541 INFO [train.py:996] (1/4) Epoch 9, batch 29800, loss[loss=0.2537, simple_loss=0.3186, pruned_loss=0.09442, over 21303.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.317, pruned_loss=0.08142, over 4282093.17 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:13:22,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1642722.0, ans=0.125 2023-06-24 03:13:23,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1642722.0, ans=0.125 2023-06-24 03:13:29,663 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:13:51,071 INFO [train.py:996] (1/4) Epoch 9, batch 29850, loss[loss=0.2249, simple_loss=0.2989, pruned_loss=0.07551, over 21906.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3143, pruned_loss=0.08031, over 4276236.94 frames. ], batch size: 124, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:13:57,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 7.548e+02 1.159e+03 1.635e+03 3.345e+03, threshold=2.317e+03, percent-clipped=36.0 2023-06-24 03:13:57,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642842.0, ans=0.1 2023-06-24 03:15:29,150 INFO [train.py:996] (1/4) Epoch 9, batch 29900, loss[loss=0.2402, simple_loss=0.3065, pruned_loss=0.08693, over 21381.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3119, pruned_loss=0.08108, over 4281077.98 frames. 
], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:15:57,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1643202.0, ans=0.1 2023-06-24 03:16:22,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1643262.0, ans=0.125 2023-06-24 03:16:28,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1643262.0, ans=0.125 2023-06-24 03:17:08,333 INFO [train.py:996] (1/4) Epoch 9, batch 29950, loss[loss=0.2525, simple_loss=0.3163, pruned_loss=0.09432, over 21631.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3163, pruned_loss=0.08512, over 4279444.34 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:17:19,368 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 5.748e+02 7.806e+02 1.232e+03 2.482e+03, threshold=1.561e+03, percent-clipped=2.0 2023-06-24 03:17:40,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1643502.0, ans=0.2 2023-06-24 03:17:42,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1643502.0, ans=0.0 2023-06-24 03:17:52,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-24 03:18:04,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1643562.0, ans=0.125 2023-06-24 03:18:08,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1643562.0, ans=6.0 2023-06-24 03:18:09,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1643562.0, ans=0.125 2023-06-24 03:18:22,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1643622.0, ans=0.5 2023-06-24 03:18:30,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1643622.0, ans=0.125 2023-06-24 03:19:00,251 INFO [train.py:996] (1/4) Epoch 9, batch 30000, loss[loss=0.21, simple_loss=0.3051, pruned_loss=0.05746, over 21725.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3185, pruned_loss=0.08491, over 4275215.53 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:19:00,251 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 03:19:17,056 INFO [train.py:1028] (1/4) Epoch 9, validation: loss=0.2502, simple_loss=0.3471, pruned_loss=0.07663, over 1796401.00 frames. 
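Note on the loss fields in these train.py entries: in a pruned-transducer recipe, simple_loss is the cheap loss from the simple (linear) joiner used to derive pruning bounds, and pruned_loss is the full-joiner loss evaluated on the pruned lattice; all values are averaged per frame. The logged totals are consistent with loss ≈ 0.5 * simple_loss + pruned_loss once the pruned term is fully weighted: the validation entry just above gives 0.5 * 0.3471 + 0.07663 ≈ 0.2502, and the batch-30000 tot_loss gives 0.5 * 0.3185 + 0.08491 ≈ 0.2442. The snippet below is only a minimal sketch of that combination under these assumptions, not this recipe's exact training code, and the names are illustrative.

# Minimal sketch of how the logged 'loss' relates to 'simple_loss' and 'pruned_loss'.
# Assumption: a fixed 0.5 scale on the simple term and 1.0 on the pruned term
# (early in training the pruned term is typically down-weighted instead).
def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_scale: float = 0.5, pruned_scale: float = 1.0) -> float:
    return simple_scale * simple_loss + pruned_scale * pruned_loss

# Check against the validation entry above: 0.5 * 0.3471 + 0.07663 ~= 0.2502
assert abs(combined_loss(0.3471, 0.07663) - 0.2502) < 5e-4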
2023-06-24 03:19:17,057 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 03:19:35,696 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:20:08,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1643862.0, ans=0.125 2023-06-24 03:20:33,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1643922.0, ans=0.07 2023-06-24 03:20:40,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643922.0, ans=0.1 2023-06-24 03:21:04,967 INFO [train.py:996] (1/4) Epoch 9, batch 30050, loss[loss=0.3113, simple_loss=0.4155, pruned_loss=0.1035, over 21498.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3239, pruned_loss=0.08272, over 4273968.08 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:21:11,489 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.143e+02 7.535e+02 1.024e+03 1.337e+03 2.624e+03, threshold=2.049e+03, percent-clipped=15.0 2023-06-24 03:22:44,519 INFO [train.py:996] (1/4) Epoch 9, batch 30100, loss[loss=0.2509, simple_loss=0.3046, pruned_loss=0.09855, over 21787.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3199, pruned_loss=0.08215, over 4267304.03 frames. ], batch size: 317, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:22:51,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.56 vs. limit=10.0 2023-06-24 03:23:16,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1644402.0, ans=0.5 2023-06-24 03:23:55,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644522.0, ans=0.1 2023-06-24 03:24:12,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-24 03:24:19,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1644582.0, ans=0.0 2023-06-24 03:24:29,662 INFO [train.py:996] (1/4) Epoch 9, batch 30150, loss[loss=0.2584, simple_loss=0.3251, pruned_loss=0.09584, over 21697.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3175, pruned_loss=0.08417, over 4250792.39 frames. ], batch size: 298, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:24:38,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 6.683e+02 1.059e+03 1.463e+03 4.541e+03, threshold=2.119e+03, percent-clipped=12.0 2023-06-24 03:24:50,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1644702.0, ans=0.125 2023-06-24 03:25:21,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1644762.0, ans=0.125 2023-06-24 03:25:24,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1644762.0, ans=0.0 2023-06-24 03:26:12,276 INFO [train.py:996] (1/4) Epoch 9, batch 30200, loss[loss=0.2153, simple_loss=0.3238, pruned_loss=0.05334, over 21662.00 frames. 
], tot_loss[loss=0.2426, simple_loss=0.32, pruned_loss=0.08255, over 4259932.78 frames. ], batch size: 389, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:26:16,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1644942.0, ans=0.0 2023-06-24 03:27:06,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1645062.0, ans=0.125 2023-06-24 03:27:32,935 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:27:37,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1645182.0, ans=0.125 2023-06-24 03:27:40,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645182.0, ans=0.1 2023-06-24 03:27:57,827 INFO [train.py:996] (1/4) Epoch 9, batch 30250, loss[loss=0.2302, simple_loss=0.306, pruned_loss=0.07722, over 19983.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3256, pruned_loss=0.08392, over 4260207.53 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:27:58,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645242.0, ans=0.1 2023-06-24 03:28:02,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1645242.0, ans=0.0 2023-06-24 03:28:05,422 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.017e+02 5.727e+02 7.436e+02 1.048e+03 2.592e+03, threshold=1.487e+03, percent-clipped=2.0 2023-06-24 03:29:22,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.77 vs. limit=15.0 2023-06-24 03:29:23,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-24 03:29:36,912 INFO [train.py:996] (1/4) Epoch 9, batch 30300, loss[loss=0.2006, simple_loss=0.2726, pruned_loss=0.06434, over 20123.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3227, pruned_loss=0.08396, over 4262340.53 frames. ], batch size: 704, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:31:05,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1645782.0, ans=0.125 2023-06-24 03:31:28,067 INFO [train.py:996] (1/4) Epoch 9, batch 30350, loss[loss=0.3104, simple_loss=0.39, pruned_loss=0.1154, over 21537.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3209, pruned_loss=0.08408, over 4264837.03 frames. ], batch size: 473, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:31:36,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.062e+02 6.729e+02 9.654e+02 1.457e+03 3.930e+03, threshold=1.931e+03, percent-clipped=23.0 2023-06-24 03:31:52,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-24 03:32:47,133 INFO [train.py:996] (1/4) Epoch 9, batch 30400, loss[loss=0.249, simple_loss=0.3002, pruned_loss=0.09889, over 20295.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3149, pruned_loss=0.08265, over 4257004.96 frames. 
], batch size: 702, lr: 3.19e-03, grad_scale: 32.0 2023-06-24 03:33:19,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1646262.0, ans=0.125 2023-06-24 03:33:27,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1646262.0, ans=0.0 2023-06-24 03:33:52,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-24 03:33:56,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-24 03:34:12,615 INFO [train.py:996] (1/4) Epoch 9, batch 30450, loss[loss=0.2722, simple_loss=0.3822, pruned_loss=0.08115, over 19791.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3168, pruned_loss=0.08212, over 4198872.32 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:34:21,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.756e+02 1.127e+03 2.078e+03 9.482e+03, threshold=2.254e+03, percent-clipped=27.0 2023-06-24 03:34:39,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1646502.0, ans=0.0 2023-06-24 03:34:42,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1646502.0, ans=0.125 2023-06-24 03:34:51,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1646562.0, ans=0.0 2023-06-24 03:34:52,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-24 03:34:56,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1646562.0, ans=0.125 2023-06-24 03:35:04,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.45 vs. limit=22.5 2023-06-24 03:36:54,762 INFO [train.py:996] (1/4) Epoch 10, batch 0, loss[loss=0.2229, simple_loss=0.2832, pruned_loss=0.08133, over 21344.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2832, pruned_loss=0.08133, over 21344.00 frames. ], batch size: 177, lr: 3.02e-03, grad_scale: 32.0 2023-06-24 03:36:54,763 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 03:37:02,957 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7977, 3.2075, 3.1440, 4.2579], device='cuda:1') 2023-06-24 03:37:08,268 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3010, 4.6593, 2.6645, 2.0988], device='cuda:1') 2023-06-24 03:37:10,563 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2396, simple_loss=0.3488, pruned_loss=0.06521, over 1796401.00 frames. 
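Note on the optim.py Clipping_scale entries: the five grad-norm quartile values read as the minimum, 25th percentile, median, 75th percentile and maximum of recently observed gradient norms, and in each entry the reported threshold equals Clipping_scale times the median (for the entry above, 2.0 * 1.127e+03 ≈ 2.254e+03; likewise 2.0 * 1.547e+03 ≈ 3.095e+03 a few entries later), with percent-clipped the share of recent updates whose gradient norm exceeded that threshold. The helper below is a hypothetical reconstruction of that bookkeeping for illustration; it is not the optimizer's actual implementation, and the names and history window are assumptions.

# Hypothetical reconstruction of the quartile-based clipping threshold reported
# by the optim.py log entries; names and the exact history window are assumptions.
import torch

def clipping_report(recent_grad_norms: torch.Tensor, clipping_scale: float = 2.0):
    """recent_grad_norms: 1-D float tensor of gradient norms from recent updates."""
    quartiles = torch.quantile(
        recent_grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2]          # scale times the median
    percent_clipped = 100.0 * (recent_grad_norms > threshold).float().mean()
    return quartiles, threshold, percent_clipped

# With a median gradient norm of 1.127e+03 this yields a threshold of 2.254e+03,
# matching the entry above.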
2023-06-24 03:37:10,563 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 03:37:22,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1646712.0, ans=0.0 2023-06-24 03:37:49,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1646772.0, ans=0.04949747468305833 2023-06-24 03:38:29,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1646892.0, ans=0.125 2023-06-24 03:38:49,151 INFO [train.py:996] (1/4) Epoch 10, batch 50, loss[loss=0.2946, simple_loss=0.3655, pruned_loss=0.1119, over 21844.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3274, pruned_loss=0.08475, over 946845.01 frames. ], batch size: 118, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:38:49,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1647012.0, ans=0.07 2023-06-24 03:39:11,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1647012.0, ans=0.0 2023-06-24 03:39:18,469 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 8.479e+02 1.547e+03 2.623e+03 5.891e+03, threshold=3.095e+03, percent-clipped=28.0 2023-06-24 03:39:19,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1647072.0, ans=0.2 2023-06-24 03:39:42,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1647132.0, ans=0.125 2023-06-24 03:40:29,213 INFO [train.py:996] (1/4) Epoch 10, batch 100, loss[loss=0.3611, simple_loss=0.4127, pruned_loss=0.1547, over 21347.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3387, pruned_loss=0.08708, over 1677237.00 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:41:36,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1647492.0, ans=0.0 2023-06-24 03:42:05,199 INFO [train.py:996] (1/4) Epoch 10, batch 150, loss[loss=0.2396, simple_loss=0.3385, pruned_loss=0.07033, over 21739.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3395, pruned_loss=0.08575, over 2252833.25 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:42:15,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1647612.0, ans=0.125 2023-06-24 03:42:39,496 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 6.057e+02 8.805e+02 1.461e+03 2.839e+03, threshold=1.761e+03, percent-clipped=0.0 2023-06-24 03:43:46,544 INFO [train.py:996] (1/4) Epoch 10, batch 200, loss[loss=0.256, simple_loss=0.335, pruned_loss=0.08848, over 21412.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3367, pruned_loss=0.08302, over 2706220.41 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:43:50,763 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. 
limit=15.0 2023-06-24 03:43:54,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1647912.0, ans=0.1 2023-06-24 03:44:11,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-24 03:44:24,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1647972.0, ans=0.1 2023-06-24 03:44:24,783 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-24 03:44:28,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1648032.0, ans=0.0 2023-06-24 03:44:58,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1648092.0, ans=0.2 2023-06-24 03:45:18,480 INFO [train.py:996] (1/4) Epoch 10, batch 250, loss[loss=0.2363, simple_loss=0.3108, pruned_loss=0.0809, over 21767.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3296, pruned_loss=0.08233, over 3054562.90 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:45:48,953 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.127e+02 6.491e+02 8.586e+02 1.362e+03 2.608e+03, threshold=1.717e+03, percent-clipped=13.0 2023-06-24 03:46:07,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1648332.0, ans=0.125 2023-06-24 03:46:41,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1648452.0, ans=0.125 2023-06-24 03:46:58,118 INFO [train.py:996] (1/4) Epoch 10, batch 300, loss[loss=0.2525, simple_loss=0.3356, pruned_loss=0.08471, over 21829.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3263, pruned_loss=0.08331, over 3327125.90 frames. ], batch size: 332, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:47:03,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1648512.0, ans=0.0 2023-06-24 03:47:05,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1648512.0, ans=0.04949747468305833 2023-06-24 03:47:08,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1648512.0, ans=12.0 2023-06-24 03:47:43,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=12.0 2023-06-24 03:48:34,872 INFO [train.py:996] (1/4) Epoch 10, batch 350, loss[loss=0.2269, simple_loss=0.2938, pruned_loss=0.07997, over 21491.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3192, pruned_loss=0.08313, over 3529990.31 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:48:49,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-06-24 03:49:03,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1648872.0, ans=0.2 2023-06-24 03:49:06,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.104e+02 6.936e+02 9.570e+02 1.355e+03 2.301e+03, threshold=1.914e+03, percent-clipped=7.0 2023-06-24 03:49:11,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.69 vs. limit=6.0 2023-06-24 03:50:04,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1649052.0, ans=0.125 2023-06-24 03:50:14,773 INFO [train.py:996] (1/4) Epoch 10, batch 400, loss[loss=0.2049, simple_loss=0.294, pruned_loss=0.05793, over 21208.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3141, pruned_loss=0.08094, over 3695301.51 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:50:42,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1649172.0, ans=0.2 2023-06-24 03:51:17,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1649292.0, ans=0.0 2023-06-24 03:51:17,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1649292.0, ans=0.2 2023-06-24 03:51:26,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-24 03:51:57,944 INFO [train.py:996] (1/4) Epoch 10, batch 450, loss[loss=0.2241, simple_loss=0.2862, pruned_loss=0.08098, over 21578.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3107, pruned_loss=0.08035, over 3826303.29 frames. ], batch size: 391, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:52:22,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.467e+02 1.066e+03 1.544e+03 3.388e+03, threshold=2.132e+03, percent-clipped=13.0 2023-06-24 03:52:45,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1649532.0, ans=0.0 2023-06-24 03:52:53,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1649592.0, ans=0.125 2023-06-24 03:52:54,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1649592.0, ans=0.0 2023-06-24 03:52:58,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 03:53:19,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-24 03:53:29,659 INFO [train.py:996] (1/4) Epoch 10, batch 500, loss[loss=0.2674, simple_loss=0.3769, pruned_loss=0.07895, over 21666.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3108, pruned_loss=0.07966, over 3925824.17 frames. 
], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:53:47,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1649712.0, ans=0.2 2023-06-24 03:54:51,545 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:55:04,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1649952.0, ans=0.125 2023-06-24 03:55:07,334 INFO [train.py:996] (1/4) Epoch 10, batch 550, loss[loss=0.2076, simple_loss=0.2962, pruned_loss=0.05947, over 21738.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3113, pruned_loss=0.07817, over 4004361.87 frames. ], batch size: 351, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:55:17,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1650012.0, ans=0.0 2023-06-24 03:55:26,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:55:28,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:55:28,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:55:32,664 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.909e+02 8.917e+02 1.249e+03 2.003e+03 3.580e+03, threshold=2.497e+03, percent-clipped=21.0 2023-06-24 03:55:37,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:55:47,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-24 03:56:00,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1650132.0, ans=12.0 2023-06-24 03:56:40,152 INFO [train.py:996] (1/4) Epoch 10, batch 600, loss[loss=0.2273, simple_loss=0.294, pruned_loss=0.0803, over 21858.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3138, pruned_loss=0.07739, over 4069666.22 frames. ], batch size: 107, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:56:51,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650312.0, ans=0.1 2023-06-24 03:57:23,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-24 03:57:27,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-24 03:58:04,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1650552.0, ans=0.125 2023-06-24 03:58:13,513 INFO [train.py:996] (1/4) Epoch 10, batch 650, loss[loss=0.2218, simple_loss=0.3192, pruned_loss=0.06221, over 21644.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3163, pruned_loss=0.07699, over 4117450.40 frames. 
], batch size: 263, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:58:14,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1650612.0, ans=0.125 2023-06-24 03:58:21,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1650612.0, ans=0.125 2023-06-24 03:58:44,453 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 7.027e+02 1.084e+03 1.748e+03 3.374e+03, threshold=2.167e+03, percent-clipped=5.0 2023-06-24 03:59:41,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650852.0, ans=0.1 2023-06-24 03:59:44,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1650912.0, ans=0.1 2023-06-24 03:59:45,377 INFO [train.py:996] (1/4) Epoch 10, batch 700, loss[loss=0.2816, simple_loss=0.3839, pruned_loss=0.08963, over 21541.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3201, pruned_loss=0.07839, over 4158713.53 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:59:45,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1650912.0, ans=0.125 2023-06-24 04:00:18,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1650972.0, ans=0.125 2023-06-24 04:00:29,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1651032.0, ans=12.0 2023-06-24 04:01:08,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1651152.0, ans=0.0 2023-06-24 04:01:15,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1651152.0, ans=0.2 2023-06-24 04:01:27,488 INFO [train.py:996] (1/4) Epoch 10, batch 750, loss[loss=0.221, simple_loss=0.288, pruned_loss=0.07702, over 21825.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3192, pruned_loss=0.07903, over 4183781.38 frames. 
], batch size: 282, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:01:28,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1651212.0, ans=0.0 2023-06-24 04:01:48,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1651272.0, ans=0.125 2023-06-24 04:01:53,759 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.183e+02 6.075e+02 9.989e+02 1.388e+03 3.247e+03, threshold=1.998e+03, percent-clipped=7.0 2023-06-24 04:01:54,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651272.0, ans=0.1 2023-06-24 04:02:09,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1651332.0, ans=0.1 2023-06-24 04:02:27,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1651392.0, ans=0.035 2023-06-24 04:02:29,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1651392.0, ans=0.125 2023-06-24 04:02:40,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1651452.0, ans=0.0 2023-06-24 04:02:43,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1651452.0, ans=0.0 2023-06-24 04:02:48,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651452.0, ans=0.1 2023-06-24 04:02:50,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1651452.0, ans=15.0 2023-06-24 04:03:01,151 INFO [train.py:996] (1/4) Epoch 10, batch 800, loss[loss=0.2337, simple_loss=0.3028, pruned_loss=0.08228, over 21712.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3148, pruned_loss=0.07905, over 4193901.25 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:04:39,135 INFO [train.py:996] (1/4) Epoch 10, batch 850, loss[loss=0.2268, simple_loss=0.3027, pruned_loss=0.07549, over 21871.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3097, pruned_loss=0.07916, over 4220967.37 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:05:10,510 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.435e+02 1.007e+03 1.415e+03 2.798e+03, threshold=2.014e+03, percent-clipped=8.0 2023-06-24 04:05:19,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 04:05:27,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1651932.0, ans=0.125 2023-06-24 04:05:30,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-24 04:05:30,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. 
limit=15.0 2023-06-24 04:05:36,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1651932.0, ans=0.125 2023-06-24 04:05:40,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1651992.0, ans=0.0 2023-06-24 04:05:55,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1652052.0, ans=0.1 2023-06-24 04:06:12,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1652052.0, ans=0.0 2023-06-24 04:06:15,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1652112.0, ans=0.1 2023-06-24 04:06:16,334 INFO [train.py:996] (1/4) Epoch 10, batch 900, loss[loss=0.2013, simple_loss=0.2936, pruned_loss=0.05453, over 21840.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.306, pruned_loss=0.07815, over 4242938.56 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:06:37,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1652112.0, ans=0.125 2023-06-24 04:06:43,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1652172.0, ans=0.0 2023-06-24 04:07:02,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1652232.0, ans=0.125 2023-06-24 04:07:03,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-24 04:07:26,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1652292.0, ans=0.125 2023-06-24 04:07:50,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1652352.0, ans=0.125 2023-06-24 04:08:04,389 INFO [train.py:996] (1/4) Epoch 10, batch 950, loss[loss=0.2047, simple_loss=0.266, pruned_loss=0.07172, over 21577.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3033, pruned_loss=0.07853, over 4257542.16 frames. 
], batch size: 263, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:08:04,980 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1652412.0, ans=0.0 2023-06-24 04:08:14,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1652412.0, ans=0.0 2023-06-24 04:08:27,076 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.619e+02 7.658e+02 1.220e+03 3.060e+03, threshold=1.532e+03, percent-clipped=1.0 2023-06-24 04:08:34,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1652472.0, ans=0.02 2023-06-24 04:08:36,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1652472.0, ans=0.125 2023-06-24 04:08:48,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1652532.0, ans=0.125 2023-06-24 04:09:36,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-24 04:09:37,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1652652.0, ans=0.125 2023-06-24 04:09:39,809 INFO [train.py:996] (1/4) Epoch 10, batch 1000, loss[loss=0.2132, simple_loss=0.2818, pruned_loss=0.07234, over 21428.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3034, pruned_loss=0.07794, over 4265082.99 frames. ], batch size: 389, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:09:45,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1652712.0, ans=0.125 2023-06-24 04:10:47,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1652892.0, ans=0.125 2023-06-24 04:11:10,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1652952.0, ans=0.025 2023-06-24 04:11:19,962 INFO [train.py:996] (1/4) Epoch 10, batch 1050, loss[loss=0.1837, simple_loss=0.2755, pruned_loss=0.04596, over 21779.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3036, pruned_loss=0.07803, over 4266966.86 frames. ], batch size: 282, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:11:36,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1653012.0, ans=0.125 2023-06-24 04:11:46,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 8.659e+02 1.096e+03 1.679e+03 3.356e+03, threshold=2.191e+03, percent-clipped=32.0 2023-06-24 04:11:54,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-24 04:12:05,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1653132.0, ans=0.125 2023-06-24 04:12:53,900 INFO [train.py:996] (1/4) Epoch 10, batch 1100, loss[loss=0.2294, simple_loss=0.2931, pruned_loss=0.08287, over 21278.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3035, pruned_loss=0.07762, over 4275339.09 frames. 
], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:13:57,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-24 04:14:32,921 INFO [train.py:996] (1/4) Epoch 10, batch 1150, loss[loss=0.2367, simple_loss=0.3092, pruned_loss=0.08211, over 21686.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3045, pruned_loss=0.0779, over 4279869.58 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:14:43,981 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-06-24 04:14:52,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653672.0, ans=0.1 2023-06-24 04:15:00,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.939e+02 5.998e+02 8.488e+02 1.315e+03 2.677e+03, threshold=1.698e+03, percent-clipped=3.0 2023-06-24 04:15:00,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1653672.0, ans=0.2 2023-06-24 04:15:18,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1653732.0, ans=0.125 2023-06-24 04:15:28,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1653732.0, ans=0.125 2023-06-24 04:15:46,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1653792.0, ans=0.2 2023-06-24 04:15:59,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653852.0, ans=0.1 2023-06-24 04:16:07,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1653852.0, ans=0.0 2023-06-24 04:16:17,794 INFO [train.py:996] (1/4) Epoch 10, batch 1200, loss[loss=0.2754, simple_loss=0.3489, pruned_loss=0.1009, over 21804.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3049, pruned_loss=0.078, over 4270629.91 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:17:02,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1654032.0, ans=0.125 2023-06-24 04:17:30,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1654092.0, ans=0.0 2023-06-24 04:17:32,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-24 04:17:38,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1654152.0, ans=0.1 2023-06-24 04:17:56,456 INFO [train.py:996] (1/4) Epoch 10, batch 1250, loss[loss=0.2691, simple_loss=0.3483, pruned_loss=0.09494, over 21472.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3094, pruned_loss=0.07932, over 4272125.69 frames. 
], batch size: 131, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:18:19,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 6.637e+02 9.533e+02 1.248e+03 2.697e+03, threshold=1.907e+03, percent-clipped=13.0 2023-06-24 04:18:19,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1654272.0, ans=0.1 2023-06-24 04:18:28,935 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:18:39,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.44 vs. limit=10.0 2023-06-24 04:19:35,641 INFO [train.py:996] (1/4) Epoch 10, batch 1300, loss[loss=0.2441, simple_loss=0.3206, pruned_loss=0.08378, over 21822.00 frames. ], tot_loss[loss=0.235, simple_loss=0.311, pruned_loss=0.0795, over 4271320.83 frames. ], batch size: 282, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:19:49,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1654572.0, ans=0.125 2023-06-24 04:20:01,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1654572.0, ans=0.125 2023-06-24 04:20:01,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-24 04:21:08,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1654752.0, ans=0.1 2023-06-24 04:21:08,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1654752.0, ans=0.05 2023-06-24 04:21:14,474 INFO [train.py:996] (1/4) Epoch 10, batch 1350, loss[loss=0.2589, simple_loss=0.3234, pruned_loss=0.0972, over 21436.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3116, pruned_loss=0.08026, over 4278461.43 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:21:20,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-24 04:21:34,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.66 vs. limit=22.5 2023-06-24 04:21:42,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.99 vs. 
limit=10.0 2023-06-24 04:21:43,007 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.930e+02 9.209e+02 1.385e+03 4.036e+03, threshold=1.842e+03, percent-clipped=12.0 2023-06-24 04:21:43,434 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1654872.0, ans=0.2 2023-06-24 04:21:43,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1654872.0, ans=0.125 2023-06-24 04:22:26,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1655052.0, ans=0.125 2023-06-24 04:22:49,040 INFO [train.py:996] (1/4) Epoch 10, batch 1400, loss[loss=0.2099, simple_loss=0.2893, pruned_loss=0.06527, over 21792.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3109, pruned_loss=0.08043, over 4282944.03 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:22:51,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-24 04:22:55,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1655112.0, ans=0.125 2023-06-24 04:23:29,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1655232.0, ans=0.125 2023-06-24 04:24:01,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-24 04:24:02,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1655292.0, ans=0.2 2023-06-24 04:24:25,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1655352.0, ans=0.2 2023-06-24 04:24:28,489 INFO [train.py:996] (1/4) Epoch 10, batch 1450, loss[loss=0.2359, simple_loss=0.3126, pruned_loss=0.07965, over 21378.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3128, pruned_loss=0.08153, over 4285969.65 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:24:31,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1655412.0, ans=0.1 2023-06-24 04:24:42,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.90 vs. limit=10.0 2023-06-24 04:24:56,580 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.216e+02 6.383e+02 1.021e+03 1.504e+03 2.934e+03, threshold=2.041e+03, percent-clipped=11.0 2023-06-24 04:25:11,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1655532.0, ans=0.0 2023-06-24 04:25:33,098 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. 
limit=15.0 2023-06-24 04:25:58,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1655652.0, ans=0.1 2023-06-24 04:26:00,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1655652.0, ans=0.0 2023-06-24 04:26:07,465 INFO [train.py:996] (1/4) Epoch 10, batch 1500, loss[loss=0.2108, simple_loss=0.2931, pruned_loss=0.0642, over 21097.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3121, pruned_loss=0.08143, over 4287714.22 frames. ], batch size: 607, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:26:16,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1655712.0, ans=0.09899494936611666 2023-06-24 04:26:23,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1655772.0, ans=0.125 2023-06-24 04:26:55,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-24 04:27:04,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1655832.0, ans=0.125 2023-06-24 04:27:07,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1655832.0, ans=0.0 2023-06-24 04:27:25,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1655892.0, ans=0.0 2023-06-24 04:27:50,601 INFO [train.py:996] (1/4) Epoch 10, batch 1550, loss[loss=0.199, simple_loss=0.2725, pruned_loss=0.06279, over 21440.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3102, pruned_loss=0.0811, over 4278168.33 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:28:24,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.706e+02 8.802e+02 1.256e+03 2.211e+03, threshold=1.760e+03, percent-clipped=1.0 2023-06-24 04:28:39,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656132.0, ans=0.1 2023-06-24 04:28:56,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1656132.0, ans=0.0 2023-06-24 04:29:19,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.67 vs. limit=15.0 2023-06-24 04:29:29,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1656312.0, ans=0.125 2023-06-24 04:29:35,792 INFO [train.py:996] (1/4) Epoch 10, batch 1600, loss[loss=0.2346, simple_loss=0.3003, pruned_loss=0.08447, over 21576.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3112, pruned_loss=0.08218, over 4275271.98 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:30:13,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1656372.0, ans=0.0 2023-06-24 04:30:16,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.10 vs. 
limit=22.5 2023-06-24 04:30:46,784 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-24 04:31:09,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1656552.0, ans=0.05 2023-06-24 04:31:15,797 INFO [train.py:996] (1/4) Epoch 10, batch 1650, loss[loss=0.2421, simple_loss=0.3002, pruned_loss=0.09201, over 21351.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3102, pruned_loss=0.08179, over 4277766.33 frames. ], batch size: 176, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:31:44,377 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.141e+02 9.207e+02 1.280e+03 2.509e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-24 04:32:57,174 INFO [train.py:996] (1/4) Epoch 10, batch 1700, loss[loss=0.2084, simple_loss=0.313, pruned_loss=0.05188, over 19896.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3098, pruned_loss=0.08058, over 4272479.28 frames. ], batch size: 702, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:33:04,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1656912.0, ans=0.1 2023-06-24 04:33:49,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1657032.0, ans=0.05 2023-06-24 04:34:03,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1657092.0, ans=0.0 2023-06-24 04:34:45,130 INFO [train.py:996] (1/4) Epoch 10, batch 1750, loss[loss=0.2509, simple_loss=0.3252, pruned_loss=0.08825, over 21432.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3113, pruned_loss=0.08074, over 4263952.85 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:34:49,544 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-24 04:35:00,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1657212.0, ans=0.125 2023-06-24 04:35:21,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 6.335e+02 9.144e+02 1.525e+03 4.256e+03, threshold=1.829e+03, percent-clipped=17.0 2023-06-24 04:35:56,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1657392.0, ans=0.0 2023-06-24 04:36:01,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1657392.0, ans=0.0 2023-06-24 04:36:17,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1657452.0, ans=0.2 2023-06-24 04:36:32,789 INFO [train.py:996] (1/4) Epoch 10, batch 1800, loss[loss=0.2324, simple_loss=0.3014, pruned_loss=0.08169, over 21338.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3114, pruned_loss=0.07961, over 4262233.83 frames. 
], batch size: 176, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:36:36,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1657512.0, ans=0.125 2023-06-24 04:36:41,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1657512.0, ans=0.0 2023-06-24 04:37:14,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1657632.0, ans=0.125 2023-06-24 04:37:58,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1657752.0, ans=0.2 2023-06-24 04:38:13,217 INFO [train.py:996] (1/4) Epoch 10, batch 1850, loss[loss=0.2024, simple_loss=0.2636, pruned_loss=0.07057, over 16469.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3127, pruned_loss=0.07791, over 4264861.04 frames. ], batch size: 64, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:38:43,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.154e+02 6.426e+02 1.042e+03 1.664e+03 4.444e+03, threshold=2.085e+03, percent-clipped=25.0 2023-06-24 04:38:48,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657932.0, ans=0.1 2023-06-24 04:38:50,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657932.0, ans=0.1 2023-06-24 04:39:06,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1657932.0, ans=0.0 2023-06-24 04:39:10,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1657992.0, ans=0.125 2023-06-24 04:39:14,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1657992.0, ans=0.0 2023-06-24 04:39:41,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1658052.0, ans=0.125 2023-06-24 04:39:52,061 INFO [train.py:996] (1/4) Epoch 10, batch 1900, loss[loss=0.2161, simple_loss=0.282, pruned_loss=0.07512, over 21666.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3131, pruned_loss=0.07759, over 4274141.80 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:39:54,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1658112.0, ans=0.125 2023-06-24 04:40:28,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1658232.0, ans=0.125 2023-06-24 04:41:05,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-24 04:41:17,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-24 04:41:31,894 INFO [train.py:996] (1/4) Epoch 10, batch 1950, loss[loss=0.1885, simple_loss=0.2743, pruned_loss=0.05135, over 21085.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3091, pruned_loss=0.07778, over 4261630.57 frames. 
], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:41:46,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1658412.0, ans=0.0 2023-06-24 04:42:02,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.830e+02 7.074e+02 9.115e+02 1.415e+03 2.823e+03, threshold=1.823e+03, percent-clipped=5.0 2023-06-24 04:43:12,643 INFO [train.py:996] (1/4) Epoch 10, batch 2000, loss[loss=0.2226, simple_loss=0.2827, pruned_loss=0.08128, over 21562.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3026, pruned_loss=0.07539, over 4262980.55 frames. ], batch size: 414, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:43:32,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1658772.0, ans=0.0 2023-06-24 04:43:50,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1658772.0, ans=0.035 2023-06-24 04:43:59,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1658832.0, ans=0.0 2023-06-24 04:44:26,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1658892.0, ans=0.0 2023-06-24 04:44:57,328 INFO [train.py:996] (1/4) Epoch 10, batch 2050, loss[loss=0.2534, simple_loss=0.3236, pruned_loss=0.09158, over 21852.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3054, pruned_loss=0.0754, over 4272996.51 frames. ], batch size: 332, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:44:59,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1659012.0, ans=0.125 2023-06-24 04:45:16,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-24 04:45:28,494 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.380e+02 1.174e+03 1.683e+03 3.998e+03, threshold=2.349e+03, percent-clipped=22.0 2023-06-24 04:45:33,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1659132.0, ans=0.125 2023-06-24 04:45:44,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1659132.0, ans=0.0 2023-06-24 04:45:47,847 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:45:54,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659192.0, ans=0.1 2023-06-24 04:46:14,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1659192.0, ans=0.0 2023-06-24 04:46:30,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1659252.0, ans=0.2 2023-06-24 04:46:37,800 INFO [train.py:996] (1/4) Epoch 10, batch 2100, loss[loss=0.2651, simple_loss=0.3292, pruned_loss=0.1004, over 21795.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3092, pruned_loss=0.07618, over 4270636.27 frames. 
], batch size: 371, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:46:39,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1659312.0, ans=0.0 2023-06-24 04:47:32,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1659432.0, ans=0.0 2023-06-24 04:47:48,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1659492.0, ans=0.125 2023-06-24 04:48:17,974 INFO [train.py:996] (1/4) Epoch 10, batch 2150, loss[loss=0.2802, simple_loss=0.3344, pruned_loss=0.113, over 21572.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3096, pruned_loss=0.07749, over 4250834.59 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:48:31,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1659612.0, ans=0.0 2023-06-24 04:48:48,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 6.485e+02 1.170e+03 1.690e+03 3.411e+03, threshold=2.340e+03, percent-clipped=8.0 2023-06-24 04:48:50,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1659672.0, ans=0.125 2023-06-24 04:48:56,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1659732.0, ans=0.1 2023-06-24 04:49:11,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0 2023-06-24 04:49:17,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1659792.0, ans=0.0 2023-06-24 04:49:17,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1659792.0, ans=0.0 2023-06-24 04:49:58,098 INFO [train.py:996] (1/4) Epoch 10, batch 2200, loss[loss=0.1721, simple_loss=0.2344, pruned_loss=0.05486, over 15912.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3126, pruned_loss=0.07873, over 4256144.02 frames. ], batch size: 60, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:50:09,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1659912.0, ans=0.125 2023-06-24 04:50:46,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1660032.0, ans=0.0 2023-06-24 04:51:23,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1660152.0, ans=0.2 2023-06-24 04:51:31,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1660152.0, ans=0.125 2023-06-24 04:51:37,264 INFO [train.py:996] (1/4) Epoch 10, batch 2250, loss[loss=0.2418, simple_loss=0.3124, pruned_loss=0.08561, over 21532.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3076, pruned_loss=0.07747, over 4257653.47 frames. 
], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:52:08,792 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 6.896e+02 1.012e+03 1.519e+03 4.116e+03, threshold=2.025e+03, percent-clipped=5.0 2023-06-24 04:52:09,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1660272.0, ans=0.0 2023-06-24 04:52:42,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1660392.0, ans=0.0 2023-06-24 04:52:48,832 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-24 04:52:49,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1660392.0, ans=0.1 2023-06-24 04:53:04,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1660452.0, ans=0.125 2023-06-24 04:53:15,548 INFO [train.py:996] (1/4) Epoch 10, batch 2300, loss[loss=0.2081, simple_loss=0.2807, pruned_loss=0.06779, over 22037.00 frames. ], tot_loss[loss=0.229, simple_loss=0.304, pruned_loss=0.07701, over 4267374.78 frames. ], batch size: 119, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:54:26,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1660692.0, ans=0.0 2023-06-24 04:54:55,374 INFO [train.py:996] (1/4) Epoch 10, batch 2350, loss[loss=0.27, simple_loss=0.3316, pruned_loss=0.1042, over 21683.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3028, pruned_loss=0.07824, over 4267053.00 frames. ], batch size: 332, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:54:55,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1660812.0, ans=0.1 2023-06-24 04:54:59,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660812.0, ans=0.0 2023-06-24 04:55:13,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1660812.0, ans=0.125 2023-06-24 04:55:18,746 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:55:32,516 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.196e+02 7.285e+02 1.033e+03 1.548e+03 3.497e+03, threshold=2.065e+03, percent-clipped=14.0 2023-06-24 04:55:32,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1660872.0, ans=0.125 2023-06-24 04:55:42,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1660932.0, ans=0.0 2023-06-24 04:56:34,563 INFO [train.py:996] (1/4) Epoch 10, batch 2400, loss[loss=0.2685, simple_loss=0.3453, pruned_loss=0.09583, over 21828.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3044, pruned_loss=0.08034, over 4260789.17 frames. 
], batch size: 124, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:56:49,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1661112.0, ans=0.0 2023-06-24 04:57:12,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1661172.0, ans=0.2 2023-06-24 04:57:15,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1661232.0, ans=0.2 2023-06-24 04:57:20,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1661232.0, ans=0.1 2023-06-24 04:57:59,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1661352.0, ans=0.125 2023-06-24 04:58:19,002 INFO [train.py:996] (1/4) Epoch 10, batch 2450, loss[loss=0.2062, simple_loss=0.2695, pruned_loss=0.07148, over 21729.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3086, pruned_loss=0.08226, over 4262748.42 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:58:50,934 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 7.393e+02 1.203e+03 1.868e+03 3.512e+03, threshold=2.405e+03, percent-clipped=21.0 2023-06-24 04:59:35,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1661592.0, ans=0.0 2023-06-24 04:59:58,066 INFO [train.py:996] (1/4) Epoch 10, batch 2500, loss[loss=0.2449, simple_loss=0.3428, pruned_loss=0.0735, over 21317.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3067, pruned_loss=0.08131, over 4261671.97 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:00:27,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-24 05:00:48,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1661832.0, ans=0.125 2023-06-24 05:00:54,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-24 05:01:13,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661892.0, ans=0.1 2023-06-24 05:01:39,219 INFO [train.py:996] (1/4) Epoch 10, batch 2550, loss[loss=0.209, simple_loss=0.2747, pruned_loss=0.07168, over 21271.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3071, pruned_loss=0.08098, over 4249603.73 frames. 
], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:02:02,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1662072.0, ans=0.1 2023-06-24 05:02:04,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662072.0, ans=0.125 2023-06-24 05:02:04,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1662072.0, ans=0.0 2023-06-24 05:02:11,859 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.144e+02 9.794e+02 1.361e+03 2.807e+03, threshold=1.959e+03, percent-clipped=4.0 2023-06-24 05:02:30,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1662132.0, ans=22.5 2023-06-24 05:02:48,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1662192.0, ans=0.125 2023-06-24 05:02:59,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1662252.0, ans=10.0 2023-06-24 05:03:00,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1662252.0, ans=0.125 2023-06-24 05:03:01,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-24 05:03:17,554 INFO [train.py:996] (1/4) Epoch 10, batch 2600, loss[loss=0.2214, simple_loss=0.2912, pruned_loss=0.07581, over 21298.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3092, pruned_loss=0.08287, over 4264750.65 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:03:49,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1662372.0, ans=0.125 2023-06-24 05:03:51,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 05:04:16,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-24 05:04:32,449 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.23 vs. limit=22.5 2023-06-24 05:04:58,963 INFO [train.py:996] (1/4) Epoch 10, batch 2650, loss[loss=0.2377, simple_loss=0.3131, pruned_loss=0.08114, over 21342.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3093, pruned_loss=0.08325, over 4275158.33 frames. ], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:05:20,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1662672.0, ans=0.2 2023-06-24 05:05:32,844 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 7.187e+02 9.516e+02 1.311e+03 3.015e+03, threshold=1.903e+03, percent-clipped=11.0 2023-06-24 05:05:35,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. 
limit=6.0 2023-06-24 05:06:09,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1662792.0, ans=0.0 2023-06-24 05:06:19,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-24 05:06:34,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1662852.0, ans=0.2 2023-06-24 05:06:40,829 INFO [train.py:996] (1/4) Epoch 10, batch 2700, loss[loss=0.225, simple_loss=0.2947, pruned_loss=0.07766, over 21444.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3098, pruned_loss=0.08373, over 4272012.24 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:06:42,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662912.0, ans=0.1 2023-06-24 05:06:56,953 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:07:00,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1662972.0, ans=0.0 2023-06-24 05:07:03,618 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.06 vs. limit=12.0 2023-06-24 05:07:39,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1663092.0, ans=0.0 2023-06-24 05:08:21,443 INFO [train.py:996] (1/4) Epoch 10, batch 2750, loss[loss=0.1981, simple_loss=0.2655, pruned_loss=0.06537, over 21558.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3085, pruned_loss=0.08277, over 4282113.04 frames. ], batch size: 212, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:08:54,978 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.097e+02 1.070e+03 1.539e+03 2.944e+03, threshold=2.139e+03, percent-clipped=11.0 2023-06-24 05:09:24,146 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-06-24 05:10:04,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1663452.0, ans=0.0 2023-06-24 05:10:07,022 INFO [train.py:996] (1/4) Epoch 10, batch 2800, loss[loss=0.3494, simple_loss=0.4212, pruned_loss=0.1388, over 21511.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3128, pruned_loss=0.08208, over 4270872.76 frames. ], batch size: 507, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:11:31,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=8.0 2023-06-24 05:11:47,590 INFO [train.py:996] (1/4) Epoch 10, batch 2850, loss[loss=0.1703, simple_loss=0.2241, pruned_loss=0.05822, over 21377.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3136, pruned_loss=0.08282, over 4268968.83 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:12:06,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-24 05:12:14,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.35 vs. 
limit=15.0 2023-06-24 05:12:27,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 7.888e+02 1.288e+03 1.995e+03 6.558e+03, threshold=2.577e+03, percent-clipped=20.0 2023-06-24 05:12:29,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1663932.0, ans=0.125 2023-06-24 05:12:49,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1663992.0, ans=0.2 2023-06-24 05:12:50,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1663992.0, ans=0.1 2023-06-24 05:13:05,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1664052.0, ans=0.0 2023-06-24 05:13:27,342 INFO [train.py:996] (1/4) Epoch 10, batch 2900, loss[loss=0.2291, simple_loss=0.3011, pruned_loss=0.07854, over 21844.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3106, pruned_loss=0.08224, over 4263116.77 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:13:41,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1664172.0, ans=0.2 2023-06-24 05:14:15,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1664232.0, ans=0.0 2023-06-24 05:14:42,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1664292.0, ans=0.0 2023-06-24 05:14:50,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1664352.0, ans=0.0 2023-06-24 05:14:59,788 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:15:05,622 INFO [train.py:996] (1/4) Epoch 10, batch 2950, loss[loss=0.24, simple_loss=0.3333, pruned_loss=0.07335, over 21797.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3122, pruned_loss=0.0821, over 4275152.68 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:15:06,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=12.0 2023-06-24 05:15:37,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1664472.0, ans=0.1 2023-06-24 05:15:45,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 6.745e+02 8.632e+02 1.337e+03 3.191e+03, threshold=1.726e+03, percent-clipped=2.0 2023-06-24 05:16:04,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. limit=15.0 2023-06-24 05:16:44,582 INFO [train.py:996] (1/4) Epoch 10, batch 3000, loss[loss=0.2725, simple_loss=0.3485, pruned_loss=0.09829, over 21553.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3136, pruned_loss=0.08223, over 4273578.33 frames. 
], batch size: 414, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:16:44,582 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 05:16:53,530 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.0361, 1.9661, 1.8276, 2.7445], device='cuda:1') 2023-06-24 05:17:00,550 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2505, simple_loss=0.3452, pruned_loss=0.07794, over 1796401.00 frames. 2023-06-24 05:17:00,551 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 05:17:12,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1664712.0, ans=0.0 2023-06-24 05:17:28,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.10 vs. limit=15.0 2023-06-24 05:17:30,544 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:17:35,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1664772.0, ans=0.125 2023-06-24 05:17:46,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1664832.0, ans=0.2 2023-06-24 05:18:02,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1664892.0, ans=0.2 2023-06-24 05:18:23,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1664952.0, ans=0.125 2023-06-24 05:18:31,278 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:18:45,524 INFO [train.py:996] (1/4) Epoch 10, batch 3050, loss[loss=0.2348, simple_loss=0.3248, pruned_loss=0.07243, over 21456.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3156, pruned_loss=0.0818, over 4278992.37 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:18:47,606 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:19:21,827 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.265e+02 9.515e+02 1.393e+03 2.651e+03, threshold=1.903e+03, percent-clipped=13.0 2023-06-24 05:19:27,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1665132.0, ans=0.5 2023-06-24 05:19:55,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1665192.0, ans=0.2 2023-06-24 05:20:06,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1665252.0, ans=0.125 2023-06-24 05:20:25,962 INFO [train.py:996] (1/4) Epoch 10, batch 3100, loss[loss=0.1823, simple_loss=0.2758, pruned_loss=0.04444, over 21585.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3156, pruned_loss=0.08061, over 4281960.72 frames. 
], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:20:47,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1665372.0, ans=0.0 2023-06-24 05:21:10,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-24 05:21:59,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665552.0, ans=0.1 2023-06-24 05:22:10,439 INFO [train.py:996] (1/4) Epoch 10, batch 3150, loss[loss=0.2806, simple_loss=0.3475, pruned_loss=0.1068, over 21318.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3168, pruned_loss=0.08101, over 4284823.73 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 8.0 2023-06-24 05:22:19,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1665612.0, ans=0.07 2023-06-24 05:22:22,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1665612.0, ans=0.0 2023-06-24 05:22:55,171 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.727e+02 1.146e+03 1.592e+03 4.239e+03, threshold=2.292e+03, percent-clipped=10.0 2023-06-24 05:23:46,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1665852.0, ans=0.0 2023-06-24 05:23:53,259 INFO [train.py:996] (1/4) Epoch 10, batch 3200, loss[loss=0.2147, simple_loss=0.2811, pruned_loss=0.07412, over 21187.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3178, pruned_loss=0.0815, over 4276817.26 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:24:00,549 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:24:03,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1665912.0, ans=0.125 2023-06-24 05:24:56,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1666092.0, ans=0.2 2023-06-24 05:25:25,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1666152.0, ans=0.125 2023-06-24 05:25:34,001 INFO [train.py:996] (1/4) Epoch 10, batch 3250, loss[loss=0.2592, simple_loss=0.3569, pruned_loss=0.08073, over 21493.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3206, pruned_loss=0.08281, over 4280611.75 frames. 
], batch size: 471, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:25:57,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1666272.0, ans=0.125 2023-06-24 05:26:16,587 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 5.714e+02 8.207e+02 1.474e+03 3.383e+03, threshold=1.641e+03, percent-clipped=8.0 2023-06-24 05:26:52,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1666392.0, ans=0.125 2023-06-24 05:26:54,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1666392.0, ans=0.0 2023-06-24 05:27:15,044 INFO [train.py:996] (1/4) Epoch 10, batch 3300, loss[loss=0.1983, simple_loss=0.267, pruned_loss=0.06479, over 21199.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3172, pruned_loss=0.08209, over 4272301.40 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:27:17,632 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-24 05:27:40,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1666572.0, ans=0.2 2023-06-24 05:28:32,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1666692.0, ans=0.125 2023-06-24 05:28:55,173 INFO [train.py:996] (1/4) Epoch 10, batch 3350, loss[loss=0.2738, simple_loss=0.343, pruned_loss=0.1023, over 21373.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3181, pruned_loss=0.08292, over 4271020.46 frames. ], batch size: 549, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:29:31,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1666872.0, ans=0.0 2023-06-24 05:29:42,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.528e+02 7.146e+02 1.056e+03 1.768e+03 3.632e+03, threshold=2.111e+03, percent-clipped=30.0 2023-06-24 05:29:50,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1666932.0, ans=0.0 2023-06-24 05:30:38,705 INFO [train.py:996] (1/4) Epoch 10, batch 3400, loss[loss=0.2576, simple_loss=0.3605, pruned_loss=0.07741, over 20724.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3196, pruned_loss=0.0838, over 4278691.12 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:30:41,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1667112.0, ans=0.125 2023-06-24 05:30:56,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667112.0, ans=0.1 2023-06-24 05:32:18,967 INFO [train.py:996] (1/4) Epoch 10, batch 3450, loss[loss=0.2122, simple_loss=0.274, pruned_loss=0.07523, over 21453.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.314, pruned_loss=0.08319, over 4276865.94 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:32:35,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1667412.0, ans=0.0 2023-06-24 05:32:37,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.14 vs. 
limit=10.0 2023-06-24 05:32:56,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1667472.0, ans=0.125 2023-06-24 05:33:00,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 9.142e+02 1.242e+03 1.836e+03 3.790e+03, threshold=2.483e+03, percent-clipped=19.0 2023-06-24 05:33:23,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667592.0, ans=0.1 2023-06-24 05:33:25,223 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-24 05:33:32,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1667592.0, ans=0.0 2023-06-24 05:34:02,182 INFO [train.py:996] (1/4) Epoch 10, batch 3500, loss[loss=0.2694, simple_loss=0.346, pruned_loss=0.09638, over 21827.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3225, pruned_loss=0.08714, over 4278780.87 frames. ], batch size: 124, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:34:07,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1667712.0, ans=0.125 2023-06-24 05:34:28,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1667772.0, ans=0.2 2023-06-24 05:34:34,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1667772.0, ans=0.2 2023-06-24 05:34:42,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1667832.0, ans=0.0 2023-06-24 05:34:50,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1667832.0, ans=0.1 2023-06-24 05:34:55,303 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1667832.0, ans=0.0 2023-06-24 05:35:16,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1667892.0, ans=0.0 2023-06-24 05:35:32,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1667952.0, ans=0.0 2023-06-24 05:35:41,153 INFO [train.py:996] (1/4) Epoch 10, batch 3550, loss[loss=0.2169, simple_loss=0.2791, pruned_loss=0.07731, over 21852.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3248, pruned_loss=0.08881, over 4280093.35 frames. 
], batch size: 107, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:35:41,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1668012.0, ans=0.2 2023-06-24 05:36:09,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668072.0, ans=0.1 2023-06-24 05:36:22,740 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 8.129e+02 1.130e+03 1.802e+03 3.924e+03, threshold=2.259e+03, percent-clipped=11.0 2023-06-24 05:36:23,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1668132.0, ans=0.0 2023-06-24 05:36:57,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-24 05:37:10,373 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-24 05:37:11,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1668252.0, ans=0.125 2023-06-24 05:37:20,206 INFO [train.py:996] (1/4) Epoch 10, batch 3600, loss[loss=0.2292, simple_loss=0.3189, pruned_loss=0.06979, over 20760.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.321, pruned_loss=0.08877, over 4275349.64 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:37:20,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668312.0, ans=0.1 2023-06-24 05:37:24,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1668312.0, ans=0.125 2023-06-24 05:37:31,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-24 05:37:32,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1668312.0, ans=0.125 2023-06-24 05:37:59,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-24 05:39:01,635 INFO [train.py:996] (1/4) Epoch 10, batch 3650, loss[loss=0.1918, simple_loss=0.2751, pruned_loss=0.05424, over 21482.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3205, pruned_loss=0.08799, over 4274387.80 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:39:10,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1668612.0, ans=0.125 2023-06-24 05:39:33,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1668672.0, ans=0.05 2023-06-24 05:39:43,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 6.332e+02 8.468e+02 1.461e+03 3.139e+03, threshold=1.694e+03, percent-clipped=4.0 2023-06-24 05:40:07,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-06-24 05:40:32,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-24 05:40:39,711 INFO [train.py:996] (1/4) Epoch 10, batch 3700, loss[loss=0.2618, simple_loss=0.3389, pruned_loss=0.09238, over 21864.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3178, pruned_loss=0.08683, over 4271424.35 frames. ], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:40:57,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1668912.0, ans=0.0 2023-06-24 05:41:39,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669092.0, ans=0.1 2023-06-24 05:42:17,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=15.0 2023-06-24 05:42:20,651 INFO [train.py:996] (1/4) Epoch 10, batch 3750, loss[loss=0.2804, simple_loss=0.3433, pruned_loss=0.1088, over 21667.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.316, pruned_loss=0.08564, over 4281643.68 frames. ], batch size: 508, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:42:37,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1669212.0, ans=0.02 2023-06-24 05:42:42,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1669272.0, ans=0.0 2023-06-24 05:43:00,072 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 7.119e+02 9.667e+02 1.381e+03 3.413e+03, threshold=1.933e+03, percent-clipped=11.0 2023-06-24 05:43:08,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1669332.0, ans=0.125 2023-06-24 05:44:00,647 INFO [train.py:996] (1/4) Epoch 10, batch 3800, loss[loss=0.2032, simple_loss=0.2815, pruned_loss=0.06244, over 20115.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3132, pruned_loss=0.08337, over 4277459.58 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:44:25,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1669572.0, ans=0.025 2023-06-24 05:44:59,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1669692.0, ans=0.09899494936611666 2023-06-24 05:45:05,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1669692.0, ans=0.0 2023-06-24 05:45:18,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1669752.0, ans=0.125 2023-06-24 05:45:19,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1669752.0, ans=0.125 2023-06-24 05:45:34,341 INFO [train.py:996] (1/4) Epoch 10, batch 3850, loss[loss=0.2083, simple_loss=0.2935, pruned_loss=0.06161, over 21406.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.311, pruned_loss=0.08385, over 4277842.75 frames. 
], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:45:59,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1669872.0, ans=0.0 2023-06-24 05:46:02,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1669872.0, ans=0.125 2023-06-24 05:46:05,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1669932.0, ans=0.025 2023-06-24 05:46:06,332 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.682e+02 9.971e+02 1.611e+03 3.519e+03, threshold=1.994e+03, percent-clipped=16.0 2023-06-24 05:46:06,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1669932.0, ans=0.0 2023-06-24 05:47:06,827 INFO [train.py:996] (1/4) Epoch 10, batch 3900, loss[loss=0.2411, simple_loss=0.3065, pruned_loss=0.08788, over 21899.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3069, pruned_loss=0.08333, over 4282655.30 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:47:17,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1670112.0, ans=0.5 2023-06-24 05:47:25,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1670112.0, ans=0.125 2023-06-24 05:47:27,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670172.0, ans=0.1 2023-06-24 05:47:49,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-24 05:47:50,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1670232.0, ans=0.125 2023-06-24 05:48:33,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1670352.0, ans=0.125 2023-06-24 05:48:34,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-24 05:48:51,330 INFO [train.py:996] (1/4) Epoch 10, batch 3950, loss[loss=0.1802, simple_loss=0.2685, pruned_loss=0.04592, over 21793.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3086, pruned_loss=0.08217, over 4285343.99 frames. ], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:49:26,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. 
limit=15.0 2023-06-24 05:49:28,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.660e+02 7.100e+02 1.207e+03 1.862e+03 3.460e+03, threshold=2.413e+03, percent-clipped=21.0 2023-06-24 05:49:40,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1670532.0, ans=0.2 2023-06-24 05:49:40,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670532.0, ans=0.1 2023-06-24 05:50:29,182 INFO [train.py:996] (1/4) Epoch 10, batch 4000, loss[loss=0.2244, simple_loss=0.2916, pruned_loss=0.07857, over 22015.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3031, pruned_loss=0.07889, over 4285306.31 frames. ], batch size: 103, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:51:10,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1670832.0, ans=0.0 2023-06-24 05:51:34,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-24 05:51:55,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1670952.0, ans=0.125 2023-06-24 05:52:09,796 INFO [train.py:996] (1/4) Epoch 10, batch 4050, loss[loss=0.1846, simple_loss=0.2425, pruned_loss=0.06336, over 20672.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3013, pruned_loss=0.0776, over 4283013.58 frames. ], batch size: 607, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:52:19,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671012.0, ans=0.1 2023-06-24 05:52:28,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1671072.0, ans=0.0 2023-06-24 05:52:51,705 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.639e+02 6.579e+02 8.856e+02 1.407e+03 2.917e+03, threshold=1.771e+03, percent-clipped=4.0 2023-06-24 05:52:52,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-24 05:53:04,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-24 05:53:49,035 INFO [train.py:996] (1/4) Epoch 10, batch 4100, loss[loss=0.2146, simple_loss=0.2997, pruned_loss=0.0648, over 21817.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3052, pruned_loss=0.07927, over 4291613.44 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:53:59,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1671312.0, ans=0.0 2023-06-24 05:54:34,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1671432.0, ans=0.125 2023-06-24 05:54:57,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1671492.0, ans=0.0 2023-06-24 05:55:28,614 INFO [train.py:996] (1/4) Epoch 10, batch 4150, loss[loss=0.1745, simple_loss=0.2668, pruned_loss=0.04104, over 21674.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3056, pruned_loss=0.07706, over 4282439.85 frames. 
], batch size: 247, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:56:15,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671732.0, ans=0.1 2023-06-24 05:56:16,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1671732.0, ans=0.125 2023-06-24 05:56:17,909 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.615e+02 8.834e+02 1.095e+03 2.475e+03, threshold=1.767e+03, percent-clipped=7.0 2023-06-24 05:56:51,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1671852.0, ans=0.0 2023-06-24 05:57:09,391 INFO [train.py:996] (1/4) Epoch 10, batch 4200, loss[loss=0.2329, simple_loss=0.3379, pruned_loss=0.06395, over 19837.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3052, pruned_loss=0.07586, over 4273903.86 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:57:09,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1671912.0, ans=0.0 2023-06-24 05:57:22,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1671912.0, ans=0.125 2023-06-24 05:57:42,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1671972.0, ans=0.0 2023-06-24 05:57:53,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1671972.0, ans=0.0 2023-06-24 05:59:00,796 INFO [train.py:996] (1/4) Epoch 10, batch 4250, loss[loss=0.2719, simple_loss=0.3419, pruned_loss=0.1009, over 21342.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3094, pruned_loss=0.07685, over 4275829.70 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:59:01,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1672212.0, ans=0.0 2023-06-24 05:59:26,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1672272.0, ans=0.0 2023-06-24 05:59:36,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1672332.0, ans=0.125 2023-06-24 05:59:40,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 7.026e+02 9.985e+02 1.582e+03 3.548e+03, threshold=1.997e+03, percent-clipped=19.0 2023-06-24 05:59:55,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1672332.0, ans=0.125 2023-06-24 06:00:28,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1672452.0, ans=0.0 2023-06-24 06:00:43,131 INFO [train.py:996] (1/4) Epoch 10, batch 4300, loss[loss=0.2199, simple_loss=0.3228, pruned_loss=0.05851, over 21843.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3146, pruned_loss=0.07852, over 4277506.15 frames. 
], batch size: 371, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:01:08,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1672572.0, ans=0.125 2023-06-24 06:01:54,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1672692.0, ans=10.0 2023-06-24 06:01:54,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1672692.0, ans=0.0 2023-06-24 06:02:26,300 INFO [train.py:996] (1/4) Epoch 10, batch 4350, loss[loss=0.2156, simple_loss=0.2862, pruned_loss=0.07255, over 21449.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3143, pruned_loss=0.07849, over 4274306.72 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:02:56,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1672872.0, ans=0.2 2023-06-24 06:03:05,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 6.687e+02 1.042e+03 1.785e+03 5.548e+03, threshold=2.083e+03, percent-clipped=20.0 2023-06-24 06:03:10,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1672932.0, ans=10.0 2023-06-24 06:04:03,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1673112.0, ans=0.035 2023-06-24 06:04:04,325 INFO [train.py:996] (1/4) Epoch 10, batch 4400, loss[loss=0.219, simple_loss=0.2941, pruned_loss=0.07198, over 21752.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3107, pruned_loss=0.07814, over 4266316.72 frames. ], batch size: 124, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:04:30,144 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-24 06:05:07,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1673292.0, ans=0.0 2023-06-24 06:05:15,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1673292.0, ans=0.0 2023-06-24 06:05:38,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=12.0 2023-06-24 06:05:45,308 INFO [train.py:996] (1/4) Epoch 10, batch 4450, loss[loss=0.2717, simple_loss=0.3659, pruned_loss=0.0887, over 21844.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.32, pruned_loss=0.0804, over 4266838.99 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:05:45,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1673412.0, ans=0.0 2023-06-24 06:06:01,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1673412.0, ans=0.125 2023-06-24 06:06:30,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.408e+02 1.039e+03 1.368e+03 2.536e+03, threshold=2.077e+03, percent-clipped=7.0 2023-06-24 06:07:23,220 INFO [train.py:996] (1/4) Epoch 10, batch 4500, loss[loss=0.2469, simple_loss=0.3099, pruned_loss=0.09191, over 21701.00 frames. 
], tot_loss[loss=0.24, simple_loss=0.3177, pruned_loss=0.08116, over 4275012.38 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:07:33,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673712.0, ans=0.1 2023-06-24 06:07:38,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1673712.0, ans=0.125 2023-06-24 06:07:44,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1673772.0, ans=0.125 2023-06-24 06:07:48,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.84 vs. limit=15.0 2023-06-24 06:07:57,343 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1673772.0, ans=0.2 2023-06-24 06:08:22,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1673832.0, ans=0.0 2023-06-24 06:08:34,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1673892.0, ans=0.125 2023-06-24 06:08:38,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1673892.0, ans=0.0 2023-06-24 06:08:53,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1673952.0, ans=0.125 2023-06-24 06:09:08,468 INFO [train.py:996] (1/4) Epoch 10, batch 4550, loss[loss=0.291, simple_loss=0.3672, pruned_loss=0.1074, over 21813.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3196, pruned_loss=0.08106, over 4279778.66 frames. ], batch size: 124, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:09:14,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-24 06:09:47,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1674072.0, ans=0.0 2023-06-24 06:09:52,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1674132.0, ans=0.125 2023-06-24 06:09:55,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.449e+02 7.044e+02 9.475e+02 1.574e+03 2.834e+03, threshold=1.895e+03, percent-clipped=10.0 2023-06-24 06:10:11,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1674192.0, ans=0.0 2023-06-24 06:10:27,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1674252.0, ans=0.0 2023-06-24 06:10:39,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1674252.0, ans=0.04949747468305833 2023-06-24 06:10:46,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1674252.0, ans=0.125 2023-06-24 06:10:49,477 INFO [train.py:996] (1/4) Epoch 10, batch 4600, loss[loss=0.2335, simple_loss=0.3033, pruned_loss=0.08186, over 21221.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3248, pruned_loss=0.08372, over 4281172.06 frames. 
], batch size: 143, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:10:58,155 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-24 06:11:37,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-24 06:11:47,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-24 06:12:24,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1674552.0, ans=0.125 2023-06-24 06:12:27,176 INFO [train.py:996] (1/4) Epoch 10, batch 4650, loss[loss=0.1921, simple_loss=0.262, pruned_loss=0.06108, over 21881.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3192, pruned_loss=0.08244, over 4286287.43 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:12:50,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1674672.0, ans=0.0 2023-06-24 06:12:52,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1674672.0, ans=0.125 2023-06-24 06:13:10,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1674732.0, ans=0.2 2023-06-24 06:13:18,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 6.703e+02 9.514e+02 1.360e+03 2.442e+03, threshold=1.903e+03, percent-clipped=9.0 2023-06-24 06:13:21,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1674732.0, ans=0.1 2023-06-24 06:13:24,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1674732.0, ans=0.125 2023-06-24 06:13:35,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1674792.0, ans=0.0 2023-06-24 06:13:50,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1674852.0, ans=0.0 2023-06-24 06:14:05,764 INFO [train.py:996] (1/4) Epoch 10, batch 4700, loss[loss=0.2126, simple_loss=0.2734, pruned_loss=0.07591, over 21591.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3096, pruned_loss=0.07997, over 4287628.49 frames. 
], batch size: 415, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:14:20,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1674912.0, ans=0.125 2023-06-24 06:14:49,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1675032.0, ans=0.2 2023-06-24 06:15:06,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1675092.0, ans=0.04949747468305833 2023-06-24 06:15:30,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1675152.0, ans=0.125 2023-06-24 06:15:33,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1675152.0, ans=0.125 2023-06-24 06:15:44,559 INFO [train.py:996] (1/4) Epoch 10, batch 4750, loss[loss=0.2532, simple_loss=0.3261, pruned_loss=0.09017, over 21890.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3046, pruned_loss=0.08036, over 4287189.75 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:15:48,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1675212.0, ans=0.125 2023-06-24 06:16:16,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675272.0, ans=0.1 2023-06-24 06:16:35,097 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.386e+02 6.636e+02 9.717e+02 1.456e+03 3.310e+03, threshold=1.943e+03, percent-clipped=9.0 2023-06-24 06:16:51,251 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1675392.0, ans=0.0 2023-06-24 06:17:27,602 INFO [train.py:996] (1/4) Epoch 10, batch 4800, loss[loss=0.1771, simple_loss=0.2428, pruned_loss=0.05576, over 21192.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3038, pruned_loss=0.07936, over 4289616.73 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:17:45,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1675512.0, ans=0.1 2023-06-24 06:17:47,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1675572.0, ans=0.09899494936611666 2023-06-24 06:18:11,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1675632.0, ans=0.125 2023-06-24 06:18:37,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1675692.0, ans=0.04949747468305833 2023-06-24 06:19:05,392 INFO [train.py:996] (1/4) Epoch 10, batch 4850, loss[loss=0.2329, simple_loss=0.311, pruned_loss=0.07745, over 21631.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3022, pruned_loss=0.07898, over 4284851.98 frames. 
], batch size: 389, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:19:24,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1675872.0, ans=0.0 2023-06-24 06:19:30,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1675872.0, ans=0.125 2023-06-24 06:19:53,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.198e+02 7.060e+02 1.085e+03 1.594e+03 2.809e+03, threshold=2.169e+03, percent-clipped=13.0 2023-06-24 06:20:00,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-24 06:20:08,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1675992.0, ans=0.125 2023-06-24 06:20:36,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1676052.0, ans=0.05 2023-06-24 06:20:45,023 INFO [train.py:996] (1/4) Epoch 10, batch 4900, loss[loss=0.2315, simple_loss=0.325, pruned_loss=0.06906, over 21498.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3046, pruned_loss=0.07925, over 4280038.72 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:20:59,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1676112.0, ans=0.125 2023-06-24 06:21:22,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1676172.0, ans=0.125 2023-06-24 06:22:23,103 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-24 06:22:29,702 INFO [train.py:996] (1/4) Epoch 10, batch 4950, loss[loss=0.2387, simple_loss=0.328, pruned_loss=0.07475, over 20642.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3079, pruned_loss=0.07739, over 4277158.03 frames. 
], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:22:33,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1676412.0, ans=0.125 2023-06-24 06:22:58,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1676472.0, ans=0.125 2023-06-24 06:23:12,528 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.943e+02 9.817e+02 1.512e+03 3.334e+03, threshold=1.963e+03, percent-clipped=7.0 2023-06-24 06:23:14,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676532.0, ans=0.1 2023-06-24 06:23:23,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1676592.0, ans=0.0 2023-06-24 06:23:23,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1676592.0, ans=0.2 2023-06-24 06:23:37,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1676592.0, ans=0.125 2023-06-24 06:23:45,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1676652.0, ans=0.125 2023-06-24 06:24:03,590 INFO [train.py:996] (1/4) Epoch 10, batch 5000, loss[loss=0.248, simple_loss=0.3501, pruned_loss=0.07295, over 21282.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3103, pruned_loss=0.07519, over 4281807.94 frames. ], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:24:16,594 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.099e-02 2023-06-24 06:25:18,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1676892.0, ans=0.0 2023-06-24 06:25:41,901 INFO [train.py:996] (1/4) Epoch 10, batch 5050, loss[loss=0.2215, simple_loss=0.2895, pruned_loss=0.07679, over 21722.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3117, pruned_loss=0.07705, over 4288586.35 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:26:16,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1677072.0, ans=0.0 2023-06-24 06:26:30,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 6.527e+02 8.985e+02 1.399e+03 2.450e+03, threshold=1.797e+03, percent-clipped=5.0 2023-06-24 06:26:49,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1677192.0, ans=0.0 2023-06-24 06:27:04,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1677252.0, ans=0.125 2023-06-24 06:27:21,208 INFO [train.py:996] (1/4) Epoch 10, batch 5100, loss[loss=0.2134, simple_loss=0.295, pruned_loss=0.0659, over 21788.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3091, pruned_loss=0.0779, over 4290302.86 frames. 
], batch size: 414, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:27:21,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1677312.0, ans=0.125 2023-06-24 06:27:40,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1677372.0, ans=0.125 2023-06-24 06:29:01,709 INFO [train.py:996] (1/4) Epoch 10, batch 5150, loss[loss=0.2687, simple_loss=0.3401, pruned_loss=0.09861, over 21710.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3083, pruned_loss=0.07817, over 4287676.28 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:29:08,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1677612.0, ans=0.125 2023-06-24 06:29:36,582 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-24 06:29:50,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 6.243e+02 9.247e+02 1.548e+03 4.552e+03, threshold=1.849e+03, percent-clipped=17.0 2023-06-24 06:29:52,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-24 06:30:29,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677852.0, ans=0.1 2023-06-24 06:30:31,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677852.0, ans=0.1 2023-06-24 06:30:41,413 INFO [train.py:996] (1/4) Epoch 10, batch 5200, loss[loss=0.2331, simple_loss=0.3317, pruned_loss=0.06721, over 21663.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3101, pruned_loss=0.0791, over 4292423.25 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:30:53,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1677912.0, ans=0.125 2023-06-24 06:31:18,331 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-24 06:31:48,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-24 06:32:07,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678152.0, ans=0.1 2023-06-24 06:32:12,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1678152.0, ans=0.2 2023-06-24 06:32:20,317 INFO [train.py:996] (1/4) Epoch 10, batch 5250, loss[loss=0.2564, simple_loss=0.335, pruned_loss=0.08894, over 21838.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3153, pruned_loss=0.07828, over 4289974.82 frames. 
], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:32:49,927 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:32:54,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1678272.0, ans=0.0 2023-06-24 06:33:00,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1678272.0, ans=0.125 2023-06-24 06:33:08,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678332.0, ans=0.1 2023-06-24 06:33:09,830 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 5.630e+02 8.256e+02 1.146e+03 2.990e+03, threshold=1.651e+03, percent-clipped=4.0 2023-06-24 06:33:10,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1678332.0, ans=0.0 2023-06-24 06:33:31,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1678392.0, ans=0.125 2023-06-24 06:33:48,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678452.0, ans=0.1 2023-06-24 06:34:00,831 INFO [train.py:996] (1/4) Epoch 10, batch 5300, loss[loss=0.2278, simple_loss=0.2949, pruned_loss=0.08034, over 21841.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3126, pruned_loss=0.07847, over 4296368.97 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:34:42,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678632.0, ans=0.1 2023-06-24 06:35:03,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1678692.0, ans=0.125 2023-06-24 06:35:04,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-24 06:35:05,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-24 06:35:11,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1678692.0, ans=0.125 2023-06-24 06:35:23,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1678752.0, ans=12.0 2023-06-24 06:35:38,841 INFO [train.py:996] (1/4) Epoch 10, batch 5350, loss[loss=0.2783, simple_loss=0.3341, pruned_loss=0.1113, over 21833.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3116, pruned_loss=0.08057, over 4303783.24 frames. 
], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:35:39,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1678812.0, ans=0.0 2023-06-24 06:36:12,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1678872.0, ans=0.125 2023-06-24 06:36:23,613 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 6.516e+02 8.462e+02 1.218e+03 2.526e+03, threshold=1.692e+03, percent-clipped=10.0 2023-06-24 06:36:24,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1678932.0, ans=0.125 2023-06-24 06:36:35,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1678992.0, ans=0.2 2023-06-24 06:37:13,521 INFO [train.py:996] (1/4) Epoch 10, batch 5400, loss[loss=0.2314, simple_loss=0.3121, pruned_loss=0.07537, over 21770.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3101, pruned_loss=0.08075, over 4297175.80 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:37:44,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1679172.0, ans=0.125 2023-06-24 06:38:58,906 INFO [train.py:996] (1/4) Epoch 10, batch 5450, loss[loss=0.2385, simple_loss=0.3448, pruned_loss=0.06605, over 21359.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3106, pruned_loss=0.07822, over 4293332.47 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:39:36,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1679472.0, ans=0.125 2023-06-24 06:39:53,563 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.727e+02 1.128e+03 1.838e+03 3.883e+03, threshold=2.256e+03, percent-clipped=29.0 2023-06-24 06:40:15,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1679592.0, ans=0.125 2023-06-24 06:40:43,788 INFO [train.py:996] (1/4) Epoch 10, batch 5500, loss[loss=0.2215, simple_loss=0.3189, pruned_loss=0.06205, over 21750.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3158, pruned_loss=0.07569, over 4275826.23 frames. ], batch size: 332, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:42:02,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1679892.0, ans=0.125 2023-06-24 06:42:31,509 INFO [train.py:996] (1/4) Epoch 10, batch 5550, loss[loss=0.2137, simple_loss=0.3041, pruned_loss=0.06161, over 21648.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3161, pruned_loss=0.07376, over 4270369.89 frames. 
], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:42:35,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1680012.0, ans=0.0 2023-06-24 06:42:38,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1680012.0, ans=0.0 2023-06-24 06:42:47,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1680012.0, ans=0.02 2023-06-24 06:43:16,342 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.799e+02 8.739e+02 1.452e+03 3.739e+03, threshold=1.748e+03, percent-clipped=10.0 2023-06-24 06:44:04,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1680252.0, ans=0.1 2023-06-24 06:44:06,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1680252.0, ans=0.05 2023-06-24 06:44:16,152 INFO [train.py:996] (1/4) Epoch 10, batch 5600, loss[loss=0.2196, simple_loss=0.3542, pruned_loss=0.04249, over 19797.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3153, pruned_loss=0.07118, over 4269979.50 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:44:20,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-24 06:44:23,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-24 06:44:42,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1680372.0, ans=0.125 2023-06-24 06:45:06,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1680432.0, ans=0.125 2023-06-24 06:45:07,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1680432.0, ans=0.125 2023-06-24 06:45:11,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1680432.0, ans=0.125 2023-06-24 06:45:55,112 INFO [train.py:996] (1/4) Epoch 10, batch 5650, loss[loss=0.2124, simple_loss=0.2785, pruned_loss=0.07314, over 21351.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3176, pruned_loss=0.07302, over 4272030.42 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:46:01,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1680612.0, ans=0.0 2023-06-24 06:46:03,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680612.0, ans=0.1 2023-06-24 06:46:46,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.674e+02 6.342e+02 8.278e+02 1.256e+03 3.323e+03, threshold=1.656e+03, percent-clipped=10.0 2023-06-24 06:47:25,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1680852.0, ans=0.125 2023-06-24 06:47:34,656 INFO [train.py:996] (1/4) Epoch 10, batch 5700, loss[loss=0.2026, simple_loss=0.2823, pruned_loss=0.06143, over 21554.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3169, pruned_loss=0.07545, over 4275551.55 frames. 
], batch size: 195, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:47:37,464 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.50 vs. limit=10.0 2023-06-24 06:47:41,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680912.0, ans=0.1 2023-06-24 06:47:50,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1680972.0, ans=0.04949747468305833 2023-06-24 06:47:51,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1680972.0, ans=0.125 2023-06-24 06:49:05,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-24 06:49:15,869 INFO [train.py:996] (1/4) Epoch 10, batch 5750, loss[loss=0.1901, simple_loss=0.2827, pruned_loss=0.04873, over 21444.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3115, pruned_loss=0.07269, over 4276488.99 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:49:28,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1681212.0, ans=0.0 2023-06-24 06:49:55,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1681272.0, ans=0.125 2023-06-24 06:50:12,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.981e+02 6.985e+02 1.085e+03 1.966e+03 4.482e+03, threshold=2.170e+03, percent-clipped=31.0 2023-06-24 06:50:23,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1681392.0, ans=0.07 2023-06-24 06:50:53,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1681452.0, ans=0.125 2023-06-24 06:50:55,682 INFO [train.py:996] (1/4) Epoch 10, batch 5800, loss[loss=0.2719, simple_loss=0.3798, pruned_loss=0.08199, over 19961.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3107, pruned_loss=0.07102, over 4267487.92 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:51:02,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1681512.0, ans=0.125 2023-06-24 06:51:25,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1681572.0, ans=0.125 2023-06-24 06:51:45,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2023-06-24 06:52:10,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.85 vs. 
limit=15.0 2023-06-24 06:52:23,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1681752.0, ans=0.125 2023-06-24 06:52:26,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1681752.0, ans=0.0 2023-06-24 06:52:40,277 INFO [train.py:996] (1/4) Epoch 10, batch 5850, loss[loss=0.1793, simple_loss=0.2833, pruned_loss=0.03767, over 21657.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3108, pruned_loss=0.06835, over 4273578.28 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:52:53,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1681812.0, ans=0.125 2023-06-24 06:53:36,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 5.271e+02 8.161e+02 1.450e+03 2.978e+03, threshold=1.632e+03, percent-clipped=6.0 2023-06-24 06:53:40,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1681932.0, ans=0.125 2023-06-24 06:53:53,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-24 06:54:00,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1682052.0, ans=0.125 2023-06-24 06:54:23,940 INFO [train.py:996] (1/4) Epoch 10, batch 5900, loss[loss=0.1868, simple_loss=0.2688, pruned_loss=0.0524, over 21693.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.3037, pruned_loss=0.06404, over 4280517.30 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:55:05,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1682232.0, ans=0.125 2023-06-24 06:56:02,447 INFO [train.py:996] (1/4) Epoch 10, batch 5950, loss[loss=0.2019, simple_loss=0.2623, pruned_loss=0.07074, over 21269.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3011, pruned_loss=0.06632, over 4278849.20 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:56:26,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1682472.0, ans=0.2 2023-06-24 06:56:52,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.371e+02 5.849e+02 7.878e+02 1.100e+03 2.007e+03, threshold=1.576e+03, percent-clipped=6.0 2023-06-24 06:56:59,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-24 06:57:20,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1682652.0, ans=0.125 2023-06-24 06:57:40,442 INFO [train.py:996] (1/4) Epoch 10, batch 6000, loss[loss=0.2418, simple_loss=0.2971, pruned_loss=0.09325, over 21736.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2977, pruned_loss=0.06939, over 4277079.90 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 06:57:40,443 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 06:57:59,722 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2611, simple_loss=0.3564, pruned_loss=0.0829, over 1796401.00 frames. 
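Note on reading the loss fields in these entries: the reported loss is a weighted sum of the two transducer losses, and the numbers throughout this log are consistent with loss = 0.5 * simple_loss + pruned_loss once warm-up is over; for the validation entry just above, 0.5 * 0.3564 + 0.0829 = 0.2611. Below is a minimal Python sketch of that weighting. The 0.5 weight is inferred from the logged numbers; the warm-up ramp, the warm_step default, and the function name are illustrative assumptions, not code taken from train.py.

def combined_loss(simple_loss, pruned_loss, batch_idx_train,
                  warm_step=2000, simple_loss_scale=0.5):
    # After warm-up: full weight on the pruned loss, reduced weight on the simple loss.
    if batch_idx_train >= warm_step:
        s, p = simple_loss_scale, 1.0
    else:
        # Illustrative linear ramp (assumption): rely on the simple loss early,
        # fade in the pruned loss as training stabilizes.
        frac = batch_idx_train / warm_step
        s = 1.0 - frac * (1.0 - simple_loss_scale)
        p = 0.1 + 0.9 * frac
    return s * simple_loss + p * pruned_loss

# With the validation numbers logged just above (well past warm-up):
# combined_loss(0.3564, 0.0829, batch_idx_train=1_700_000)  ->  ~0.2611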
2023-06-24 06:57:59,722 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 06:58:24,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1682772.0, ans=0.2 2023-06-24 06:59:25,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-24 06:59:38,449 INFO [train.py:996] (1/4) Epoch 10, batch 6050, loss[loss=0.256, simple_loss=0.3878, pruned_loss=0.0621, over 20806.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2921, pruned_loss=0.06991, over 4279757.73 frames. ], batch size: 607, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:00:26,526 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.624e+02 5.765e+02 7.695e+02 1.085e+03 2.275e+03, threshold=1.539e+03, percent-clipped=10.0 2023-06-24 07:01:13,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1683252.0, ans=0.125 2023-06-24 07:01:17,086 INFO [train.py:996] (1/4) Epoch 10, batch 6100, loss[loss=0.2554, simple_loss=0.3207, pruned_loss=0.09499, over 21801.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2912, pruned_loss=0.06889, over 4281553.75 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:02:23,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1683492.0, ans=0.125 2023-06-24 07:02:59,567 INFO [train.py:996] (1/4) Epoch 10, batch 6150, loss[loss=0.2897, simple_loss=0.3498, pruned_loss=0.1148, over 21726.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2961, pruned_loss=0.07227, over 4273915.35 frames. ], batch size: 415, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:03:10,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1683612.0, ans=0.125 2023-06-24 07:03:46,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1683732.0, ans=0.125 2023-06-24 07:03:51,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.961e+02 6.743e+02 9.296e+02 1.382e+03 3.230e+03, threshold=1.859e+03, percent-clipped=18.0 2023-06-24 07:04:14,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1683852.0, ans=0.2 2023-06-24 07:04:15,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=8.0 2023-06-24 07:04:38,179 INFO [train.py:996] (1/4) Epoch 10, batch 6200, loss[loss=0.284, simple_loss=0.4129, pruned_loss=0.07757, over 20770.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2998, pruned_loss=0.07312, over 4277152.26 frames. 
], batch size: 607, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:05:05,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1683972.0, ans=0.0 2023-06-24 07:05:26,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1684032.0, ans=0.125 2023-06-24 07:06:13,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1684152.0, ans=0.5 2023-06-24 07:06:16,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1684212.0, ans=0.0 2023-06-24 07:06:17,941 INFO [train.py:996] (1/4) Epoch 10, batch 6250, loss[loss=0.2113, simple_loss=0.3133, pruned_loss=0.05458, over 21639.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3034, pruned_loss=0.07291, over 4279330.16 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:06:24,555 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:06:28,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2023-06-24 07:06:30,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1684212.0, ans=0.125 2023-06-24 07:06:54,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1684332.0, ans=0.125 2023-06-24 07:07:09,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.386e+02 1.187e+03 1.704e+03 4.027e+03, threshold=2.375e+03, percent-clipped=21.0 2023-06-24 07:07:12,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1684332.0, ans=0.0 2023-06-24 07:07:12,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-24 07:07:55,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1684512.0, ans=0.0 2023-06-24 07:07:56,069 INFO [train.py:996] (1/4) Epoch 10, batch 6300, loss[loss=0.2221, simple_loss=0.293, pruned_loss=0.07559, over 21606.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3062, pruned_loss=0.07189, over 4281885.21 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:07:59,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1684512.0, ans=0.025 2023-06-24 07:09:34,396 INFO [train.py:996] (1/4) Epoch 10, batch 6350, loss[loss=0.2643, simple_loss=0.3249, pruned_loss=0.1019, over 21625.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3096, pruned_loss=0.07516, over 4286583.82 frames. 
], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:09:34,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1684812.0, ans=0.0 2023-06-24 07:10:20,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1684932.0, ans=0.125 2023-06-24 07:10:27,525 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 5.999e+02 8.518e+02 1.321e+03 2.305e+03, threshold=1.704e+03, percent-clipped=0.0 2023-06-24 07:10:53,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1684992.0, ans=0.0 2023-06-24 07:10:58,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1685052.0, ans=0.04949747468305833 2023-06-24 07:11:14,616 INFO [train.py:996] (1/4) Epoch 10, batch 6400, loss[loss=0.2609, simple_loss=0.3263, pruned_loss=0.09774, over 22016.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3156, pruned_loss=0.08016, over 4292379.09 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:11:24,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1685112.0, ans=0.125 2023-06-24 07:11:45,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1685172.0, ans=0.125 2023-06-24 07:12:16,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1685232.0, ans=0.125 2023-06-24 07:12:27,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-06-24 07:12:34,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685292.0, ans=0.1 2023-06-24 07:12:53,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1685352.0, ans=0.125 2023-06-24 07:12:59,592 INFO [train.py:996] (1/4) Epoch 10, batch 6450, loss[loss=0.2199, simple_loss=0.3168, pruned_loss=0.06154, over 21201.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3181, pruned_loss=0.07945, over 4291016.14 frames. ], batch size: 548, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:13:53,912 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:13:56,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.370e+02 9.137e+02 1.216e+03 1.629e+03 2.950e+03, threshold=2.432e+03, percent-clipped=21.0 2023-06-24 07:14:18,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1685652.0, ans=0.0 2023-06-24 07:14:41,258 INFO [train.py:996] (1/4) Epoch 10, batch 6500, loss[loss=0.2405, simple_loss=0.3006, pruned_loss=0.09019, over 21785.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3118, pruned_loss=0.0787, over 4289598.81 frames. 
], batch size: 102, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:14:53,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1685712.0, ans=0.125 2023-06-24 07:15:00,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1685772.0, ans=0.0 2023-06-24 07:15:02,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-24 07:15:40,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1685892.0, ans=0.0 2023-06-24 07:15:56,273 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-24 07:15:58,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1685952.0, ans=0.0 2023-06-24 07:16:13,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1685952.0, ans=0.125 2023-06-24 07:16:19,614 INFO [train.py:996] (1/4) Epoch 10, batch 6550, loss[loss=0.247, simple_loss=0.3202, pruned_loss=0.08689, over 21819.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3094, pruned_loss=0.0776, over 4292553.18 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:16:40,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1686072.0, ans=0.0 2023-06-24 07:16:50,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1686072.0, ans=0.125 2023-06-24 07:17:04,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1686132.0, ans=0.125 2023-06-24 07:17:15,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1686132.0, ans=0.2 2023-06-24 07:17:18,461 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 5.692e+02 8.877e+02 1.428e+03 2.273e+03, threshold=1.775e+03, percent-clipped=0.0 2023-06-24 07:17:25,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1686192.0, ans=0.125 2023-06-24 07:17:30,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1686192.0, ans=0.125 2023-06-24 07:17:48,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1686252.0, ans=0.125 2023-06-24 07:17:48,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1686252.0, ans=0.125 2023-06-24 07:18:03,375 INFO [train.py:996] (1/4) Epoch 10, batch 6600, loss[loss=0.2446, simple_loss=0.2992, pruned_loss=0.09504, over 21491.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3057, pruned_loss=0.07809, over 4283871.81 frames. 
], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:18:20,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1686372.0, ans=0.0 2023-06-24 07:18:49,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-24 07:19:01,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1686492.0, ans=0.2 2023-06-24 07:19:33,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1686552.0, ans=0.125 2023-06-24 07:19:37,660 INFO [train.py:996] (1/4) Epoch 10, batch 6650, loss[loss=0.2347, simple_loss=0.312, pruned_loss=0.07873, over 21563.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.299, pruned_loss=0.07478, over 4278932.38 frames. ], batch size: 442, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:20:22,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1686732.0, ans=0.0 2023-06-24 07:20:31,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 5.733e+02 1.031e+03 1.471e+03 3.342e+03, threshold=2.062e+03, percent-clipped=12.0 2023-06-24 07:20:49,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1686792.0, ans=0.035 2023-06-24 07:21:15,691 INFO [train.py:996] (1/4) Epoch 10, batch 6700, loss[loss=0.2012, simple_loss=0.2795, pruned_loss=0.06141, over 21640.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2945, pruned_loss=0.07506, over 4278462.96 frames. ], batch size: 391, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:22:16,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1687092.0, ans=0.05 2023-06-24 07:22:17,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1687092.0, ans=0.0 2023-06-24 07:22:53,764 INFO [train.py:996] (1/4) Epoch 10, batch 6750, loss[loss=0.201, simple_loss=0.2558, pruned_loss=0.07314, over 20304.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2927, pruned_loss=0.07608, over 4281971.29 frames. ], batch size: 703, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:23:11,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1687272.0, ans=0.0 2023-06-24 07:23:23,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1687272.0, ans=0.125 2023-06-24 07:23:47,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.480e+02 8.146e+02 1.101e+03 1.861e+03, threshold=1.629e+03, percent-clipped=0.0 2023-06-24 07:24:32,370 INFO [train.py:996] (1/4) Epoch 10, batch 6800, loss[loss=0.24, simple_loss=0.2998, pruned_loss=0.09008, over 21767.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2951, pruned_loss=0.07874, over 4291321.23 frames. 
], batch size: 333, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:24:35,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1687512.0, ans=0.125 2023-06-24 07:24:37,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1687512.0, ans=0.1 2023-06-24 07:24:43,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1687512.0, ans=0.125 2023-06-24 07:24:51,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687572.0, ans=0.1 2023-06-24 07:25:00,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1687572.0, ans=0.04949747468305833 2023-06-24 07:26:10,781 INFO [train.py:996] (1/4) Epoch 10, batch 6850, loss[loss=0.2071, simple_loss=0.3441, pruned_loss=0.03502, over 20771.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.293, pruned_loss=0.07932, over 4275831.33 frames. ], batch size: 607, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:26:18,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1687812.0, ans=0.2 2023-06-24 07:26:18,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687812.0, ans=0.1 2023-06-24 07:26:25,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-24 07:26:32,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.00 vs. limit=15.0 2023-06-24 07:26:50,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1687872.0, ans=0.125 2023-06-24 07:27:00,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1687932.0, ans=0.125 2023-06-24 07:27:05,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1687932.0, ans=0.0 2023-06-24 07:27:08,129 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.000e+02 9.900e+02 1.479e+03 3.025e+03, threshold=1.980e+03, percent-clipped=16.0 2023-06-24 07:27:51,245 INFO [train.py:996] (1/4) Epoch 10, batch 6900, loss[loss=0.2398, simple_loss=0.3314, pruned_loss=0.07403, over 21742.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2951, pruned_loss=0.07931, over 4283126.15 frames. 
], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:27:54,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1688112.0, ans=22.5 2023-06-24 07:27:58,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1688112.0, ans=0.125 2023-06-24 07:27:58,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1688112.0, ans=0.125 2023-06-24 07:28:03,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1688112.0, ans=0.0 2023-06-24 07:28:17,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688172.0, ans=0.0 2023-06-24 07:28:50,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1688292.0, ans=0.2 2023-06-24 07:29:33,005 INFO [train.py:996] (1/4) Epoch 10, batch 6950, loss[loss=0.2929, simple_loss=0.3543, pruned_loss=0.1157, over 21450.00 frames. ], tot_loss[loss=0.224, simple_loss=0.296, pruned_loss=0.07601, over 4285048.38 frames. ], batch size: 471, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:30:05,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688472.0, ans=0.0 2023-06-24 07:30:07,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 07:30:22,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-24 07:30:25,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1688532.0, ans=0.125 2023-06-24 07:30:25,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688532.0, ans=0.1 2023-06-24 07:30:34,219 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.339e+02 9.938e+02 1.554e+03 2.681e+03, threshold=1.988e+03, percent-clipped=10.0 2023-06-24 07:30:47,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1688592.0, ans=0.04949747468305833 2023-06-24 07:30:56,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1688652.0, ans=0.125 2023-06-24 07:31:11,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1688712.0, ans=0.0 2023-06-24 07:31:12,646 INFO [train.py:996] (1/4) Epoch 10, batch 7000, loss[loss=0.2123, simple_loss=0.2733, pruned_loss=0.0756, over 21374.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2999, pruned_loss=0.07838, over 4278469.46 frames. 
], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:32:38,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1688952.0, ans=0.0 2023-06-24 07:32:51,861 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:32:52,824 INFO [train.py:996] (1/4) Epoch 10, batch 7050, loss[loss=0.1721, simple_loss=0.2601, pruned_loss=0.04204, over 21558.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2972, pruned_loss=0.07631, over 4277303.44 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:33:02,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1689012.0, ans=0.95 2023-06-24 07:33:16,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-24 07:34:00,087 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 8.213e+02 1.284e+03 1.969e+03 3.755e+03, threshold=2.569e+03, percent-clipped=21.0 2023-06-24 07:34:13,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-24 07:34:41,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689252.0, ans=0.125 2023-06-24 07:34:43,749 INFO [train.py:996] (1/4) Epoch 10, batch 7100, loss[loss=0.2214, simple_loss=0.3023, pruned_loss=0.07022, over 21822.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3051, pruned_loss=0.07748, over 4268130.20 frames. ], batch size: 333, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:35:47,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1689492.0, ans=0.0 2023-06-24 07:36:24,355 INFO [train.py:996] (1/4) Epoch 10, batch 7150, loss[loss=0.2424, simple_loss=0.3189, pruned_loss=0.08293, over 21711.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3021, pruned_loss=0.07538, over 4269126.81 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:37:18,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1689732.0, ans=0.125 2023-06-24 07:37:20,926 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 6.261e+02 9.228e+02 1.360e+03 3.235e+03, threshold=1.846e+03, percent-clipped=6.0 2023-06-24 07:37:23,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1689792.0, ans=0.1 2023-06-24 07:38:04,280 INFO [train.py:996] (1/4) Epoch 10, batch 7200, loss[loss=0.246, simple_loss=0.3127, pruned_loss=0.0896, over 21411.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3042, pruned_loss=0.07767, over 4264945.88 frames. 
], batch size: 194, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:38:14,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1689912.0, ans=0.125 2023-06-24 07:39:23,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1690092.0, ans=0.125 2023-06-24 07:39:26,828 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:39:44,069 INFO [train.py:996] (1/4) Epoch 10, batch 7250, loss[loss=0.1879, simple_loss=0.2494, pruned_loss=0.06322, over 21468.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3008, pruned_loss=0.07782, over 4262110.78 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:40:17,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-24 07:40:27,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690332.0, ans=0.1 2023-06-24 07:40:45,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.130e+02 8.584e+02 1.247e+03 2.821e+03, threshold=1.717e+03, percent-clipped=3.0 2023-06-24 07:41:08,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1690452.0, ans=0.125 2023-06-24 07:41:22,775 INFO [train.py:996] (1/4) Epoch 10, batch 7300, loss[loss=0.2154, simple_loss=0.2827, pruned_loss=0.07407, over 21731.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.294, pruned_loss=0.07676, over 4265924.97 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:41:24,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1690512.0, ans=0.125 2023-06-24 07:42:11,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1690632.0, ans=0.0 2023-06-24 07:42:38,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1690692.0, ans=0.125 2023-06-24 07:43:08,164 INFO [train.py:996] (1/4) Epoch 10, batch 7350, loss[loss=0.2117, simple_loss=0.288, pruned_loss=0.06774, over 16336.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2923, pruned_loss=0.07742, over 4261444.54 frames. ], batch size: 60, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:43:38,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1690872.0, ans=0.125 2023-06-24 07:43:40,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1690872.0, ans=0.125 2023-06-24 07:43:56,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1690932.0, ans=0.125 2023-06-24 07:44:04,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.61 vs. 
limit=15.0 2023-06-24 07:44:06,755 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.919e+02 7.422e+02 1.084e+03 1.496e+03 4.269e+03, threshold=2.168e+03, percent-clipped=20.0 2023-06-24 07:44:30,500 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.032e-02 2023-06-24 07:44:49,784 INFO [train.py:996] (1/4) Epoch 10, batch 7400, loss[loss=0.254, simple_loss=0.3296, pruned_loss=0.0892, over 21423.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2987, pruned_loss=0.07859, over 4251566.36 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:44:57,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.87 vs. limit=10.0 2023-06-24 07:45:09,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1691172.0, ans=0.125 2023-06-24 07:45:09,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691172.0, ans=0.0 2023-06-24 07:45:46,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1691232.0, ans=0.125 2023-06-24 07:46:08,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1691352.0, ans=0.2 2023-06-24 07:46:09,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1691352.0, ans=0.125 2023-06-24 07:46:16,679 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:46:17,274 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-24 07:46:29,140 INFO [train.py:996] (1/4) Epoch 10, batch 7450, loss[loss=0.2447, simple_loss=0.2998, pruned_loss=0.09483, over 21616.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2972, pruned_loss=0.07752, over 4258054.66 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:46:45,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691412.0, ans=0.1 2023-06-24 07:47:23,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691532.0, ans=0.1 2023-06-24 07:47:32,874 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.014e+02 8.284e+02 1.438e+03 2.557e+03, threshold=1.657e+03, percent-clipped=4.0 2023-06-24 07:48:15,245 INFO [train.py:996] (1/4) Epoch 10, batch 7500, loss[loss=0.2377, simple_loss=0.3226, pruned_loss=0.07639, over 21295.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3028, pruned_loss=0.07838, over 4265516.85 frames. 
], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:48:46,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1691772.0, ans=0.125 2023-06-24 07:48:49,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1691772.0, ans=0.125 2023-06-24 07:49:12,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1691892.0, ans=10.0 2023-06-24 07:49:37,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1691952.0, ans=0.125 2023-06-24 07:49:43,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=1691952.0, ans=22.5 2023-06-24 07:49:44,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691952.0, ans=0.1 2023-06-24 07:49:56,512 INFO [train.py:996] (1/4) Epoch 10, batch 7550, loss[loss=0.2922, simple_loss=0.3825, pruned_loss=0.1009, over 21471.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3121, pruned_loss=0.0784, over 4272304.81 frames. ], batch size: 507, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:50:40,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1692132.0, ans=0.025 2023-06-24 07:50:51,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-24 07:50:52,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.375e+02 7.111e+02 1.164e+03 1.789e+03 2.953e+03, threshold=2.328e+03, percent-clipped=32.0 2023-06-24 07:51:22,568 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-24 07:51:26,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-24 07:51:34,169 INFO [train.py:996] (1/4) Epoch 10, batch 7600, loss[loss=0.2418, simple_loss=0.3146, pruned_loss=0.08448, over 22076.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3088, pruned_loss=0.07772, over 4275741.85 frames. ], batch size: 119, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:51:35,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=15.0 2023-06-24 07:51:50,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1692312.0, ans=0.0 2023-06-24 07:51:54,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1692372.0, ans=0.0 2023-06-24 07:52:05,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1692372.0, ans=0.0 2023-06-24 07:52:22,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1692432.0, ans=0.125 2023-06-24 07:52:48,746 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-24 07:52:54,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1692552.0, ans=0.0 2023-06-24 07:53:09,058 INFO [train.py:996] (1/4) Epoch 10, batch 7650, loss[loss=0.1987, simple_loss=0.2925, pruned_loss=0.05245, over 20783.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3078, pruned_loss=0.07863, over 4275529.87 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:53:35,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=1692672.0, ans=15.0 2023-06-24 07:53:57,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-24 07:54:11,429 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.476e+02 5.649e+02 6.864e+02 9.584e+02 1.979e+03, threshold=1.373e+03, percent-clipped=0.0 2023-06-24 07:54:58,732 INFO [train.py:996] (1/4) Epoch 10, batch 7700, loss[loss=0.245, simple_loss=0.3166, pruned_loss=0.08666, over 21375.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3107, pruned_loss=0.08185, over 4282049.82 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:54:59,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1692912.0, ans=0.125 2023-06-24 07:56:06,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-24 07:56:40,760 INFO [train.py:996] (1/4) Epoch 10, batch 7750, loss[loss=0.3041, simple_loss=0.404, pruned_loss=0.1021, over 21641.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3134, pruned_loss=0.08125, over 4270746.72 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 07:56:56,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1693212.0, ans=0.0 2023-06-24 07:57:42,760 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 8.175e+02 1.275e+03 1.822e+03 5.282e+03, threshold=2.550e+03, percent-clipped=41.0 2023-06-24 07:58:17,952 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.82 vs. limit=22.5 2023-06-24 07:58:21,736 INFO [train.py:996] (1/4) Epoch 10, batch 7800, loss[loss=0.2645, simple_loss=0.3413, pruned_loss=0.09386, over 21851.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3167, pruned_loss=0.08277, over 4257854.95 frames. 
], batch size: 372, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 07:58:30,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1693512.0, ans=0.0 2023-06-24 07:59:14,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1693632.0, ans=0.125 2023-06-24 07:59:20,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1693692.0, ans=0.0 2023-06-24 07:59:44,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-24 07:59:46,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-24 07:59:47,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-24 07:59:58,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-24 08:00:00,783 INFO [train.py:996] (1/4) Epoch 10, batch 7850, loss[loss=0.2192, simple_loss=0.2782, pruned_loss=0.08007, over 21286.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3103, pruned_loss=0.08223, over 4257940.64 frames. ], batch size: 177, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:00:48,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1693932.0, ans=0.0 2023-06-24 08:01:02,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.400e+02 8.758e+02 1.300e+03 4.376e+03, threshold=1.752e+03, percent-clipped=3.0 2023-06-24 08:01:04,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1693992.0, ans=0.0 2023-06-24 08:01:10,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1693992.0, ans=0.125 2023-06-24 08:01:15,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1693992.0, ans=0.125 2023-06-24 08:01:15,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1693992.0, ans=0.125 2023-06-24 08:01:33,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1694052.0, ans=0.5 2023-06-24 08:01:40,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1694112.0, ans=0.125 2023-06-24 08:01:41,743 INFO [train.py:996] (1/4) Epoch 10, batch 7900, loss[loss=0.3365, simple_loss=0.4268, pruned_loss=0.1231, over 21490.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3052, pruned_loss=0.08129, over 4260399.70 frames. 
], batch size: 471, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:01:59,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1694112.0, ans=0.2 2023-06-24 08:02:14,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1694172.0, ans=0.2 2023-06-24 08:02:29,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1694232.0, ans=0.2 2023-06-24 08:03:00,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1694292.0, ans=0.04949747468305833 2023-06-24 08:03:11,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1694352.0, ans=0.125 2023-06-24 08:03:29,368 INFO [train.py:996] (1/4) Epoch 10, batch 7950, loss[loss=0.2353, simple_loss=0.3233, pruned_loss=0.07363, over 21909.00 frames. ], tot_loss[loss=0.235, simple_loss=0.311, pruned_loss=0.0795, over 4259900.47 frames. ], batch size: 316, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:03:56,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-24 08:04:00,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1694472.0, ans=0.125 2023-06-24 08:04:14,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-24 08:04:36,165 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.376e+02 7.152e+02 9.880e+02 1.480e+03 4.841e+03, threshold=1.976e+03, percent-clipped=16.0 2023-06-24 08:04:57,023 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-24 08:05:16,266 INFO [train.py:996] (1/4) Epoch 10, batch 8000, loss[loss=0.2457, simple_loss=0.3234, pruned_loss=0.08399, over 21764.00 frames. ], tot_loss[loss=0.241, simple_loss=0.317, pruned_loss=0.08254, over 4260876.75 frames. ], batch size: 332, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:05:24,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1694712.0, ans=0.125 2023-06-24 08:05:55,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1694772.0, ans=0.125 2023-06-24 08:06:51,116 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 08:07:06,346 INFO [train.py:996] (1/4) Epoch 10, batch 8050, loss[loss=0.2588, simple_loss=0.3393, pruned_loss=0.08918, over 21881.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3206, pruned_loss=0.08242, over 4262844.20 frames. ], batch size: 317, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:07:11,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695012.0, ans=0.125 2023-06-24 08:07:26,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=15.0 2023-06-24 08:07:56,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1695132.0, ans=0.125 2023-06-24 08:08:07,474 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.312e+02 7.120e+02 8.807e+02 1.513e+03 2.630e+03, threshold=1.761e+03, percent-clipped=8.0 2023-06-24 08:08:22,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1695192.0, ans=0.125 2023-06-24 08:08:24,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1695252.0, ans=0.125 2023-06-24 08:08:46,532 INFO [train.py:996] (1/4) Epoch 10, batch 8100, loss[loss=0.2354, simple_loss=0.3093, pruned_loss=0.08079, over 21862.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3175, pruned_loss=0.0825, over 4268993.65 frames. ], batch size: 107, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:10:36,968 INFO [train.py:996] (1/4) Epoch 10, batch 8150, loss[loss=0.3368, simple_loss=0.4242, pruned_loss=0.1247, over 21529.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3255, pruned_loss=0.08461, over 4270129.83 frames. ], batch size: 509, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:11:42,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695792.0, ans=0.1 2023-06-24 08:11:43,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 7.483e+02 1.136e+03 1.755e+03 5.961e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 08:12:02,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1695852.0, ans=0.125 2023-06-24 08:12:17,698 INFO [train.py:996] (1/4) Epoch 10, batch 8200, loss[loss=0.1849, simple_loss=0.2494, pruned_loss=0.0602, over 21514.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3209, pruned_loss=0.08358, over 4272045.62 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:12:26,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1695912.0, ans=0.125 2023-06-24 08:13:40,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1696152.0, ans=0.2 2023-06-24 08:13:43,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1696152.0, ans=0.125 2023-06-24 08:13:57,389 INFO [train.py:996] (1/4) Epoch 10, batch 8250, loss[loss=0.2186, simple_loss=0.3171, pruned_loss=0.06005, over 21713.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3182, pruned_loss=0.08228, over 4277994.07 frames. 
], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:13:57,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696212.0, ans=0.1 2023-06-24 08:14:38,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1696332.0, ans=0.125 2023-06-24 08:14:38,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696332.0, ans=0.1 2023-06-24 08:15:02,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.88 vs. limit=10.0 2023-06-24 08:15:04,598 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.467e+02 8.535e+02 1.267e+03 3.280e+03, threshold=1.707e+03, percent-clipped=4.0 2023-06-24 08:15:38,173 INFO [train.py:996] (1/4) Epoch 10, batch 8300, loss[loss=0.2515, simple_loss=0.3371, pruned_loss=0.08293, over 21607.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3152, pruned_loss=0.07895, over 4274475.07 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:15:50,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1696512.0, ans=0.0 2023-06-24 08:16:13,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-24 08:16:14,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696572.0, ans=0.1 2023-06-24 08:16:35,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1696632.0, ans=0.0 2023-06-24 08:16:54,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1696692.0, ans=0.125 2023-06-24 08:16:54,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-24 08:17:10,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696752.0, ans=0.1 2023-06-24 08:17:18,904 INFO [train.py:996] (1/4) Epoch 10, batch 8350, loss[loss=0.1891, simple_loss=0.2736, pruned_loss=0.05231, over 21597.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3138, pruned_loss=0.07686, over 4263989.63 frames. ], batch size: 263, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:17:36,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1696812.0, ans=0.1 2023-06-24 08:18:10,299 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.22 vs. limit=15.0 2023-06-24 08:18:29,887 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.071e+02 7.237e+02 1.103e+03 3.221e+03, threshold=1.447e+03, percent-clipped=5.0 2023-06-24 08:18:59,505 INFO [train.py:996] (1/4) Epoch 10, batch 8400, loss[loss=0.2039, simple_loss=0.2963, pruned_loss=0.05573, over 21701.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3102, pruned_loss=0.07419, over 4261025.64 frames. 
], batch size: 298, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:19:16,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1697112.0, ans=0.1 2023-06-24 08:19:29,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1697172.0, ans=0.95 2023-06-24 08:20:04,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1697232.0, ans=0.0 2023-06-24 08:20:28,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-24 08:20:34,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1697352.0, ans=0.0 2023-06-24 08:20:39,161 INFO [train.py:996] (1/4) Epoch 10, batch 8450, loss[loss=0.2787, simple_loss=0.3377, pruned_loss=0.1098, over 21811.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3087, pruned_loss=0.07437, over 4271106.01 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:21:07,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1697472.0, ans=0.0 2023-06-24 08:21:41,551 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-24 08:21:51,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 9.053e+02 1.298e+03 1.951e+03 3.847e+03, threshold=2.596e+03, percent-clipped=39.0 2023-06-24 08:21:54,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-06-24 08:22:23,931 INFO [train.py:996] (1/4) Epoch 10, batch 8500, loss[loss=0.196, simple_loss=0.258, pruned_loss=0.06697, over 21641.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3057, pruned_loss=0.07601, over 4264936.08 frames. ], batch size: 247, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:23:30,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697892.0, ans=0.125 2023-06-24 08:23:40,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-24 08:24:03,945 INFO [train.py:996] (1/4) Epoch 10, batch 8550, loss[loss=0.2077, simple_loss=0.2749, pruned_loss=0.07021, over 21973.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.309, pruned_loss=0.07827, over 4268677.30 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:24:18,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698012.0, ans=0.1 2023-06-24 08:24:53,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. 
limit=10.0 2023-06-24 08:25:07,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1698192.0, ans=0.125 2023-06-24 08:25:09,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.596e+02 7.231e+02 1.146e+03 1.740e+03 4.216e+03, threshold=2.291e+03, percent-clipped=13.0 2023-06-24 08:25:22,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1698252.0, ans=0.0 2023-06-24 08:25:35,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1698252.0, ans=0.09899494936611666 2023-06-24 08:25:48,537 INFO [train.py:996] (1/4) Epoch 10, batch 8600, loss[loss=0.1776, simple_loss=0.2377, pruned_loss=0.05878, over 20775.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3147, pruned_loss=0.0802, over 4278904.47 frames. ], batch size: 609, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:26:03,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-24 08:26:14,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1698372.0, ans=0.04949747468305833 2023-06-24 08:26:44,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1698432.0, ans=0.125 2023-06-24 08:26:45,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1698492.0, ans=0.125 2023-06-24 08:27:28,183 INFO [train.py:996] (1/4) Epoch 10, batch 8650, loss[loss=0.267, simple_loss=0.3633, pruned_loss=0.08532, over 21607.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3206, pruned_loss=0.08181, over 4278603.30 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:28:28,429 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.800e+02 1.041e+03 1.467e+03 2.492e+03, threshold=2.082e+03, percent-clipped=1.0 2023-06-24 08:28:58,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1698912.0, ans=0.125 2023-06-24 08:28:59,772 INFO [train.py:996] (1/4) Epoch 10, batch 8700, loss[loss=0.2583, simple_loss=0.3097, pruned_loss=0.1034, over 21235.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.312, pruned_loss=0.07839, over 4284139.96 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:29:44,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1698972.0, ans=0.2 2023-06-24 08:29:46,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1699032.0, ans=0.125 2023-06-24 08:30:17,648 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:30:37,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1699152.0, ans=0.0 2023-06-24 08:30:47,018 INFO [train.py:996] (1/4) Epoch 10, batch 8750, loss[loss=0.221, simple_loss=0.2868, pruned_loss=0.07762, over 21727.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3064, pruned_loss=0.07852, over 4280762.55 frames. 
], batch size: 230, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:31:31,713 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. limit=6.0 2023-06-24 08:31:50,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 7.163e+02 1.016e+03 1.538e+03 3.044e+03, threshold=2.032e+03, percent-clipped=7.0 2023-06-24 08:32:06,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1699452.0, ans=0.0 2023-06-24 08:32:32,752 INFO [train.py:996] (1/4) Epoch 10, batch 8800, loss[loss=0.3334, simple_loss=0.3991, pruned_loss=0.1339, over 21774.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3144, pruned_loss=0.08124, over 4277567.84 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:32:36,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1699512.0, ans=0.0 2023-06-24 08:32:59,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699572.0, ans=0.1 2023-06-24 08:33:15,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1699632.0, ans=0.125 2023-06-24 08:33:17,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1699632.0, ans=0.0 2023-06-24 08:34:18,610 INFO [train.py:996] (1/4) Epoch 10, batch 8850, loss[loss=0.2398, simple_loss=0.3403, pruned_loss=0.06965, over 15796.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3202, pruned_loss=0.08244, over 4266868.06 frames. ], batch size: 61, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:34:32,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1699872.0, ans=0.0 2023-06-24 08:34:42,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1699872.0, ans=0.125 2023-06-24 08:34:56,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1699932.0, ans=0.2 2023-06-24 08:35:17,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.300e+02 6.165e+02 8.058e+02 1.044e+03 1.938e+03, threshold=1.612e+03, percent-clipped=0.0 2023-06-24 08:35:58,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1700112.0, ans=0.125 2023-06-24 08:35:59,454 INFO [train.py:996] (1/4) Epoch 10, batch 8900, loss[loss=0.2271, simple_loss=0.2924, pruned_loss=0.08091, over 15410.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3142, pruned_loss=0.08123, over 4259584.07 frames. ], batch size: 62, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:36:20,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1700172.0, ans=0.1 2023-06-24 08:36:43,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.58 vs. 
limit=12.0 2023-06-24 08:37:09,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1700292.0, ans=0.0 2023-06-24 08:37:43,137 INFO [train.py:996] (1/4) Epoch 10, batch 8950, loss[loss=0.2172, simple_loss=0.3376, pruned_loss=0.04842, over 19791.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.315, pruned_loss=0.08013, over 4253746.27 frames. ], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:37:58,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1700472.0, ans=0.07 2023-06-24 08:38:14,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1700472.0, ans=0.125 2023-06-24 08:38:33,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1700532.0, ans=0.0 2023-06-24 08:38:43,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1700532.0, ans=0.125 2023-06-24 08:38:56,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 1.062e+03 1.597e+03 2.316e+03 4.236e+03, threshold=3.193e+03, percent-clipped=50.0 2023-06-24 08:38:57,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1700592.0, ans=0.0 2023-06-24 08:39:22,940 INFO [train.py:996] (1/4) Epoch 10, batch 9000, loss[loss=0.2226, simple_loss=0.2971, pruned_loss=0.07403, over 21879.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3113, pruned_loss=0.0805, over 4250451.18 frames. ], batch size: 373, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:39:22,940 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 08:39:39,586 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2679, simple_loss=0.3599, pruned_loss=0.08793, over 1796401.00 frames. 2023-06-24 08:39:39,586 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 08:40:41,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1700832.0, ans=0.2 2023-06-24 08:41:21,510 INFO [train.py:996] (1/4) Epoch 10, batch 9050, loss[loss=0.2287, simple_loss=0.3103, pruned_loss=0.07353, over 21732.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3059, pruned_loss=0.07706, over 4261114.48 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:41:39,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1701012.0, ans=0.2 2023-06-24 08:41:53,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1701072.0, ans=0.125 2023-06-24 08:42:34,895 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1701192.0, ans=0.04949747468305833 2023-06-24 08:42:36,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701192.0, ans=0.1 2023-06-24 08:42:37,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 8.915e+02 1.381e+03 2.151e+03 3.467e+03, threshold=2.763e+03, percent-clipped=3.0 2023-06-24 08:42:51,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=15.0 2023-06-24 08:42:53,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1701252.0, ans=0.0 2023-06-24 08:43:08,693 INFO [train.py:996] (1/4) Epoch 10, batch 9100, loss[loss=0.1788, simple_loss=0.2571, pruned_loss=0.05025, over 15597.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3101, pruned_loss=0.0792, over 4259324.29 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:43:33,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1701372.0, ans=0.0 2023-06-24 08:43:51,566 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1701372.0, ans=10.0 2023-06-24 08:44:23,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1701492.0, ans=0.2 2023-06-24 08:44:26,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1701552.0, ans=0.05 2023-06-24 08:44:30,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1701552.0, ans=0.125 2023-06-24 08:44:49,851 INFO [train.py:996] (1/4) Epoch 10, batch 9150, loss[loss=0.2306, simple_loss=0.3228, pruned_loss=0.06919, over 21820.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3149, pruned_loss=0.07776, over 4265977.34 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:45:12,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1701612.0, ans=0.125 2023-06-24 08:45:13,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1701612.0, ans=0.0 2023-06-24 08:45:59,086 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 7.155e+02 1.028e+03 1.622e+03 3.048e+03, threshold=2.056e+03, percent-clipped=2.0 2023-06-24 08:46:31,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1701852.0, ans=0.0 2023-06-24 08:46:40,279 INFO [train.py:996] (1/4) Epoch 10, batch 9200, loss[loss=0.3186, simple_loss=0.3852, pruned_loss=0.126, over 21467.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.316, pruned_loss=0.07654, over 4269358.20 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:47:19,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1702032.0, ans=0.125 2023-06-24 08:47:21,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1702032.0, ans=0.125 2023-06-24 08:47:35,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1702092.0, ans=0.125 2023-06-24 08:47:41,456 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-24 08:47:42,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1702092.0, ans=0.125 2023-06-24 08:48:09,987 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1702152.0, ans=0.125 2023-06-24 08:48:20,290 INFO [train.py:996] (1/4) Epoch 10, batch 9250, loss[loss=0.3034, simple_loss=0.3888, pruned_loss=0.1089, over 19785.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3186, pruned_loss=0.07827, over 4267643.93 frames. ], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:48:44,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1702272.0, ans=0.125 2023-06-24 08:48:49,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1702272.0, ans=0.125 2023-06-24 08:49:05,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1702332.0, ans=0.125 2023-06-24 08:49:15,646 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:49:17,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-24 08:49:21,337 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.372e+02 7.455e+02 9.438e+02 1.547e+03 2.905e+03, threshold=1.888e+03, percent-clipped=9.0 2023-06-24 08:49:51,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.77 vs. limit=6.0 2023-06-24 08:50:06,188 INFO [train.py:996] (1/4) Epoch 10, batch 9300, loss[loss=0.2928, simple_loss=0.3752, pruned_loss=0.1052, over 21612.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3131, pruned_loss=0.07838, over 4260507.49 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:50:16,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702512.0, ans=0.1 2023-06-24 08:50:16,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1702512.0, ans=0.125 2023-06-24 08:50:53,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1702632.0, ans=0.07 2023-06-24 08:51:32,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702752.0, ans=0.1 2023-06-24 08:51:42,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702812.0, ans=0.1 2023-06-24 08:51:43,775 INFO [train.py:996] (1/4) Epoch 10, batch 9350, loss[loss=0.2531, simple_loss=0.3306, pruned_loss=0.08777, over 21316.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3211, pruned_loss=0.08015, over 4253800.58 frames. 
], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:52:30,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702932.0, ans=0.1 2023-06-24 08:52:38,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1702932.0, ans=0.0 2023-06-24 08:52:44,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1702992.0, ans=0.2 2023-06-24 08:53:01,223 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.560e+02 6.681e+02 9.424e+02 1.664e+03 4.543e+03, threshold=1.885e+03, percent-clipped=14.0 2023-06-24 08:53:10,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-24 08:53:25,343 INFO [train.py:996] (1/4) Epoch 10, batch 9400, loss[loss=0.259, simple_loss=0.3094, pruned_loss=0.1043, over 21278.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3225, pruned_loss=0.08135, over 4262036.36 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:53:25,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1703112.0, ans=0.2 2023-06-24 08:53:32,274 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:53:43,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1703172.0, ans=0.2 2023-06-24 08:54:04,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703232.0, ans=0.1 2023-06-24 08:54:04,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1703232.0, ans=0.125 2023-06-24 08:55:04,525 INFO [train.py:996] (1/4) Epoch 10, batch 9450, loss[loss=0.1977, simple_loss=0.2671, pruned_loss=0.0642, over 21644.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3143, pruned_loss=0.08038, over 4257075.26 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:55:04,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1703412.0, ans=0.1 2023-06-24 08:55:17,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1703412.0, ans=0.125 2023-06-24 08:56:19,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.658e+02 7.388e+02 1.013e+03 1.627e+03 3.415e+03, threshold=2.026e+03, percent-clipped=14.0 2023-06-24 08:56:23,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1703592.0, ans=0.0 2023-06-24 08:56:33,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1703652.0, ans=0.2 2023-06-24 08:56:37,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1703652.0, ans=0.0 2023-06-24 08:56:43,453 INFO [train.py:996] (1/4) Epoch 10, batch 9500, loss[loss=0.2355, simple_loss=0.2999, pruned_loss=0.08551, over 21417.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3068, pruned_loss=0.07916, over 4251398.50 frames. 
], batch size: 508, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:56:44,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-24 08:57:52,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1703892.0, ans=0.125 2023-06-24 08:57:56,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-24 08:58:00,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1703892.0, ans=0.125 2023-06-24 08:58:20,295 INFO [train.py:996] (1/4) Epoch 10, batch 9550, loss[loss=0.3097, simple_loss=0.366, pruned_loss=0.1267, over 21441.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3126, pruned_loss=0.08171, over 4248441.29 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:58:21,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-24 08:58:45,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1704072.0, ans=0.05 2023-06-24 08:59:18,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1704132.0, ans=0.0 2023-06-24 08:59:33,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.417e+02 6.895e+02 1.023e+03 1.419e+03 2.349e+03, threshold=2.046e+03, percent-clipped=3.0 2023-06-24 08:59:47,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704252.0, ans=0.1 2023-06-24 08:59:57,571 INFO [train.py:996] (1/4) Epoch 10, batch 9600, loss[loss=0.2353, simple_loss=0.3044, pruned_loss=0.08313, over 21878.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3174, pruned_loss=0.08283, over 4248258.78 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:00:02,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-24 09:00:07,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-24 09:00:35,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1704432.0, ans=0.125 2023-06-24 09:01:11,419 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-24 09:01:37,650 INFO [train.py:996] (1/4) Epoch 10, batch 9650, loss[loss=0.269, simple_loss=0.3392, pruned_loss=0.0994, over 21743.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3159, pruned_loss=0.0818, over 4255302.42 frames. 
], batch size: 332, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:02:07,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1704672.0, ans=0.125 2023-06-24 09:02:55,081 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.745e+02 1.012e+03 1.363e+03 3.649e+03, threshold=2.025e+03, percent-clipped=11.0 2023-06-24 09:03:10,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1704852.0, ans=0.0 2023-06-24 09:03:17,715 INFO [train.py:996] (1/4) Epoch 10, batch 9700, loss[loss=0.2371, simple_loss=0.3127, pruned_loss=0.0808, over 21911.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3204, pruned_loss=0.08211, over 4257426.31 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:03:18,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1704912.0, ans=0.0 2023-06-24 09:03:23,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1704912.0, ans=0.0 2023-06-24 09:03:35,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1704912.0, ans=0.2 2023-06-24 09:04:28,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1705092.0, ans=0.125 2023-06-24 09:04:37,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1705092.0, ans=0.1 2023-06-24 09:04:37,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1705092.0, ans=0.0 2023-06-24 09:04:55,681 INFO [train.py:996] (1/4) Epoch 10, batch 9750, loss[loss=0.1987, simple_loss=0.2575, pruned_loss=0.06996, over 21128.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3145, pruned_loss=0.08052, over 4258859.33 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:05:16,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1705272.0, ans=0.0 2023-06-24 09:05:20,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1705272.0, ans=0.125 2023-06-24 09:05:57,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-24 09:05:59,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-24 09:06:10,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.138e+02 7.196e+02 1.029e+03 1.714e+03 4.123e+03, threshold=2.059e+03, percent-clipped=13.0 2023-06-24 09:06:32,708 INFO [train.py:996] (1/4) Epoch 10, batch 9800, loss[loss=0.2035, simple_loss=0.2777, pruned_loss=0.06464, over 21671.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3122, pruned_loss=0.08024, over 4259158.34 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:06:49,459 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:07:51,340 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.61 vs. 
limit=22.5 2023-06-24 09:07:55,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1705752.0, ans=0.05 2023-06-24 09:08:01,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1705752.0, ans=0.125 2023-06-24 09:08:06,343 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:08:10,627 INFO [train.py:996] (1/4) Epoch 10, batch 9850, loss[loss=0.2183, simple_loss=0.2807, pruned_loss=0.07798, over 21732.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3103, pruned_loss=0.081, over 4268377.03 frames. ], batch size: 264, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:08:30,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-24 09:08:31,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1705872.0, ans=0.125 2023-06-24 09:08:34,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1705872.0, ans=0.125 2023-06-24 09:09:26,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.595e+02 1.021e+03 1.343e+03 2.731e+03, threshold=2.043e+03, percent-clipped=9.0 2023-06-24 09:09:35,777 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-24 09:09:49,333 INFO [train.py:996] (1/4) Epoch 10, batch 9900, loss[loss=0.2084, simple_loss=0.2965, pruned_loss=0.06018, over 19884.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3073, pruned_loss=0.08085, over 4245238.90 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:10:31,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-24 09:11:13,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1706352.0, ans=0.07 2023-06-24 09:11:27,333 INFO [train.py:996] (1/4) Epoch 10, batch 9950, loss[loss=0.2269, simple_loss=0.2899, pruned_loss=0.08199, over 21925.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3075, pruned_loss=0.08138, over 4242220.54 frames. ], batch size: 373, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:11:35,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-24 09:12:19,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1706532.0, ans=10.0 2023-06-24 09:12:44,781 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 7.241e+02 1.069e+03 1.520e+03 2.876e+03, threshold=2.138e+03, percent-clipped=9.0 2023-06-24 09:13:12,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 09:13:12,906 INFO [train.py:996] (1/4) Epoch 10, batch 10000, loss[loss=0.2726, simple_loss=0.3408, pruned_loss=0.1022, over 21788.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.3037, pruned_loss=0.08097, over 4245444.18 frames. ], batch size: 124, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:14:10,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1706832.0, ans=0.125 2023-06-24 09:14:34,328 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-24 09:14:54,919 INFO [train.py:996] (1/4) Epoch 10, batch 10050, loss[loss=0.2592, simple_loss=0.3296, pruned_loss=0.09446, over 21367.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3063, pruned_loss=0.08155, over 4249770.54 frames. ], batch size: 131, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:15:01,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707012.0, ans=0.1 2023-06-24 09:15:37,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1707132.0, ans=0.0 2023-06-24 09:15:53,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1707192.0, ans=0.125 2023-06-24 09:16:11,749 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.781e+02 9.769e+02 1.554e+03 3.220e+03, threshold=1.954e+03, percent-clipped=12.0 2023-06-24 09:16:27,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1707252.0, ans=0.125 2023-06-24 09:16:30,131 INFO [train.py:996] (1/4) Epoch 10, batch 10100, loss[loss=0.1745, simple_loss=0.234, pruned_loss=0.05753, over 20778.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.305, pruned_loss=0.08037, over 4258317.53 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:17:21,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-24 09:17:32,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1707492.0, ans=0.125 2023-06-24 09:17:56,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1707552.0, ans=0.0 2023-06-24 09:17:56,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1707552.0, ans=0.2 2023-06-24 09:18:13,795 INFO [train.py:996] (1/4) Epoch 10, batch 10150, loss[loss=0.27, simple_loss=0.3407, pruned_loss=0.09964, over 21384.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.309, pruned_loss=0.08217, over 4261546.71 frames. 
], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:18:51,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1707732.0, ans=0.0 2023-06-24 09:19:02,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1707732.0, ans=0.2 2023-06-24 09:19:06,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1707792.0, ans=22.5 2023-06-24 09:19:25,309 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.106e+02 9.650e+02 1.431e+03 2.478e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-24 09:19:25,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1707792.0, ans=0.125 2023-06-24 09:19:54,010 INFO [train.py:996] (1/4) Epoch 10, batch 10200, loss[loss=0.1906, simple_loss=0.2772, pruned_loss=0.05196, over 21223.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3072, pruned_loss=0.07921, over 4263242.42 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:20:13,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1707972.0, ans=0.125 2023-06-24 09:20:40,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=15.0 2023-06-24 09:20:46,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1708092.0, ans=0.2 2023-06-24 09:20:55,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1708092.0, ans=0.0 2023-06-24 09:21:18,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1708152.0, ans=0.0 2023-06-24 09:21:33,130 INFO [train.py:996] (1/4) Epoch 10, batch 10250, loss[loss=0.2481, simple_loss=0.3283, pruned_loss=0.08393, over 21210.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3023, pruned_loss=0.07349, over 4272267.24 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:22:06,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1708272.0, ans=0.125 2023-06-24 09:22:07,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1708272.0, ans=0.125 2023-06-24 09:22:13,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-24 09:22:32,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1708392.0, ans=0.0 2023-06-24 09:22:46,004 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.201e+02 5.691e+02 8.251e+02 1.364e+03 2.412e+03, threshold=1.650e+03, percent-clipped=9.0 2023-06-24 09:23:21,846 INFO [train.py:996] (1/4) Epoch 10, batch 10300, loss[loss=0.2264, simple_loss=0.3205, pruned_loss=0.06618, over 21762.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3045, pruned_loss=0.07466, over 4274905.69 frames. 
], batch size: 247, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:23:22,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1708512.0, ans=0.0 2023-06-24 09:24:09,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1708632.0, ans=0.125 2023-06-24 09:24:13,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1708632.0, ans=0.125 2023-06-24 09:25:04,360 INFO [train.py:996] (1/4) Epoch 10, batch 10350, loss[loss=0.3063, simple_loss=0.3921, pruned_loss=0.1102, over 21475.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3103, pruned_loss=0.07647, over 4272451.94 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:25:07,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1708812.0, ans=0.2 2023-06-24 09:25:23,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1708872.0, ans=0.125 2023-06-24 09:25:23,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1708872.0, ans=0.95 2023-06-24 09:25:43,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1708932.0, ans=0.125 2023-06-24 09:25:48,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.94 vs. limit=10.0 2023-06-24 09:26:01,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1708932.0, ans=0.125 2023-06-24 09:26:07,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1708992.0, ans=0.0 2023-06-24 09:26:21,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 6.885e+02 1.073e+03 1.600e+03 3.112e+03, threshold=2.146e+03, percent-clipped=24.0 2023-06-24 09:26:21,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1708992.0, ans=0.04949747468305833 2023-06-24 09:26:34,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1709052.0, ans=0.125 2023-06-24 09:26:41,154 INFO [train.py:996] (1/4) Epoch 10, batch 10400, loss[loss=0.1417, simple_loss=0.1975, pruned_loss=0.04294, over 21414.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3032, pruned_loss=0.0748, over 4261758.48 frames. ], batch size: 131, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:27:33,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1709232.0, ans=0.0 2023-06-24 09:27:46,966 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:28:17,486 INFO [train.py:996] (1/4) Epoch 10, batch 10450, loss[loss=0.2371, simple_loss=0.3383, pruned_loss=0.06795, over 20757.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3071, pruned_loss=0.07768, over 4265257.15 frames. 
], batch size: 608, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:28:44,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709472.0, ans=0.1 2023-06-24 09:29:12,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1709532.0, ans=0.0 2023-06-24 09:29:26,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1709592.0, ans=0.125 2023-06-24 09:29:37,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.690e+02 1.222e+03 1.916e+03 3.478e+03, threshold=2.445e+03, percent-clipped=16.0 2023-06-24 09:29:56,560 INFO [train.py:996] (1/4) Epoch 10, batch 10500, loss[loss=0.2176, simple_loss=0.2834, pruned_loss=0.0759, over 21749.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.306, pruned_loss=0.07606, over 4263012.62 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:29:58,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1709712.0, ans=0.125 2023-06-24 09:30:08,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1709712.0, ans=0.0 2023-06-24 09:31:01,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1709832.0, ans=0.1 2023-06-24 09:31:06,797 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-24 09:31:20,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1709952.0, ans=0.2 2023-06-24 09:31:20,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1709952.0, ans=0.2 2023-06-24 09:31:35,616 INFO [train.py:996] (1/4) Epoch 10, batch 10550, loss[loss=0.1856, simple_loss=0.2517, pruned_loss=0.05976, over 21632.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3019, pruned_loss=0.07451, over 4239699.67 frames. 
], batch size: 231, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:32:01,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1710072.0, ans=0.0 2023-06-24 09:32:12,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1710072.0, ans=0.125 2023-06-24 09:32:20,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1710132.0, ans=0.125 2023-06-24 09:32:21,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1710132.0, ans=0.0 2023-06-24 09:32:32,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1710132.0, ans=0.2 2023-06-24 09:32:44,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1710192.0, ans=0.125 2023-06-24 09:32:55,116 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.399e+02 7.494e+02 1.006e+03 1.488e+03 3.263e+03, threshold=2.013e+03, percent-clipped=2.0 2023-06-24 09:33:15,650 INFO [train.py:996] (1/4) Epoch 10, batch 10600, loss[loss=0.2049, simple_loss=0.2855, pruned_loss=0.06209, over 21464.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2952, pruned_loss=0.073, over 4252260.68 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:33:16,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1710312.0, ans=0.0 2023-06-24 09:34:22,288 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:34:46,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1710552.0, ans=0.125 2023-06-24 09:34:57,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1710552.0, ans=0.07 2023-06-24 09:34:59,852 INFO [train.py:996] (1/4) Epoch 10, batch 10650, loss[loss=0.1676, simple_loss=0.2518, pruned_loss=0.04167, over 21680.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.297, pruned_loss=0.07228, over 4256783.16 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:35:25,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=22.5 2023-06-24 09:35:57,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1710732.0, ans=0.125 2023-06-24 09:36:16,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 7.408e+02 1.241e+03 1.885e+03 3.956e+03, threshold=2.481e+03, percent-clipped=17.0 2023-06-24 09:36:31,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1710852.0, ans=0.125 2023-06-24 09:36:45,452 INFO [train.py:996] (1/4) Epoch 10, batch 10700, loss[loss=0.2429, simple_loss=0.3121, pruned_loss=0.08687, over 21637.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2955, pruned_loss=0.07196, over 4264444.36 frames. 
], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:38:12,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1711152.0, ans=0.04949747468305833 2023-06-24 09:38:33,076 INFO [train.py:996] (1/4) Epoch 10, batch 10750, loss[loss=0.2636, simple_loss=0.3578, pruned_loss=0.08466, over 20706.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.308, pruned_loss=0.07665, over 4264034.92 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:38:53,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1711272.0, ans=0.2 2023-06-24 09:38:56,959 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-24 09:39:05,539 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-24 09:39:16,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1711332.0, ans=0.2 2023-06-24 09:39:50,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 7.385e+02 1.038e+03 1.565e+03 3.899e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 09:40:03,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1711452.0, ans=0.125 2023-06-24 09:40:20,323 INFO [train.py:996] (1/4) Epoch 10, batch 10800, loss[loss=0.2489, simple_loss=0.3242, pruned_loss=0.08683, over 21727.00 frames. ], tot_loss[loss=0.234, simple_loss=0.313, pruned_loss=0.07752, over 4266104.26 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:40:22,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1711512.0, ans=0.125 2023-06-24 09:41:11,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1711692.0, ans=0.125 2023-06-24 09:41:41,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1711752.0, ans=0.0 2023-06-24 09:41:43,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1711752.0, ans=0.0 2023-06-24 09:41:55,362 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:41:59,467 INFO [train.py:996] (1/4) Epoch 10, batch 10850, loss[loss=0.2102, simple_loss=0.277, pruned_loss=0.07165, over 21537.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3125, pruned_loss=0.07846, over 4268661.90 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:41:59,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1711812.0, ans=0.0 2023-06-24 09:42:29,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.48 vs. 
limit=22.5 2023-06-24 09:42:32,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1711932.0, ans=0.125 2023-06-24 09:42:34,264 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:43:15,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1711992.0, ans=0.0 2023-06-24 09:43:18,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 6.562e+02 9.748e+02 1.395e+03 3.143e+03, threshold=1.950e+03, percent-clipped=4.0 2023-06-24 09:43:27,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1712052.0, ans=0.125 2023-06-24 09:43:38,805 INFO [train.py:996] (1/4) Epoch 10, batch 10900, loss[loss=0.253, simple_loss=0.3755, pruned_loss=0.06527, over 20802.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3073, pruned_loss=0.07639, over 4269751.70 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:43:55,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1712172.0, ans=0.025 2023-06-24 09:45:12,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1712352.0, ans=0.2 2023-06-24 09:45:18,167 INFO [train.py:996] (1/4) Epoch 10, batch 10950, loss[loss=0.2043, simple_loss=0.2721, pruned_loss=0.06819, over 21242.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3034, pruned_loss=0.07438, over 4267374.21 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:45:36,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1712472.0, ans=0.125 2023-06-24 09:46:12,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1712532.0, ans=0.125 2023-06-24 09:46:29,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1712592.0, ans=0.125 2023-06-24 09:46:31,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1712592.0, ans=0.125 2023-06-24 09:46:31,939 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-24 09:46:35,548 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.641e+02 1.100e+03 1.576e+03 3.666e+03, threshold=2.199e+03, percent-clipped=18.0 2023-06-24 09:46:37,984 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:46:56,703 INFO [train.py:996] (1/4) Epoch 10, batch 11000, loss[loss=0.2727, simple_loss=0.3289, pruned_loss=0.1082, over 21838.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3016, pruned_loss=0.07557, over 4275513.92 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:47:06,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1712712.0, ans=0.2 2023-06-24 09:48:35,822 INFO [train.py:996] (1/4) Epoch 10, batch 11050, loss[loss=0.245, simple_loss=0.2945, pruned_loss=0.09775, over 21677.00 frames. 
], tot_loss[loss=0.2261, simple_loss=0.2985, pruned_loss=0.07685, over 4269824.57 frames. ], batch size: 416, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:48:56,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-24 09:49:11,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1713132.0, ans=0.125 2023-06-24 09:49:39,295 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-24 09:49:52,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.589e+02 8.856e+02 1.146e+03 2.849e+03, threshold=1.771e+03, percent-clipped=5.0 2023-06-24 09:50:13,219 INFO [train.py:996] (1/4) Epoch 10, batch 11100, loss[loss=0.2269, simple_loss=0.2966, pruned_loss=0.0786, over 21666.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2971, pruned_loss=0.07697, over 4258477.04 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:50:48,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1713432.0, ans=0.125 2023-06-24 09:51:22,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-24 09:51:25,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713492.0, ans=0.1 2023-06-24 09:51:54,352 INFO [train.py:996] (1/4) Epoch 10, batch 11150, loss[loss=0.2407, simple_loss=0.3186, pruned_loss=0.08142, over 20690.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2966, pruned_loss=0.07747, over 4265287.83 frames. ], batch size: 608, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:52:24,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1713732.0, ans=0.0 2023-06-24 09:52:25,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713732.0, ans=0.125 2023-06-24 09:52:27,686 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:53:03,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.72 vs. limit=6.0 2023-06-24 09:53:11,662 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 6.370e+02 9.682e+02 1.600e+03 2.878e+03, threshold=1.936e+03, percent-clipped=17.0 2023-06-24 09:53:28,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1713852.0, ans=0.1 2023-06-24 09:53:33,222 INFO [train.py:996] (1/4) Epoch 10, batch 11200, loss[loss=0.1953, simple_loss=0.2504, pruned_loss=0.07007, over 20244.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2941, pruned_loss=0.07621, over 4265610.35 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:54:10,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1714032.0, ans=10.0 2023-06-24 09:54:34,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-24 09:54:46,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1714092.0, ans=0.025 2023-06-24 09:55:08,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1714152.0, ans=0.0 2023-06-24 09:55:12,185 INFO [train.py:996] (1/4) Epoch 10, batch 11250, loss[loss=0.2235, simple_loss=0.3059, pruned_loss=0.07055, over 21659.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2929, pruned_loss=0.07676, over 4268726.37 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:55:17,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1714212.0, ans=0.125 2023-06-24 09:56:18,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1714392.0, ans=0.0 2023-06-24 09:56:26,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.030e+02 7.779e+02 1.066e+03 3.071e+03, threshold=1.556e+03, percent-clipped=6.0 2023-06-24 09:56:39,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1714452.0, ans=0.125 2023-06-24 09:56:47,862 INFO [train.py:996] (1/4) Epoch 10, batch 11300, loss[loss=0.2111, simple_loss=0.29, pruned_loss=0.0661, over 21700.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2955, pruned_loss=0.07718, over 4272586.88 frames. ], batch size: 389, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:56:55,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1714512.0, ans=0.0 2023-06-24 09:57:13,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1714572.0, ans=0.125 2023-06-24 09:57:51,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1714692.0, ans=0.125 2023-06-24 09:58:16,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1714752.0, ans=0.0 2023-06-24 09:58:25,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1714752.0, ans=0.2 2023-06-24 09:58:28,629 INFO [train.py:996] (1/4) Epoch 10, batch 11350, loss[loss=0.1919, simple_loss=0.2669, pruned_loss=0.05839, over 21291.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2968, pruned_loss=0.07628, over 4275816.87 frames. ], batch size: 143, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:58:48,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.52 vs. 
limit=5.0 2023-06-24 09:58:50,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1714872.0, ans=0.0 2023-06-24 09:58:52,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714872.0, ans=0.1 2023-06-24 09:59:53,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.336e+02 5.946e+02 8.078e+02 1.222e+03 2.329e+03, threshold=1.616e+03, percent-clipped=17.0 2023-06-24 10:00:05,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1715052.0, ans=0.125 2023-06-24 10:00:10,840 INFO [train.py:996] (1/4) Epoch 10, batch 11400, loss[loss=0.2262, simple_loss=0.2984, pruned_loss=0.07705, over 21327.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3018, pruned_loss=0.07856, over 4273572.92 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:01:03,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1715232.0, ans=0.5 2023-06-24 10:01:38,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-24 10:01:51,734 INFO [train.py:996] (1/4) Epoch 10, batch 11450, loss[loss=0.2438, simple_loss=0.3088, pruned_loss=0.08938, over 20061.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3042, pruned_loss=0.07801, over 4278671.16 frames. ], batch size: 707, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:02:13,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2023-06-24 10:02:26,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715472.0, ans=0.1 2023-06-24 10:02:32,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1715472.0, ans=0.2 2023-06-24 10:02:44,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1715532.0, ans=0.125 2023-06-24 10:03:14,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715592.0, ans=0.1 2023-06-24 10:03:14,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1715592.0, ans=0.0 2023-06-24 10:03:15,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1715652.0, ans=0.125 2023-06-24 10:03:16,707 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.172e+02 6.928e+02 8.683e+02 1.191e+03 2.555e+03, threshold=1.737e+03, percent-clipped=7.0 2023-06-24 10:03:33,692 INFO [train.py:996] (1/4) Epoch 10, batch 11500, loss[loss=0.2251, simple_loss=0.3274, pruned_loss=0.06144, over 21853.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3071, pruned_loss=0.07884, over 4283086.11 frames. 
], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:04:05,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715772.0, ans=0.1 2023-06-24 10:04:12,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1715772.0, ans=0.125 2023-06-24 10:04:35,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715832.0, ans=0.1 2023-06-24 10:04:40,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1715892.0, ans=0.125 2023-06-24 10:04:45,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1715892.0, ans=0.035 2023-06-24 10:05:16,095 INFO [train.py:996] (1/4) Epoch 10, batch 11550, loss[loss=0.3045, simple_loss=0.4057, pruned_loss=0.1017, over 21693.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.313, pruned_loss=0.07933, over 4281736.06 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:05:36,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-24 10:05:53,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1716072.0, ans=0.0 2023-06-24 10:06:12,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1716132.0, ans=0.0 2023-06-24 10:06:27,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1716192.0, ans=0.0 2023-06-24 10:06:36,603 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.588e+02 7.008e+02 1.217e+03 2.057e+03 3.971e+03, threshold=2.435e+03, percent-clipped=35.0 2023-06-24 10:07:02,454 INFO [train.py:996] (1/4) Epoch 10, batch 11600, loss[loss=0.2256, simple_loss=0.317, pruned_loss=0.06713, over 21871.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3275, pruned_loss=0.0814, over 4275801.01 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:07:15,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1716312.0, ans=0.125 2023-06-24 10:08:37,728 INFO [train.py:996] (1/4) Epoch 10, batch 11650, loss[loss=0.2777, simple_loss=0.3745, pruned_loss=0.09042, over 21721.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3334, pruned_loss=0.08147, over 4269213.28 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:09:08,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1716672.0, ans=0.125 2023-06-24 10:09:35,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-24 10:09:49,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1716792.0, ans=0.2 2023-06-24 10:09:51,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.21 vs. 
limit=15.0 2023-06-24 10:09:51,855 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.606e+02 1.019e+03 1.524e+03 3.241e+03, threshold=2.038e+03, percent-clipped=8.0 2023-06-24 10:09:54,288 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-24 10:10:15,240 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-24 10:10:16,866 INFO [train.py:996] (1/4) Epoch 10, batch 11700, loss[loss=0.226, simple_loss=0.3149, pruned_loss=0.06853, over 20015.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3252, pruned_loss=0.0812, over 4266734.28 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:10:19,166 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:10:52,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1717032.0, ans=0.125 2023-06-24 10:10:56,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1717032.0, ans=0.125 2023-06-24 10:11:03,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1717032.0, ans=0.125 2023-06-24 10:11:55,187 INFO [train.py:996] (1/4) Epoch 10, batch 11750, loss[loss=0.2401, simple_loss=0.2973, pruned_loss=0.09149, over 21778.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3158, pruned_loss=0.08034, over 4275210.00 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:12:20,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.11 vs. limit=15.0 2023-06-24 10:12:29,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1717272.0, ans=0.125 2023-06-24 10:12:39,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1717332.0, ans=0.0 2023-06-24 10:13:15,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 6.168e+02 8.880e+02 1.254e+03 3.045e+03, threshold=1.776e+03, percent-clipped=3.0 2023-06-24 10:13:19,830 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 10:13:28,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1717452.0, ans=0.125 2023-06-24 10:13:34,955 INFO [train.py:996] (1/4) Epoch 10, batch 11800, loss[loss=0.2277, simple_loss=0.3347, pruned_loss=0.06037, over 21921.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3166, pruned_loss=0.08242, over 4268898.75 frames. 
], batch size: 372, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:13:59,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1717572.0, ans=0.125 2023-06-24 10:14:04,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1717572.0, ans=0.125 2023-06-24 10:14:10,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-24 10:14:22,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1717632.0, ans=0.125 2023-06-24 10:14:44,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 10:15:19,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=1717812.0, ans=22.5 2023-06-24 10:15:19,758 INFO [train.py:996] (1/4) Epoch 10, batch 11850, loss[loss=0.2126, simple_loss=0.3075, pruned_loss=0.05885, over 21657.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3173, pruned_loss=0.08103, over 4280573.00 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:15:53,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1717932.0, ans=0.2 2023-06-24 10:16:19,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717992.0, ans=0.1 2023-06-24 10:16:38,312 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-24 10:16:41,804 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.048e+02 1.109e+03 1.865e+03 3.176e+03, threshold=2.218e+03, percent-clipped=24.0 2023-06-24 10:17:01,691 INFO [train.py:996] (1/4) Epoch 10, batch 11900, loss[loss=0.2557, simple_loss=0.3684, pruned_loss=0.0715, over 19731.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3189, pruned_loss=0.0786, over 4279221.86 frames. 
], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:17:04,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1718112.0, ans=0.125 2023-06-24 10:17:15,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1718112.0, ans=0.0 2023-06-24 10:17:19,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1718172.0, ans=0.125 2023-06-24 10:17:33,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1718232.0, ans=0.125 2023-06-24 10:17:34,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1718232.0, ans=10.0 2023-06-24 10:18:15,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1718292.0, ans=0.2 2023-06-24 10:18:31,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1718352.0, ans=0.05 2023-06-24 10:18:42,222 INFO [train.py:996] (1/4) Epoch 10, batch 11950, loss[loss=0.2001, simple_loss=0.3011, pruned_loss=0.04951, over 21812.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3196, pruned_loss=0.07623, over 4281463.30 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:18:42,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1718412.0, ans=0.125 2023-06-24 10:18:47,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1718412.0, ans=0.125 2023-06-24 10:19:28,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1718532.0, ans=0.0 2023-06-24 10:19:51,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1718592.0, ans=0.125 2023-06-24 10:20:07,042 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.251e+02 6.812e+02 1.209e+03 1.807e+03 3.979e+03, threshold=2.418e+03, percent-clipped=18.0 2023-06-24 10:20:21,815 INFO [train.py:996] (1/4) Epoch 10, batch 12000, loss[loss=0.2256, simple_loss=0.2936, pruned_loss=0.07882, over 21973.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.312, pruned_loss=0.07425, over 4275446.12 frames. ], batch size: 103, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:20:21,816 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 10:20:37,782 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2579, simple_loss=0.3537, pruned_loss=0.08105, over 1796401.00 frames. 
2023-06-24 10:20:37,783 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 10:21:09,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1718772.0, ans=0.025 2023-06-24 10:21:11,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1718772.0, ans=0.2 2023-06-24 10:21:32,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1718832.0, ans=0.0 2023-06-24 10:21:58,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1718892.0, ans=0.125 2023-06-24 10:22:03,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1718952.0, ans=0.0 2023-06-24 10:22:17,076 INFO [train.py:996] (1/4) Epoch 10, batch 12050, loss[loss=0.2271, simple_loss=0.2923, pruned_loss=0.08092, over 21306.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3079, pruned_loss=0.07613, over 4276341.73 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:22:38,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1719072.0, ans=0.0 2023-06-24 10:22:46,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1719072.0, ans=0.125 2023-06-24 10:23:04,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1719132.0, ans=0.0 2023-06-24 10:23:04,709 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-24 10:23:09,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-24 10:23:24,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.18 vs. limit=12.0 2023-06-24 10:23:44,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 7.636e+02 1.060e+03 1.403e+03 2.653e+03, threshold=2.120e+03, percent-clipped=2.0 2023-06-24 10:23:46,296 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:24:01,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1719252.0, ans=0.0 2023-06-24 10:24:03,744 INFO [train.py:996] (1/4) Epoch 10, batch 12100, loss[loss=0.2583, simple_loss=0.3268, pruned_loss=0.09487, over 21484.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3125, pruned_loss=0.07995, over 4279776.23 frames. 
], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:24:12,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1719312.0, ans=0.04949747468305833 2023-06-24 10:25:27,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1719552.0, ans=0.0 2023-06-24 10:25:27,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1719552.0, ans=0.125 2023-06-24 10:25:53,252 INFO [train.py:996] (1/4) Epoch 10, batch 12150, loss[loss=0.3036, simple_loss=0.3958, pruned_loss=0.1057, over 21499.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3169, pruned_loss=0.08031, over 4270419.47 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:26:59,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1719792.0, ans=0.0 2023-06-24 10:27:01,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 10:27:22,059 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 7.316e+02 1.017e+03 1.333e+03 3.987e+03, threshold=2.033e+03, percent-clipped=9.0 2023-06-24 10:27:27,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1719852.0, ans=0.0 2023-06-24 10:27:30,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.69 vs. limit=22.5 2023-06-24 10:27:34,152 INFO [train.py:996] (1/4) Epoch 10, batch 12200, loss[loss=0.2372, simple_loss=0.2912, pruned_loss=0.0916, over 21698.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3139, pruned_loss=0.07908, over 4269521.31 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:28:11,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1719972.0, ans=0.1 2023-06-24 10:28:49,262 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 10:29:13,865 INFO [train.py:996] (1/4) Epoch 10, batch 12250, loss[loss=0.1385, simple_loss=0.2048, pruned_loss=0.03612, over 21766.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3062, pruned_loss=0.07577, over 4262072.50 frames. ], batch size: 107, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:29:52,094 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-24 10:30:01,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1720332.0, ans=0.2 2023-06-24 10:30:36,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1720452.0, ans=0.0 2023-06-24 10:30:37,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.057e+02 9.242e+02 1.403e+03 3.346e+03, threshold=1.848e+03, percent-clipped=10.0 2023-06-24 10:30:52,760 INFO [train.py:996] (1/4) Epoch 10, batch 12300, loss[loss=0.2482, simple_loss=0.3574, pruned_loss=0.06947, over 21210.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2993, pruned_loss=0.07126, over 4262279.75 frames. 
], batch size: 548, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:30:55,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1720512.0, ans=0.1 2023-06-24 10:31:14,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1720572.0, ans=0.125 2023-06-24 10:31:31,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1720572.0, ans=0.125 2023-06-24 10:32:38,250 INFO [train.py:996] (1/4) Epoch 10, batch 12350, loss[loss=0.2364, simple_loss=0.3055, pruned_loss=0.08363, over 21465.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3018, pruned_loss=0.07162, over 4265918.57 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:33:44,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1720992.0, ans=0.125 2023-06-24 10:33:50,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1721052.0, ans=0.125 2023-06-24 10:33:56,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 6.849e+02 9.475e+02 1.484e+03 4.503e+03, threshold=1.895e+03, percent-clipped=12.0 2023-06-24 10:34:17,406 INFO [train.py:996] (1/4) Epoch 10, batch 12400, loss[loss=0.2607, simple_loss=0.3224, pruned_loss=0.09954, over 21490.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3061, pruned_loss=0.07531, over 4278703.13 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:35:02,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1721232.0, ans=0.125 2023-06-24 10:35:08,085 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-24 10:35:56,490 INFO [train.py:996] (1/4) Epoch 10, batch 12450, loss[loss=0.2547, simple_loss=0.325, pruned_loss=0.09218, over 21380.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3087, pruned_loss=0.07761, over 4284488.09 frames. ], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:36:06,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1721412.0, ans=0.0 2023-06-24 10:36:10,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.27 vs. 
limit=15.0 2023-06-24 10:36:11,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1721412.0, ans=0.125 2023-06-24 10:36:47,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1721532.0, ans=0.125 2023-06-24 10:36:49,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1721532.0, ans=0.2 2023-06-24 10:37:26,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 8.062e+02 1.038e+03 1.545e+03 2.621e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 10:37:37,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1721652.0, ans=0.125 2023-06-24 10:37:42,462 INFO [train.py:996] (1/4) Epoch 10, batch 12500, loss[loss=0.2699, simple_loss=0.356, pruned_loss=0.09197, over 21656.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3195, pruned_loss=0.08122, over 4285090.65 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:38:02,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721772.0, ans=0.1 2023-06-24 10:38:52,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1721892.0, ans=0.125 2023-06-24 10:39:23,805 INFO [train.py:996] (1/4) Epoch 10, batch 12550, loss[loss=0.2386, simple_loss=0.3023, pruned_loss=0.08747, over 21203.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3265, pruned_loss=0.08351, over 4283569.18 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:39:51,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1722072.0, ans=0.125 2023-06-24 10:40:27,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1722192.0, ans=0.125 2023-06-24 10:40:48,773 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.945e+02 6.642e+02 8.862e+02 1.444e+03 2.963e+03, threshold=1.772e+03, percent-clipped=6.0 2023-06-24 10:40:58,124 INFO [train.py:996] (1/4) Epoch 10, batch 12600, loss[loss=0.2006, simple_loss=0.291, pruned_loss=0.05516, over 21634.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3249, pruned_loss=0.08233, over 4286341.57 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:41:05,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1722312.0, ans=0.2 2023-06-24 10:41:24,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1722372.0, ans=0.125 2023-06-24 10:42:11,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1722492.0, ans=0.125 2023-06-24 10:42:11,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-24 10:42:30,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1722552.0, ans=0.0 2023-06-24 10:42:36,260 INFO [train.py:996] (1/4) Epoch 10, batch 12650, loss[loss=0.1517, simple_loss=0.1945, pruned_loss=0.05444, over 16437.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3172, pruned_loss=0.0791, over 4274772.71 frames. ], batch size: 60, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:43:01,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-24 10:43:21,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1722732.0, ans=0.125 2023-06-24 10:43:24,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1722732.0, ans=0.035 2023-06-24 10:43:50,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 10:44:06,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 7.127e+02 1.026e+03 1.420e+03 2.601e+03, threshold=2.052e+03, percent-clipped=16.0 2023-06-24 10:44:16,325 INFO [train.py:996] (1/4) Epoch 10, batch 12700, loss[loss=0.2685, simple_loss=0.3361, pruned_loss=0.1004, over 21806.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3154, pruned_loss=0.08013, over 4277886.46 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:44:52,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1722972.0, ans=0.2 2023-06-24 10:45:54,654 INFO [train.py:996] (1/4) Epoch 10, batch 12750, loss[loss=0.2118, simple_loss=0.2961, pruned_loss=0.06374, over 21792.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3175, pruned_loss=0.08097, over 4271713.39 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:46:43,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1723332.0, ans=0.125 2023-06-24 10:46:50,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723332.0, ans=0.1 2023-06-24 10:47:18,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 6.526e+02 8.799e+02 1.342e+03 3.585e+03, threshold=1.760e+03, percent-clipped=6.0 2023-06-24 10:47:33,314 INFO [train.py:996] (1/4) Epoch 10, batch 12800, loss[loss=0.2439, simple_loss=0.3093, pruned_loss=0.08927, over 21544.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3154, pruned_loss=0.08088, over 4277130.59 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:47:36,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-24 10:47:37,493 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. 
limit=22.5 2023-06-24 10:47:57,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1723572.0, ans=0.125 2023-06-24 10:48:17,852 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-24 10:49:18,854 INFO [train.py:996] (1/4) Epoch 10, batch 12850, loss[loss=0.2128, simple_loss=0.3006, pruned_loss=0.06247, over 21595.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3178, pruned_loss=0.08214, over 4280930.30 frames. ], batch size: 263, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:49:38,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1723872.0, ans=0.0 2023-06-24 10:49:42,494 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-24 10:49:49,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1723872.0, ans=0.125 2023-06-24 10:50:13,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1723932.0, ans=0.125 2023-06-24 10:50:48,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.383e+02 5.816e+02 7.846e+02 1.216e+03 2.443e+03, threshold=1.569e+03, percent-clipped=11.0 2023-06-24 10:50:56,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1724052.0, ans=0.0 2023-06-24 10:51:02,586 INFO [train.py:996] (1/4) Epoch 10, batch 12900, loss[loss=0.2766, simple_loss=0.3572, pruned_loss=0.09806, over 21512.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3138, pruned_loss=0.07835, over 4277607.28 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:51:29,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1724172.0, ans=0.1 2023-06-24 10:51:55,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1724232.0, ans=0.125 2023-06-24 10:52:43,504 INFO [train.py:996] (1/4) Epoch 10, batch 12950, loss[loss=0.2796, simple_loss=0.3511, pruned_loss=0.1041, over 21726.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.313, pruned_loss=0.07677, over 4272992.64 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:52:51,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1724412.0, ans=0.125 2023-06-24 10:52:51,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-24 10:53:12,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1724472.0, ans=0.2 2023-06-24 10:54:08,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1724652.0, ans=0.125 2023-06-24 10:54:08,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1724652.0, ans=0.125 2023-06-24 10:54:15,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 8.466e+02 1.346e+03 1.826e+03 3.659e+03, threshold=2.691e+03, percent-clipped=37.0 2023-06-24 10:54:23,658 INFO [train.py:996] (1/4) Epoch 10, batch 13000, loss[loss=0.17, simple_loss=0.2413, pruned_loss=0.0493, over 20993.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3125, pruned_loss=0.07746, over 4265008.59 frames. ], batch size: 143, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:55:33,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1724892.0, ans=0.125 2023-06-24 10:55:51,317 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:55:53,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1724952.0, ans=0.0 2023-06-24 10:56:01,967 INFO [train.py:996] (1/4) Epoch 10, batch 13050, loss[loss=0.2186, simple_loss=0.2939, pruned_loss=0.07167, over 21888.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3075, pruned_loss=0.07488, over 4268989.40 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:56:04,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1725012.0, ans=0.2 2023-06-24 10:56:26,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725072.0, ans=0.125 2023-06-24 10:56:31,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1725072.0, ans=0.125 2023-06-24 10:56:45,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725132.0, ans=0.1 2023-06-24 10:56:50,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1725132.0, ans=0.125 2023-06-24 10:57:14,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1725192.0, ans=0.07 2023-06-24 10:57:16,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1725192.0, ans=0.125 2023-06-24 10:57:18,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1725192.0, ans=0.125 2023-06-24 10:57:33,642 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 5.824e+02 8.081e+02 1.133e+03 2.445e+03, threshold=1.616e+03, percent-clipped=0.0 2023-06-24 10:57:41,735 INFO [train.py:996] (1/4) Epoch 10, batch 13100, loss[loss=0.205, simple_loss=0.2958, pruned_loss=0.05705, over 21769.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3108, pruned_loss=0.07574, over 4271307.96 frames. 
], batch size: 332, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:57:55,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1725312.0, ans=0.125 2023-06-24 10:58:03,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1725372.0, ans=0.125 2023-06-24 10:58:18,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725372.0, ans=0.125 2023-06-24 10:58:18,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725372.0, ans=0.1 2023-06-24 10:58:25,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1725432.0, ans=0.2 2023-06-24 10:59:08,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1725552.0, ans=0.125 2023-06-24 10:59:21,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1725552.0, ans=0.0 2023-06-24 10:59:28,021 INFO [train.py:996] (1/4) Epoch 10, batch 13150, loss[loss=0.258, simple_loss=0.3704, pruned_loss=0.07278, over 20832.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3136, pruned_loss=0.07826, over 4278060.28 frames. ], batch size: 607, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:59:52,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1725672.0, ans=0.0 2023-06-24 11:00:38,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1725792.0, ans=0.125 2023-06-24 11:00:57,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-24 11:00:59,518 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 8.375e+02 1.328e+03 1.823e+03 3.736e+03, threshold=2.655e+03, percent-clipped=31.0 2023-06-24 11:01:07,605 INFO [train.py:996] (1/4) Epoch 10, batch 13200, loss[loss=0.253, simple_loss=0.3294, pruned_loss=0.08826, over 21419.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3147, pruned_loss=0.07924, over 4280364.65 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:01:11,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1725912.0, ans=0.0 2023-06-24 11:01:23,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1725912.0, ans=0.0 2023-06-24 11:02:52,531 INFO [train.py:996] (1/4) Epoch 10, batch 13250, loss[loss=0.2298, simple_loss=0.3175, pruned_loss=0.0711, over 21854.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3139, pruned_loss=0.08134, over 4274352.86 frames. ], batch size: 371, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:03:03,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. 
limit=15.0 2023-06-24 11:03:25,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1726272.0, ans=0.0 2023-06-24 11:03:57,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1726392.0, ans=0.0 2023-06-24 11:04:00,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1726392.0, ans=0.125 2023-06-24 11:04:24,297 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.098e+02 9.382e+02 1.293e+03 1.907e+03 4.949e+03, threshold=2.585e+03, percent-clipped=10.0 2023-06-24 11:04:27,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. limit=6.0 2023-06-24 11:04:32,057 INFO [train.py:996] (1/4) Epoch 10, batch 13300, loss[loss=0.2743, simple_loss=0.3488, pruned_loss=0.0999, over 21755.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.315, pruned_loss=0.08062, over 4275428.87 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:04:34,853 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-24 11:05:01,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1726572.0, ans=0.2 2023-06-24 11:05:10,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1726632.0, ans=0.0 2023-06-24 11:05:58,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1726752.0, ans=0.0 2023-06-24 11:06:14,240 INFO [train.py:996] (1/4) Epoch 10, batch 13350, loss[loss=0.2213, simple_loss=0.3044, pruned_loss=0.06905, over 21373.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3198, pruned_loss=0.0836, over 4277222.46 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:06:25,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726812.0, ans=0.1 2023-06-24 11:06:58,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1726932.0, ans=0.125 2023-06-24 11:07:24,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1727052.0, ans=0.125 2023-06-24 11:07:26,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1727052.0, ans=0.125 2023-06-24 11:07:39,413 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.530e+02 6.336e+02 8.525e+02 1.254e+03 2.418e+03, threshold=1.705e+03, percent-clipped=0.0 2023-06-24 11:07:50,834 INFO [train.py:996] (1/4) Epoch 10, batch 13400, loss[loss=0.2247, simple_loss=0.2945, pruned_loss=0.07748, over 21601.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3218, pruned_loss=0.08486, over 4281073.20 frames. 
], batch size: 548, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:07:54,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1727112.0, ans=0.125 2023-06-24 11:08:22,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1727172.0, ans=0.125 2023-06-24 11:08:25,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 11:08:30,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1727232.0, ans=0.0 2023-06-24 11:09:27,803 INFO [train.py:996] (1/4) Epoch 10, batch 13450, loss[loss=0.2868, simple_loss=0.3465, pruned_loss=0.1136, over 21657.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3238, pruned_loss=0.08724, over 4274624.39 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:09:37,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1727412.0, ans=0.125 2023-06-24 11:09:49,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1727472.0, ans=0.0 2023-06-24 11:09:49,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1727472.0, ans=0.125 2023-06-24 11:09:51,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-24 11:10:06,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1727472.0, ans=0.125 2023-06-24 11:10:50,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1727652.0, ans=0.125 2023-06-24 11:10:52,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.94 vs. limit=15.0 2023-06-24 11:10:59,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.972e+02 7.908e+02 1.192e+03 1.823e+03 3.915e+03, threshold=2.384e+03, percent-clipped=24.0 2023-06-24 11:11:02,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1727652.0, ans=0.5 2023-06-24 11:11:06,217 INFO [train.py:996] (1/4) Epoch 10, batch 13500, loss[loss=0.1783, simple_loss=0.2349, pruned_loss=0.06084, over 21319.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3157, pruned_loss=0.08383, over 4270727.60 frames. ], batch size: 159, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:11:13,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. 
limit=15.0 2023-06-24 11:11:27,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1727772.0, ans=0.95 2023-06-24 11:11:29,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1727772.0, ans=0.0 2023-06-24 11:12:15,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1727892.0, ans=10.0 2023-06-24 11:12:16,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1727892.0, ans=0.1 2023-06-24 11:12:23,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1727892.0, ans=0.2 2023-06-24 11:12:50,304 INFO [train.py:996] (1/4) Epoch 10, batch 13550, loss[loss=0.2642, simple_loss=0.3647, pruned_loss=0.08183, over 21766.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3192, pruned_loss=0.08378, over 4273628.64 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:13:13,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1728072.0, ans=0.0 2023-06-24 11:13:27,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1728132.0, ans=0.2 2023-06-24 11:13:29,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-24 11:13:30,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1728132.0, ans=0.0 2023-06-24 11:13:40,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1728132.0, ans=0.2 2023-06-24 11:14:07,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1728252.0, ans=0.125 2023-06-24 11:14:08,413 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.60 vs. limit=10.0 2023-06-24 11:14:11,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1728252.0, ans=0.125 2023-06-24 11:14:14,897 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 7.999e+02 1.250e+03 1.835e+03 3.854e+03, threshold=2.499e+03, percent-clipped=11.0 2023-06-24 11:14:17,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1728252.0, ans=0.0 2023-06-24 11:14:21,352 INFO [train.py:996] (1/4) Epoch 10, batch 13600, loss[loss=0.2734, simple_loss=0.3466, pruned_loss=0.1001, over 21581.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3211, pruned_loss=0.08505, over 4270896.88 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:15:59,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1728552.0, ans=0.5 2023-06-24 11:16:02,447 INFO [train.py:996] (1/4) Epoch 10, batch 13650, loss[loss=0.2218, simple_loss=0.2806, pruned_loss=0.08155, over 21513.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3162, pruned_loss=0.0811, over 4268764.09 frames. 
], batch size: 441, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:16:03,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=12.0 2023-06-24 11:16:05,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-24 11:16:11,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 11:16:52,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1728732.0, ans=0.05 2023-06-24 11:16:56,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1728732.0, ans=0.125 2023-06-24 11:17:11,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1728792.0, ans=0.0 2023-06-24 11:17:12,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1728792.0, ans=0.125 2023-06-24 11:17:29,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.339e+02 6.847e+02 1.012e+03 1.771e+03 3.769e+03, threshold=2.024e+03, percent-clipped=10.0 2023-06-24 11:17:38,800 INFO [train.py:996] (1/4) Epoch 10, batch 13700, loss[loss=0.2315, simple_loss=0.359, pruned_loss=0.05202, over 19793.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3119, pruned_loss=0.07988, over 4270096.39 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:17:45,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1728912.0, ans=0.0 2023-06-24 11:18:38,377 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-24 11:18:48,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.89 vs. limit=6.0 2023-06-24 11:19:16,515 INFO [train.py:996] (1/4) Epoch 10, batch 13750, loss[loss=0.1637, simple_loss=0.215, pruned_loss=0.05619, over 21790.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3078, pruned_loss=0.0785, over 4263846.16 frames. ], batch size: 102, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:20:56,605 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 6.962e+02 1.196e+03 1.876e+03 4.514e+03, threshold=2.392e+03, percent-clipped=21.0 2023-06-24 11:21:05,680 INFO [train.py:996] (1/4) Epoch 10, batch 13800, loss[loss=0.1843, simple_loss=0.2469, pruned_loss=0.06086, over 21866.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3132, pruned_loss=0.07709, over 4272932.11 frames. 
], batch size: 107, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:21:12,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1729512.0, ans=0.0 2023-06-24 11:21:25,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1729572.0, ans=0.125 2023-06-24 11:21:27,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-24 11:22:10,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1729692.0, ans=0.0 2023-06-24 11:22:49,294 INFO [train.py:996] (1/4) Epoch 10, batch 13850, loss[loss=0.2257, simple_loss=0.3156, pruned_loss=0.06794, over 20680.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3182, pruned_loss=0.07806, over 4270354.24 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:23:25,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729932.0, ans=0.1 2023-06-24 11:23:53,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1729992.0, ans=0.0 2023-06-24 11:24:07,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1730052.0, ans=0.125 2023-06-24 11:24:21,360 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.694e+02 1.020e+03 1.467e+03 3.637e+03, threshold=2.040e+03, percent-clipped=4.0 2023-06-24 11:24:25,892 INFO [train.py:996] (1/4) Epoch 10, batch 13900, loss[loss=0.2471, simple_loss=0.3059, pruned_loss=0.09417, over 20025.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.322, pruned_loss=0.0816, over 4271950.83 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:25:07,128 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-24 11:25:31,455 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:26:02,522 INFO [train.py:996] (1/4) Epoch 10, batch 13950, loss[loss=0.2602, simple_loss=0.3175, pruned_loss=0.1015, over 21325.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3217, pruned_loss=0.08375, over 4273881.42 frames. 
], batch size: 176, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:26:30,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1730472.0, ans=0.125 2023-06-24 11:26:56,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1730592.0, ans=0.125 2023-06-24 11:26:56,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1730592.0, ans=0.09899494936611666 2023-06-24 11:27:07,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1730592.0, ans=0.0 2023-06-24 11:27:19,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1730652.0, ans=0.125 2023-06-24 11:27:32,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1730652.0, ans=0.0 2023-06-24 11:27:32,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-24 11:27:33,036 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.002e+02 6.836e+02 9.588e+02 1.525e+03 4.378e+03, threshold=1.918e+03, percent-clipped=10.0 2023-06-24 11:27:37,581 INFO [train.py:996] (1/4) Epoch 10, batch 14000, loss[loss=0.1884, simple_loss=0.2634, pruned_loss=0.05667, over 21367.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3176, pruned_loss=0.08051, over 4269818.17 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:27:58,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1730772.0, ans=0.125 2023-06-24 11:28:04,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1730772.0, ans=0.0 2023-06-24 11:28:08,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1730832.0, ans=0.125 2023-06-24 11:28:28,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1730832.0, ans=0.0 2023-06-24 11:28:28,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1730832.0, ans=0.2 2023-06-24 11:28:36,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1730892.0, ans=0.0 2023-06-24 11:29:07,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1730952.0, ans=0.04949747468305833 2023-06-24 11:29:13,001 INFO [train.py:996] (1/4) Epoch 10, batch 14050, loss[loss=0.2449, simple_loss=0.3017, pruned_loss=0.09411, over 21374.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3114, pruned_loss=0.07652, over 4273791.48 frames. 
], batch size: 507, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:29:31,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1731072.0, ans=0.2 2023-06-24 11:30:00,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1731132.0, ans=0.125 2023-06-24 11:30:28,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1731252.0, ans=0.1 2023-06-24 11:30:45,126 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 8.037e+02 1.202e+03 1.798e+03 5.374e+03, threshold=2.404e+03, percent-clipped=21.0 2023-06-24 11:30:48,141 INFO [train.py:996] (1/4) Epoch 10, batch 14100, loss[loss=0.2647, simple_loss=0.3242, pruned_loss=0.1026, over 21354.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3074, pruned_loss=0.0769, over 4270316.52 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:31:09,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1731372.0, ans=0.0 2023-06-24 11:31:12,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1731372.0, ans=0.125 2023-06-24 11:31:48,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1731492.0, ans=0.0 2023-06-24 11:31:56,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1731492.0, ans=0.125 2023-06-24 11:31:59,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1731492.0, ans=0.025 2023-06-24 11:32:02,527 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:32:23,916 INFO [train.py:996] (1/4) Epoch 10, batch 14150, loss[loss=0.2384, simple_loss=0.3173, pruned_loss=0.07978, over 21833.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.31, pruned_loss=0.07774, over 4266507.49 frames. ], batch size: 102, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:32:30,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1731612.0, ans=0.125 2023-06-24 11:32:45,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1731672.0, ans=0.0 2023-06-24 11:33:34,357 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.57 vs. limit=8.0 2023-06-24 11:33:42,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1731852.0, ans=0.125 2023-06-24 11:33:43,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1731852.0, ans=0.1 2023-06-24 11:33:50,836 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.112e+02 7.492e+02 9.193e+02 1.976e+03, threshold=1.498e+03, percent-clipped=0.0 2023-06-24 11:33:58,956 INFO [train.py:996] (1/4) Epoch 10, batch 14200, loss[loss=0.2148, simple_loss=0.2789, pruned_loss=0.07541, over 20226.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3084, pruned_loss=0.07666, over 4276141.30 frames. 
], batch size: 703, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:34:09,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1731912.0, ans=0.125 2023-06-24 11:34:25,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1731972.0, ans=0.05 2023-06-24 11:34:34,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732032.0, ans=0.1 2023-06-24 11:34:34,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1732032.0, ans=0.125 2023-06-24 11:35:24,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1732152.0, ans=0.09899494936611666 2023-06-24 11:35:25,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=22.5 2023-06-24 11:35:29,037 INFO [train.py:996] (1/4) Epoch 10, batch 14250, loss[loss=0.2057, simple_loss=0.2698, pruned_loss=0.0708, over 21595.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3025, pruned_loss=0.07618, over 4271354.59 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:35:45,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-24 11:35:57,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-24 11:36:24,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732332.0, ans=0.125 2023-06-24 11:37:02,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1732452.0, ans=0.125 2023-06-24 11:37:05,374 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.862e+02 7.753e+02 1.347e+03 3.974e+03, threshold=1.551e+03, percent-clipped=20.0 2023-06-24 11:37:08,447 INFO [train.py:996] (1/4) Epoch 10, batch 14300, loss[loss=0.3052, simple_loss=0.3961, pruned_loss=0.1072, over 21784.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3079, pruned_loss=0.07706, over 4268596.33 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:37:35,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1732572.0, ans=0.0 2023-06-24 11:37:36,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732572.0, ans=0.125 2023-06-24 11:38:44,754 INFO [train.py:996] (1/4) Epoch 10, batch 14350, loss[loss=0.145, simple_loss=0.1991, pruned_loss=0.04548, over 16337.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3124, pruned_loss=0.07716, over 4256359.99 frames. ], batch size: 61, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:40:17,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.258e+02 7.078e+02 1.011e+03 1.349e+03 3.463e+03, threshold=2.022e+03, percent-clipped=22.0 2023-06-24 11:40:23,246 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. 
limit=15.0 2023-06-24 11:40:25,273 INFO [train.py:996] (1/4) Epoch 10, batch 14400, loss[loss=0.245, simple_loss=0.3063, pruned_loss=0.09187, over 21824.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3104, pruned_loss=0.07793, over 4261280.13 frames. ], batch size: 118, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:40:27,222 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1733112.0, ans=0.07 2023-06-24 11:41:03,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1733232.0, ans=0.2 2023-06-24 11:41:35,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1733352.0, ans=0.125 2023-06-24 11:41:54,336 INFO [train.py:996] (1/4) Epoch 10, batch 14450, loss[loss=0.202, simple_loss=0.2587, pruned_loss=0.07264, over 21236.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3053, pruned_loss=0.07862, over 4269448.13 frames. ], batch size: 548, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:42:36,285 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-24 11:42:38,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1733532.0, ans=0.125 2023-06-24 11:42:51,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1733592.0, ans=0.125 2023-06-24 11:43:03,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1733592.0, ans=0.125 2023-06-24 11:43:19,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 11:43:23,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 6.420e+02 9.253e+02 1.365e+03 3.104e+03, threshold=1.851e+03, percent-clipped=3.0 2023-06-24 11:43:26,369 INFO [train.py:996] (1/4) Epoch 10, batch 14500, loss[loss=0.2109, simple_loss=0.3016, pruned_loss=0.06013, over 21536.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3023, pruned_loss=0.07807, over 4268816.44 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:43:51,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1733772.0, ans=0.025 2023-06-24 11:43:59,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. 
limit=15.0 2023-06-24 11:44:06,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1733832.0, ans=0.2 2023-06-24 11:44:25,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1733832.0, ans=0.2 2023-06-24 11:44:28,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1733892.0, ans=0.125 2023-06-24 11:44:39,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1733892.0, ans=0.0 2023-06-24 11:44:40,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1733892.0, ans=0.1 2023-06-24 11:44:51,007 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-24 11:45:08,985 INFO [train.py:996] (1/4) Epoch 10, batch 14550, loss[loss=0.2477, simple_loss=0.319, pruned_loss=0.0882, over 21376.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3064, pruned_loss=0.07943, over 4272765.68 frames. ], batch size: 549, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:45:22,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-24 11:45:34,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1734072.0, ans=0.0 2023-06-24 11:45:52,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1734132.0, ans=0.125 2023-06-24 11:46:20,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734192.0, ans=0.125 2023-06-24 11:46:34,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1734252.0, ans=0.0 2023-06-24 11:46:42,612 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.617e+02 6.748e+02 9.842e+02 1.367e+03 3.226e+03, threshold=1.968e+03, percent-clipped=9.0 2023-06-24 11:46:45,783 INFO [train.py:996] (1/4) Epoch 10, batch 14600, loss[loss=0.2394, simple_loss=0.3278, pruned_loss=0.07547, over 21799.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3151, pruned_loss=0.08311, over 4278372.61 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:47:10,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1734372.0, ans=0.0 2023-06-24 11:47:35,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1734432.0, ans=0.0 2023-06-24 11:47:46,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1734492.0, ans=0.0 2023-06-24 11:48:17,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-24 11:48:21,373 INFO [train.py:996] (1/4) Epoch 10, batch 14650, loss[loss=0.1826, simple_loss=0.2599, pruned_loss=0.05267, over 21400.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3173, pruned_loss=0.08225, over 4283596.39 frames. 
], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:49:02,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1734732.0, ans=0.0 2023-06-24 11:49:19,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1734792.0, ans=0.125 2023-06-24 11:49:30,389 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:49:46,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-24 11:49:54,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 6.829e+02 9.850e+02 1.571e+03 3.523e+03, threshold=1.970e+03, percent-clipped=13.0 2023-06-24 11:49:57,777 INFO [train.py:996] (1/4) Epoch 10, batch 14700, loss[loss=0.2018, simple_loss=0.2986, pruned_loss=0.05253, over 21689.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3118, pruned_loss=0.07732, over 4276195.90 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:50:02,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1734912.0, ans=0.125 2023-06-24 11:50:06,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1734912.0, ans=0.04949747468305833 2023-06-24 11:50:42,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1735032.0, ans=0.125 2023-06-24 11:50:44,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1735032.0, ans=0.2 2023-06-24 11:50:47,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1735032.0, ans=0.125 2023-06-24 11:50:53,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.12 vs. limit=15.0 2023-06-24 11:51:36,605 INFO [train.py:996] (1/4) Epoch 10, batch 14750, loss[loss=0.2389, simple_loss=0.317, pruned_loss=0.08043, over 21557.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3148, pruned_loss=0.07876, over 4270440.52 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:51:44,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-24 11:52:49,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1735392.0, ans=0.125 2023-06-24 11:53:06,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1735452.0, ans=0.5 2023-06-24 11:53:10,540 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.762e+02 1.072e+03 1.702e+03 3.196e+03, threshold=2.144e+03, percent-clipped=17.0 2023-06-24 11:53:13,747 INFO [train.py:996] (1/4) Epoch 10, batch 14800, loss[loss=0.2111, simple_loss=0.2837, pruned_loss=0.06929, over 21562.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3263, pruned_loss=0.08443, over 4275249.40 frames. 
], batch size: 263, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:53:29,460 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=8.0 2023-06-24 11:53:53,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1735572.0, ans=0.0 2023-06-24 11:54:03,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1735632.0, ans=0.125 2023-06-24 11:54:06,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1735632.0, ans=0.1 2023-06-24 11:54:31,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.60 vs. limit=6.0 2023-06-24 11:54:36,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-24 11:55:02,599 INFO [train.py:996] (1/4) Epoch 10, batch 14850, loss[loss=0.2286, simple_loss=0.2948, pruned_loss=0.08126, over 21647.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3198, pruned_loss=0.08347, over 4266294.90 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:55:11,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1735812.0, ans=0.125 2023-06-24 11:55:35,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1735872.0, ans=0.0 2023-06-24 11:55:49,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1735932.0, ans=0.0 2023-06-24 11:56:24,194 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=15.0 2023-06-24 11:56:37,528 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.296e+02 7.240e+02 1.054e+03 1.566e+03 3.588e+03, threshold=2.108e+03, percent-clipped=9.0 2023-06-24 11:56:40,615 INFO [train.py:996] (1/4) Epoch 10, batch 14900, loss[loss=0.224, simple_loss=0.2935, pruned_loss=0.07721, over 21624.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3215, pruned_loss=0.08526, over 4266006.86 frames. ], batch size: 112, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:57:12,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1736172.0, ans=0.125 2023-06-24 11:58:28,368 INFO [train.py:996] (1/4) Epoch 10, batch 14950, loss[loss=0.2813, simple_loss=0.3602, pruned_loss=0.1012, over 21419.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3235, pruned_loss=0.08541, over 4266477.12 frames. 
], batch size: 131, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:58:33,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736412.0, ans=0.1 2023-06-24 11:59:23,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1736592.0, ans=0.125 2023-06-24 11:59:52,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1736652.0, ans=0.125 2023-06-24 12:00:04,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.178e+02 7.042e+02 9.483e+02 1.424e+03 2.881e+03, threshold=1.897e+03, percent-clipped=9.0 2023-06-24 12:00:06,694 INFO [train.py:996] (1/4) Epoch 10, batch 15000, loss[loss=0.2558, simple_loss=0.3366, pruned_loss=0.08745, over 20680.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3245, pruned_loss=0.08625, over 4268503.03 frames. ], batch size: 607, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:00:06,694 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 12:00:22,747 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2522, simple_loss=0.3488, pruned_loss=0.07776, over 1796401.00 frames. 2023-06-24 12:00:22,748 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 12:00:30,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-24 12:01:19,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1736892.0, ans=0.125 2023-06-24 12:01:34,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1736892.0, ans=0.0 2023-06-24 12:01:40,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1736892.0, ans=0.0 2023-06-24 12:02:00,958 INFO [train.py:996] (1/4) Epoch 10, batch 15050, loss[loss=0.2601, simple_loss=0.3423, pruned_loss=0.08893, over 21648.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3271, pruned_loss=0.0876, over 4272047.67 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:02:10,737 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:02:43,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1737132.0, ans=0.125 2023-06-24 12:03:11,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.86 vs. 
limit=15.0 2023-06-24 12:03:18,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1737192.0, ans=0.2 2023-06-24 12:03:29,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1737252.0, ans=0.125 2023-06-24 12:03:29,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1737252.0, ans=0.0 2023-06-24 12:03:36,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.670e+02 1.430e+03 2.221e+03 3.965e+03, threshold=2.861e+03, percent-clipped=33.0 2023-06-24 12:03:38,491 INFO [train.py:996] (1/4) Epoch 10, batch 15100, loss[loss=0.2809, simple_loss=0.3539, pruned_loss=0.104, over 21832.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3295, pruned_loss=0.0868, over 4273395.34 frames. ], batch size: 118, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:03:39,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1737312.0, ans=0.125 2023-06-24 12:03:42,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1737312.0, ans=0.125 2023-06-24 12:03:46,113 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-24 12:04:41,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1737432.0, ans=0.0 2023-06-24 12:04:52,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737492.0, ans=0.1 2023-06-24 12:05:15,769 INFO [train.py:996] (1/4) Epoch 10, batch 15150, loss[loss=0.2256, simple_loss=0.3004, pruned_loss=0.07541, over 21380.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3269, pruned_loss=0.0878, over 4278825.19 frames. ], batch size: 548, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:05:16,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-24 12:05:46,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1737672.0, ans=0.125 2023-06-24 12:05:46,676 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-24 12:06:21,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-24 12:06:46,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.42 vs. limit=15.0 2023-06-24 12:06:46,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1737852.0, ans=0.125 2023-06-24 12:06:55,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.279e+02 6.808e+02 1.091e+03 1.698e+03 5.270e+03, threshold=2.181e+03, percent-clipped=2.0 2023-06-24 12:07:02,091 INFO [train.py:996] (1/4) Epoch 10, batch 15200, loss[loss=0.2018, simple_loss=0.2672, pruned_loss=0.06823, over 21725.00 frames. 
], tot_loss[loss=0.2426, simple_loss=0.3172, pruned_loss=0.084, over 4264431.45 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 12:07:05,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1737912.0, ans=0.04949747468305833 2023-06-24 12:08:02,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=22.5 2023-06-24 12:08:09,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1738092.0, ans=0.05 2023-06-24 12:08:32,601 INFO [train.py:996] (1/4) Epoch 10, batch 15250, loss[loss=0.2648, simple_loss=0.3299, pruned_loss=0.09983, over 21752.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3112, pruned_loss=0.08229, over 4261678.94 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:08:32,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1738212.0, ans=0.125 2023-06-24 12:09:34,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-24 12:09:57,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1738452.0, ans=0.0 2023-06-24 12:10:17,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 9.550e+02 1.644e+03 2.423e+03 4.460e+03, threshold=3.287e+03, percent-clipped=35.0 2023-06-24 12:10:17,258 INFO [train.py:996] (1/4) Epoch 10, batch 15300, loss[loss=0.2443, simple_loss=0.3203, pruned_loss=0.08414, over 20713.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3143, pruned_loss=0.08496, over 4259190.23 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:10:17,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1738512.0, ans=0.125 2023-06-24 12:10:21,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1738512.0, ans=0.0 2023-06-24 12:10:22,496 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1738512.0, ans=0.0 2023-06-24 12:10:25,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1738512.0, ans=0.2 2023-06-24 12:11:16,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1738692.0, ans=0.125 2023-06-24 12:11:54,393 INFO [train.py:996] (1/4) Epoch 10, batch 15350, loss[loss=0.2637, simple_loss=0.3321, pruned_loss=0.09764, over 21604.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.319, pruned_loss=0.0874, over 4271332.08 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:13:24,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.880e+02 1.088e+03 1.633e+03 3.514e+03, threshold=2.175e+03, percent-clipped=1.0 2023-06-24 12:13:24,456 INFO [train.py:996] (1/4) Epoch 10, batch 15400, loss[loss=0.2684, simple_loss=0.3355, pruned_loss=0.1007, over 21867.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3199, pruned_loss=0.08548, over 4275586.27 frames. 
], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:14:26,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1739292.0, ans=0.125 2023-06-24 12:15:05,364 INFO [train.py:996] (1/4) Epoch 10, batch 15450, loss[loss=0.2191, simple_loss=0.3, pruned_loss=0.06906, over 21432.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3192, pruned_loss=0.08442, over 4269708.51 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:15:13,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1739412.0, ans=0.1 2023-06-24 12:16:07,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-24 12:16:43,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.285e+02 1.027e+03 1.632e+03 3.153e+03, threshold=2.054e+03, percent-clipped=10.0 2023-06-24 12:16:43,423 INFO [train.py:996] (1/4) Epoch 10, batch 15500, loss[loss=0.268, simple_loss=0.3354, pruned_loss=0.1003, over 21605.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3204, pruned_loss=0.08405, over 4258185.77 frames. ], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:17:23,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-24 12:17:33,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1739832.0, ans=0.125 2023-06-24 12:18:21,660 INFO [train.py:996] (1/4) Epoch 10, batch 15550, loss[loss=0.285, simple_loss=0.3673, pruned_loss=0.1014, over 20038.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3189, pruned_loss=0.0828, over 4252021.41 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:18:41,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-24 12:18:51,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1740072.0, ans=0.2 2023-06-24 12:19:47,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1740252.0, ans=0.0 2023-06-24 12:19:51,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1740252.0, ans=0.125 2023-06-24 12:19:54,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1740252.0, ans=0.0 2023-06-24 12:19:58,918 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.243e+02 5.866e+02 9.218e+02 1.616e+03 3.082e+03, threshold=1.844e+03, percent-clipped=8.0 2023-06-24 12:19:58,940 INFO [train.py:996] (1/4) Epoch 10, batch 15600, loss[loss=0.2316, simple_loss=0.3107, pruned_loss=0.07629, over 21612.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3117, pruned_loss=0.08115, over 4252193.06 frames. 
], batch size: 247, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:20:21,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1740312.0, ans=0.2 2023-06-24 12:21:21,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1740552.0, ans=0.0 2023-06-24 12:21:30,709 INFO [train.py:996] (1/4) Epoch 10, batch 15650, loss[loss=0.2101, simple_loss=0.2789, pruned_loss=0.07065, over 21856.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08002, over 4266695.47 frames. ], batch size: 373, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:21:53,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1740612.0, ans=0.0 2023-06-24 12:21:59,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1740672.0, ans=0.125 2023-06-24 12:22:25,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740792.0, ans=0.1 2023-06-24 12:23:06,127 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:23:07,103 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 7.365e+02 1.096e+03 1.379e+03 2.536e+03, threshold=2.192e+03, percent-clipped=6.0 2023-06-24 12:23:07,134 INFO [train.py:996] (1/4) Epoch 10, batch 15700, loss[loss=0.185, simple_loss=0.2508, pruned_loss=0.05961, over 21213.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3054, pruned_loss=0.079, over 4255184.67 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:23:17,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-24 12:24:25,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1741092.0, ans=0.125 2023-06-24 12:24:34,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1741152.0, ans=0.2 2023-06-24 12:24:43,541 INFO [train.py:996] (1/4) Epoch 10, batch 15750, loss[loss=0.2213, simple_loss=0.2899, pruned_loss=0.07631, over 21502.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3008, pruned_loss=0.07797, over 4267367.12 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:25:16,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1741272.0, ans=0.125 2023-06-24 12:25:42,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1741392.0, ans=0.125 2023-06-24 12:26:12,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1741512.0, ans=0.0 2023-06-24 12:26:13,777 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.443e+02 6.609e+02 9.142e+02 1.184e+03 2.398e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 12:26:13,809 INFO [train.py:996] (1/4) Epoch 10, batch 15800, loss[loss=0.2345, simple_loss=0.2936, pruned_loss=0.08768, over 21889.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2961, pruned_loss=0.07766, over 4267577.88 frames. 
], batch size: 373, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:27:39,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1741752.0, ans=0.0 2023-06-24 12:27:49,621 INFO [train.py:996] (1/4) Epoch 10, batch 15850, loss[loss=0.2289, simple_loss=0.2934, pruned_loss=0.08224, over 21769.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.299, pruned_loss=0.07884, over 4266804.52 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:28:26,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1741872.0, ans=0.0 2023-06-24 12:28:29,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1741932.0, ans=0.125 2023-06-24 12:29:02,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1741992.0, ans=0.0 2023-06-24 12:29:08,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1741992.0, ans=0.04949747468305833 2023-06-24 12:29:17,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1742052.0, ans=0.125 2023-06-24 12:29:26,904 INFO [train.py:996] (1/4) Epoch 10, batch 15900, loss[loss=0.192, simple_loss=0.2637, pruned_loss=0.06016, over 21803.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2991, pruned_loss=0.07977, over 4273761.81 frames. ], batch size: 352, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:29:28,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 8.317e+02 1.237e+03 1.605e+03 4.098e+03, threshold=2.474e+03, percent-clipped=15.0 2023-06-24 12:30:54,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1742352.0, ans=12.0 2023-06-24 12:30:58,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1742352.0, ans=0.0 2023-06-24 12:31:04,689 INFO [train.py:996] (1/4) Epoch 10, batch 15950, loss[loss=0.1921, simple_loss=0.2758, pruned_loss=0.05419, over 21336.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2995, pruned_loss=0.07626, over 4279962.57 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:31:08,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1742412.0, ans=0.0 2023-06-24 12:31:31,568 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:32:43,179 INFO [train.py:996] (1/4) Epoch 10, batch 16000, loss[loss=0.2309, simple_loss=0.3273, pruned_loss=0.06727, over 21661.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3012, pruned_loss=0.07472, over 4276252.22 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:32:44,658 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 6.384e+02 8.996e+02 1.327e+03 2.604e+03, threshold=1.799e+03, percent-clipped=2.0 2023-06-24 12:32:58,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. 
limit=6.0 2023-06-24 12:33:05,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1742772.0, ans=0.5 2023-06-24 12:33:16,812 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:33:34,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1742832.0, ans=0.2 2023-06-24 12:33:53,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742892.0, ans=0.1 2023-06-24 12:33:54,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742892.0, ans=0.1 2023-06-24 12:34:08,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1742952.0, ans=0.2 2023-06-24 12:34:11,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.92 vs. limit=22.5 2023-06-24 12:34:20,853 INFO [train.py:996] (1/4) Epoch 10, batch 16050, loss[loss=0.2289, simple_loss=0.2799, pruned_loss=0.08894, over 20382.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3021, pruned_loss=0.07263, over 4271411.91 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:34:32,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1743012.0, ans=0.0 2023-06-24 12:34:50,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1743072.0, ans=0.2 2023-06-24 12:35:51,319 INFO [train.py:996] (1/4) Epoch 10, batch 16100, loss[loss=0.238, simple_loss=0.3063, pruned_loss=0.08488, over 21617.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3072, pruned_loss=0.07539, over 4276251.08 frames. ], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:54,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 6.034e+02 8.242e+02 1.333e+03 2.832e+03, threshold=1.648e+03, percent-clipped=8.0 2023-06-24 12:36:03,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-24 12:36:04,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1743312.0, ans=0.2 2023-06-24 12:36:08,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743372.0, ans=0.1 2023-06-24 12:36:28,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1743372.0, ans=0.2 2023-06-24 12:37:21,237 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:37:26,917 INFO [train.py:996] (1/4) Epoch 10, batch 16150, loss[loss=0.2587, simple_loss=0.3226, pruned_loss=0.09736, over 21775.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3069, pruned_loss=0.07779, over 4291482.13 frames. 
], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:38:24,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1743792.0, ans=0.125 2023-06-24 12:38:43,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743852.0, ans=0.1 2023-06-24 12:38:46,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1743852.0, ans=0.125 2023-06-24 12:38:54,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1743852.0, ans=0.125 2023-06-24 12:39:05,206 INFO [train.py:996] (1/4) Epoch 10, batch 16200, loss[loss=0.2763, simple_loss=0.3471, pruned_loss=0.1028, over 21786.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3106, pruned_loss=0.07909, over 4287638.16 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:39:08,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 7.202e+02 1.055e+03 1.408e+03 3.192e+03, threshold=2.110e+03, percent-clipped=15.0 2023-06-24 12:39:39,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1743972.0, ans=0.0 2023-06-24 12:39:50,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1744032.0, ans=0.0 2023-06-24 12:40:09,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1744092.0, ans=0.125 2023-06-24 12:40:10,008 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:40:12,278 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 12:40:25,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-24 12:40:37,343 INFO [train.py:996] (1/4) Epoch 10, batch 16250, loss[loss=0.2108, simple_loss=0.2908, pruned_loss=0.06538, over 21517.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3101, pruned_loss=0.07847, over 4275979.92 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:40:52,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1744212.0, ans=0.125 2023-06-24 12:41:34,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1744332.0, ans=0.2 2023-06-24 12:41:44,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-24 12:42:01,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-24 12:42:18,189 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=15.0 2023-06-24 12:42:18,495 INFO [train.py:996] (1/4) Epoch 10, batch 16300, loss[loss=0.2314, simple_loss=0.3456, pruned_loss=0.0586, over 19863.00 frames. 
], tot_loss[loss=0.2266, simple_loss=0.3045, pruned_loss=0.07439, over 4277933.21 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:42:27,041 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.511e+02 9.054e+02 1.473e+03 4.161e+03, threshold=1.811e+03, percent-clipped=10.0 2023-06-24 12:42:48,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-06-24 12:42:53,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1744632.0, ans=0.125 2023-06-24 12:44:00,838 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-24 12:44:01,360 INFO [train.py:996] (1/4) Epoch 10, batch 16350, loss[loss=0.2328, simple_loss=0.31, pruned_loss=0.07779, over 19917.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3056, pruned_loss=0.07585, over 4276725.54 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 8.0 2023-06-24 12:44:28,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1744872.0, ans=0.125 2023-06-24 12:45:23,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-24 12:45:38,335 INFO [train.py:996] (1/4) Epoch 10, batch 16400, loss[loss=0.2269, simple_loss=0.2983, pruned_loss=0.07774, over 21374.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3129, pruned_loss=0.07859, over 4278458.21 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:45:41,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1745112.0, ans=0.125 2023-06-24 12:45:42,805 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.368e+02 7.745e+02 1.144e+03 1.661e+03 2.943e+03, threshold=2.288e+03, percent-clipped=22.0 2023-06-24 12:46:00,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1745172.0, ans=0.125 2023-06-24 12:46:25,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1745232.0, ans=0.0 2023-06-24 12:46:44,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1745292.0, ans=0.0 2023-06-24 12:46:51,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745292.0, ans=0.1 2023-06-24 12:46:57,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1745352.0, ans=0.2 2023-06-24 12:47:16,261 INFO [train.py:996] (1/4) Epoch 10, batch 16450, loss[loss=0.2613, simple_loss=0.3214, pruned_loss=0.1006, over 21621.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3138, pruned_loss=0.08029, over 4283569.86 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:47:39,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1745472.0, ans=0.025 2023-06-24 12:47:47,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1745532.0, ans=0.125 2023-06-24 12:48:16,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1745592.0, ans=0.2 2023-06-24 12:48:23,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1745592.0, ans=0.125 2023-06-24 12:48:51,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1745652.0, ans=0.125 2023-06-24 12:48:52,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1745712.0, ans=0.125 2023-06-24 12:48:53,498 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-24 12:48:53,694 INFO [train.py:996] (1/4) Epoch 10, batch 16500, loss[loss=0.1322, simple_loss=0.1733, pruned_loss=0.04554, over 16237.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.31, pruned_loss=0.07985, over 4283323.69 frames. ], batch size: 61, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:48:58,419 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.573e+02 7.722e+02 1.056e+03 1.682e+03 4.861e+03, threshold=2.112e+03, percent-clipped=4.0 2023-06-24 12:50:00,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1745892.0, ans=0.1 2023-06-24 12:50:31,383 INFO [train.py:996] (1/4) Epoch 10, batch 16550, loss[loss=0.238, simple_loss=0.3109, pruned_loss=0.08258, over 21461.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3074, pruned_loss=0.07711, over 4280942.94 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:50:39,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1746012.0, ans=0.125 2023-06-24 12:51:26,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1746132.0, ans=0.125 2023-06-24 12:51:37,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1746132.0, ans=0.07 2023-06-24 12:51:39,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1746192.0, ans=0.125 2023-06-24 12:52:16,514 INFO [train.py:996] (1/4) Epoch 10, batch 16600, loss[loss=0.3564, simple_loss=0.435, pruned_loss=0.1389, over 21407.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3163, pruned_loss=0.08041, over 4280650.11 frames. 
], batch size: 507, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:52:21,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 8.131e+02 1.242e+03 1.757e+03 3.477e+03, threshold=2.484e+03, percent-clipped=12.0 2023-06-24 12:52:24,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1746312.0, ans=0.0 2023-06-24 12:53:11,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-24 12:54:00,393 INFO [train.py:996] (1/4) Epoch 10, batch 16650, loss[loss=0.3023, simple_loss=0.3693, pruned_loss=0.1177, over 21346.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3236, pruned_loss=0.0826, over 4278938.22 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:54:16,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1746612.0, ans=0.125 2023-06-24 12:54:20,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1746672.0, ans=0.0 2023-06-24 12:54:38,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-24 12:55:33,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1746852.0, ans=0.125 2023-06-24 12:55:44,456 INFO [train.py:996] (1/4) Epoch 10, batch 16700, loss[loss=0.2624, simple_loss=0.3802, pruned_loss=0.07229, over 19763.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3249, pruned_loss=0.08311, over 4278656.45 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:55:49,386 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.755e+02 7.026e+02 1.004e+03 1.401e+03 2.239e+03, threshold=2.007e+03, percent-clipped=0.0 2023-06-24 12:56:19,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1747032.0, ans=0.035 2023-06-24 12:56:36,232 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:56:57,507 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-24 12:57:30,025 INFO [train.py:996] (1/4) Epoch 10, batch 16750, loss[loss=0.2046, simple_loss=0.2559, pruned_loss=0.07662, over 20111.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3263, pruned_loss=0.08522, over 4274379.78 frames. ], batch size: 703, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:57:40,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1747212.0, ans=0.2 2023-06-24 12:58:09,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-24 12:59:08,100 INFO [train.py:996] (1/4) Epoch 10, batch 16800, loss[loss=0.245, simple_loss=0.3153, pruned_loss=0.08732, over 21905.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3302, pruned_loss=0.08504, over 4280300.98 frames. 
], batch size: 316, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:59:12,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.959e+02 1.077e+03 1.675e+03 3.931e+03, threshold=2.154e+03, percent-clipped=17.0 2023-06-24 12:59:22,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747512.0, ans=0.125 2023-06-24 12:59:54,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1747632.0, ans=0.2 2023-06-24 13:00:19,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747692.0, ans=0.1 2023-06-24 13:00:23,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1747692.0, ans=15.0 2023-06-24 13:00:39,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1747752.0, ans=0.0 2023-06-24 13:00:43,550 INFO [train.py:996] (1/4) Epoch 10, batch 16850, loss[loss=0.2384, simple_loss=0.3115, pruned_loss=0.08265, over 21903.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3273, pruned_loss=0.08523, over 4286993.93 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:00:52,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-24 13:01:04,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1747872.0, ans=0.2 2023-06-24 13:01:24,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.52 vs. limit=10.0 2023-06-24 13:01:51,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747992.0, ans=0.1 2023-06-24 13:01:58,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747992.0, ans=0.1 2023-06-24 13:02:05,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-24 13:02:19,556 INFO [train.py:996] (1/4) Epoch 10, batch 16900, loss[loss=0.2035, simple_loss=0.2729, pruned_loss=0.06702, over 21615.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3222, pruned_loss=0.08337, over 4288827.81 frames. 
], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:02:30,795 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 7.003e+02 1.142e+03 1.621e+03 3.220e+03, threshold=2.284e+03, percent-clipped=11.0 2023-06-24 13:02:49,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748172.0, ans=0.1 2023-06-24 13:02:53,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1748172.0, ans=0.0 2023-06-24 13:03:15,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1748232.0, ans=0.2 2023-06-24 13:03:28,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1748292.0, ans=0.95 2023-06-24 13:03:42,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1748352.0, ans=0.125 2023-06-24 13:03:52,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748352.0, ans=0.1 2023-06-24 13:03:56,480 INFO [train.py:996] (1/4) Epoch 10, batch 16950, loss[loss=0.2417, simple_loss=0.2997, pruned_loss=0.09189, over 21420.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3167, pruned_loss=0.08234, over 4288701.67 frames. ], batch size: 177, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:03:59,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-24 13:04:34,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1748472.0, ans=0.125 2023-06-24 13:05:10,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1748592.0, ans=0.0 2023-06-24 13:05:33,932 INFO [train.py:996] (1/4) Epoch 10, batch 17000, loss[loss=0.2401, simple_loss=0.3098, pruned_loss=0.08522, over 21287.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3145, pruned_loss=0.08319, over 4294759.47 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:05:44,493 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 6.732e+02 9.381e+02 1.306e+03 2.679e+03, threshold=1.876e+03, percent-clipped=4.0 2023-06-24 13:05:50,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-24 13:06:05,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1748772.0, ans=0.125 2023-06-24 13:06:14,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1748772.0, ans=0.0 2023-06-24 13:06:51,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1748892.0, ans=0.0 2023-06-24 13:07:02,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1748952.0, ans=0.125 2023-06-24 13:07:15,594 INFO [train.py:996] (1/4) Epoch 10, batch 17050, loss[loss=0.2295, simple_loss=0.3119, pruned_loss=0.07359, over 21821.00 frames. 
], tot_loss[loss=0.2454, simple_loss=0.3217, pruned_loss=0.08459, over 4294776.38 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:08:22,764 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-24 13:08:23,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1749192.0, ans=0.2 2023-06-24 13:08:26,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-24 13:08:46,025 INFO [train.py:996] (1/4) Epoch 10, batch 17100, loss[loss=0.2091, simple_loss=0.2802, pruned_loss=0.06895, over 21836.00 frames. ], tot_loss[loss=0.245, simple_loss=0.32, pruned_loss=0.08507, over 4298961.98 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:08:56,766 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.820e+02 8.115e+02 1.148e+03 1.810e+03 4.142e+03, threshold=2.296e+03, percent-clipped=21.0 2023-06-24 13:09:23,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1749372.0, ans=0.0 2023-06-24 13:09:43,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-24 13:10:08,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1749552.0, ans=0.125 2023-06-24 13:10:26,342 INFO [train.py:996] (1/4) Epoch 10, batch 17150, loss[loss=0.187, simple_loss=0.2728, pruned_loss=0.05061, over 21784.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.315, pruned_loss=0.08389, over 4304556.23 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:10:32,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749612.0, ans=0.1 2023-06-24 13:11:16,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1749732.0, ans=0.125 2023-06-24 13:11:33,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1749792.0, ans=0.0 2023-06-24 13:11:38,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1749792.0, ans=0.0 2023-06-24 13:11:55,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1749852.0, ans=0.125 2023-06-24 13:12:07,597 INFO [train.py:996] (1/4) Epoch 10, batch 17200, loss[loss=0.2275, simple_loss=0.304, pruned_loss=0.07549, over 21726.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3134, pruned_loss=0.08363, over 4302034.65 frames. 
], batch size: 298, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:12:18,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.490e+02 5.928e+02 7.581e+02 1.081e+03 2.493e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-24 13:12:28,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1749972.0, ans=0.1 2023-06-24 13:12:30,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1749972.0, ans=0.125 2023-06-24 13:13:08,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1750092.0, ans=0.0 2023-06-24 13:13:13,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1750092.0, ans=10.0 2023-06-24 13:13:21,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750092.0, ans=0.1 2023-06-24 13:13:32,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1750152.0, ans=0.125 2023-06-24 13:13:46,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1750152.0, ans=0.2 2023-06-24 13:13:52,378 INFO [train.py:996] (1/4) Epoch 10, batch 17250, loss[loss=0.2945, simple_loss=0.3602, pruned_loss=0.1144, over 21369.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3154, pruned_loss=0.0854, over 4292044.49 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:13:54,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1750212.0, ans=0.025 2023-06-24 13:14:04,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1750212.0, ans=0.125 2023-06-24 13:14:33,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750332.0, ans=0.1 2023-06-24 13:14:35,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-24 13:15:14,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1750452.0, ans=0.2 2023-06-24 13:15:21,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1750452.0, ans=0.125 2023-06-24 13:15:34,700 INFO [train.py:996] (1/4) Epoch 10, batch 17300, loss[loss=0.2466, simple_loss=0.3266, pruned_loss=0.08326, over 21702.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.325, pruned_loss=0.08926, over 4287427.24 frames. 
], batch size: 113, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:15:39,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1750512.0, ans=0.0 2023-06-24 13:15:42,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.635e+02 9.609e+02 1.379e+03 2.737e+03, threshold=1.922e+03, percent-clipped=17.0 2023-06-24 13:15:54,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1750572.0, ans=0.0 2023-06-24 13:15:57,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1750572.0, ans=0.125 2023-06-24 13:16:08,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1750632.0, ans=0.125 2023-06-24 13:16:16,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1750632.0, ans=0.1 2023-06-24 13:16:20,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1750632.0, ans=0.07 2023-06-24 13:16:29,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1750692.0, ans=0.04949747468305833 2023-06-24 13:16:35,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=22.5 2023-06-24 13:16:50,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1750752.0, ans=0.0 2023-06-24 13:16:58,730 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-24 13:17:09,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1750752.0, ans=0.125 2023-06-24 13:17:11,831 INFO [train.py:996] (1/4) Epoch 10, batch 17350, loss[loss=0.2404, simple_loss=0.3055, pruned_loss=0.08767, over 21475.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.326, pruned_loss=0.08816, over 4282090.06 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:17:12,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1750812.0, ans=0.125 2023-06-24 13:17:23,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1750812.0, ans=0.125 2023-06-24 13:18:30,380 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. limit=6.0 2023-06-24 13:18:37,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1751052.0, ans=0.0 2023-06-24 13:18:45,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1751052.0, ans=0.125 2023-06-24 13:18:49,864 INFO [train.py:996] (1/4) Epoch 10, batch 17400, loss[loss=0.2324, simple_loss=0.3153, pruned_loss=0.07477, over 21759.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.322, pruned_loss=0.08447, over 4277936.34 frames. 
], batch size: 332, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:18:57,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 6.083e+02 9.753e+02 1.322e+03 2.899e+03, threshold=1.951e+03, percent-clipped=8.0 2023-06-24 13:19:55,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-24 13:20:26,396 INFO [train.py:996] (1/4) Epoch 10, batch 17450, loss[loss=0.2067, simple_loss=0.2979, pruned_loss=0.05773, over 21784.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3193, pruned_loss=0.08207, over 4278046.63 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:21:36,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1751592.0, ans=0.125 2023-06-24 13:22:02,709 INFO [train.py:996] (1/4) Epoch 10, batch 17500, loss[loss=0.2667, simple_loss=0.3265, pruned_loss=0.1034, over 21758.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3142, pruned_loss=0.08019, over 4281434.43 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:22:12,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1751712.0, ans=0.125 2023-06-24 13:22:16,399 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.907e+02 8.163e+02 1.225e+03 3.069e+03, threshold=1.633e+03, percent-clipped=7.0 2023-06-24 13:22:19,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1751712.0, ans=0.125 2023-06-24 13:22:44,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1751772.0, ans=0.1 2023-06-24 13:22:45,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1751832.0, ans=0.2 2023-06-24 13:23:39,109 INFO [train.py:996] (1/4) Epoch 10, batch 17550, loss[loss=0.2246, simple_loss=0.3136, pruned_loss=0.06782, over 21870.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3153, pruned_loss=0.07926, over 4281448.20 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:23:41,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.35 vs. limit=6.0 2023-06-24 13:23:45,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1752012.0, ans=0.0 2023-06-24 13:23:45,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1752012.0, ans=0.125 2023-06-24 13:24:22,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1752132.0, ans=0.125 2023-06-24 13:25:03,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. limit=15.0 2023-06-24 13:25:10,559 INFO [train.py:996] (1/4) Epoch 10, batch 17600, loss[loss=0.2246, simple_loss=0.3289, pruned_loss=0.06018, over 20746.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3176, pruned_loss=0.0799, over 4265525.51 frames. 
], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:25:24,367 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.468e+02 8.117e+02 1.176e+03 4.887e+03, threshold=1.623e+03, percent-clipped=13.0 2023-06-24 13:25:25,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1752312.0, ans=0.04949747468305833 2023-06-24 13:26:13,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1752432.0, ans=0.0 2023-06-24 13:26:33,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1752552.0, ans=0.125 2023-06-24 13:26:51,781 INFO [train.py:996] (1/4) Epoch 10, batch 17650, loss[loss=0.1656, simple_loss=0.2311, pruned_loss=0.05001, over 21551.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3144, pruned_loss=0.07873, over 4274753.78 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:27:17,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1752672.0, ans=0.07 2023-06-24 13:27:56,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1752792.0, ans=0.0 2023-06-24 13:28:11,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1752852.0, ans=0.0 2023-06-24 13:28:33,528 INFO [train.py:996] (1/4) Epoch 10, batch 17700, loss[loss=0.2803, simple_loss=0.3546, pruned_loss=0.103, over 21710.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3106, pruned_loss=0.0774, over 4269785.03 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:28:40,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1752912.0, ans=0.0 2023-06-24 13:28:43,133 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-24 13:28:48,059 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 6.330e+02 1.013e+03 1.605e+03 3.260e+03, threshold=2.027e+03, percent-clipped=24.0 2023-06-24 13:29:16,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1753032.0, ans=0.0 2023-06-24 13:29:37,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1753092.0, ans=0.125 2023-06-24 13:29:38,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1753092.0, ans=0.125 2023-06-24 13:29:46,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1753092.0, ans=0.0 2023-06-24 13:29:48,472 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-24 13:30:12,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1753152.0, ans=0.125 2023-06-24 13:30:16,878 INFO [train.py:996] (1/4) Epoch 10, batch 17750, loss[loss=0.2938, simple_loss=0.3683, pruned_loss=0.1097, over 21504.00 frames. 
], tot_loss[loss=0.2377, simple_loss=0.3161, pruned_loss=0.0797, over 4273337.78 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:30:56,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=22.5 2023-06-24 13:31:01,703 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:31:51,466 INFO [train.py:996] (1/4) Epoch 10, batch 17800, loss[loss=0.1879, simple_loss=0.2785, pruned_loss=0.04864, over 21830.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3164, pruned_loss=0.07954, over 4273781.95 frames. ], batch size: 372, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:32:07,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.377e+02 6.218e+02 8.448e+02 1.392e+03 2.915e+03, threshold=1.690e+03, percent-clipped=12.0 2023-06-24 13:32:31,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1753632.0, ans=0.0 2023-06-24 13:33:28,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1753752.0, ans=0.0 2023-06-24 13:33:34,315 INFO [train.py:996] (1/4) Epoch 10, batch 17850, loss[loss=0.238, simple_loss=0.3151, pruned_loss=0.08039, over 21826.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3171, pruned_loss=0.07944, over 4274194.96 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:33:34,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1753812.0, ans=0.0 2023-06-24 13:33:53,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-24 13:34:28,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-24 13:35:12,052 INFO [train.py:996] (1/4) Epoch 10, batch 17900, loss[loss=0.2858, simple_loss=0.3654, pruned_loss=0.1031, over 21331.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3219, pruned_loss=0.08081, over 4271333.20 frames. ], batch size: 548, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:35:23,078 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 6.124e+02 9.329e+02 1.248e+03 3.216e+03, threshold=1.866e+03, percent-clipped=9.0 2023-06-24 13:35:45,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1754172.0, ans=0.125 2023-06-24 13:36:52,252 INFO [train.py:996] (1/4) Epoch 10, batch 17950, loss[loss=0.1999, simple_loss=0.291, pruned_loss=0.05441, over 21607.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3212, pruned_loss=0.07825, over 4267664.68 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:36:57,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1754412.0, ans=0.125 2023-06-24 13:38:02,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1754592.0, ans=0.5 2023-06-24 13:38:07,524 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=12.0 2023-06-24 13:38:16,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1754652.0, ans=0.05 2023-06-24 13:38:20,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1754652.0, ans=0.1 2023-06-24 13:38:28,326 INFO [train.py:996] (1/4) Epoch 10, batch 18000, loss[loss=0.2363, simple_loss=0.2974, pruned_loss=0.08754, over 21528.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3141, pruned_loss=0.07696, over 4272901.55 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:38:28,326 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 13:38:47,228 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2575, simple_loss=0.3533, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-24 13:38:47,229 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 13:39:02,588 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 8.313e+02 1.378e+03 2.030e+03 3.547e+03, threshold=2.755e+03, percent-clipped=28.0 2023-06-24 13:39:43,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-24 13:40:00,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1754952.0, ans=0.0 2023-06-24 13:40:04,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-24 13:40:06,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-06-24 13:40:19,208 INFO [train.py:996] (1/4) Epoch 10, batch 18050, loss[loss=0.1995, simple_loss=0.2778, pruned_loss=0.06058, over 21734.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3096, pruned_loss=0.07713, over 4270723.74 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:41:14,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1755132.0, ans=0.2 2023-06-24 13:41:25,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755192.0, ans=0.1 2023-06-24 13:41:37,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1755252.0, ans=0.125 2023-06-24 13:42:03,157 INFO [train.py:996] (1/4) Epoch 10, batch 18100, loss[loss=0.3103, simple_loss=0.3875, pruned_loss=0.1166, over 21463.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3141, pruned_loss=0.07926, over 4272542.81 frames. 
], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:42:12,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1755312.0, ans=0.0 2023-06-24 13:42:19,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.442e+02 6.227e+02 8.455e+02 1.236e+03 2.629e+03, threshold=1.691e+03, percent-clipped=0.0 2023-06-24 13:42:32,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1755372.0, ans=0.0 2023-06-24 13:42:59,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1755492.0, ans=0.125 2023-06-24 13:43:44,539 INFO [train.py:996] (1/4) Epoch 10, batch 18150, loss[loss=0.2227, simple_loss=0.2921, pruned_loss=0.07669, over 21713.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3161, pruned_loss=0.07931, over 4272001.55 frames. ], batch size: 333, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:44:34,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1755792.0, ans=0.2 2023-06-24 13:45:10,537 INFO [train.py:996] (1/4) Epoch 10, batch 18200, loss[loss=0.2327, simple_loss=0.3014, pruned_loss=0.08206, over 21193.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.309, pruned_loss=0.07888, over 4254239.08 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:45:30,455 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.816e+02 9.910e+02 1.570e+03 3.771e+03, threshold=1.982e+03, percent-clipped=24.0 2023-06-24 13:45:38,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1755972.0, ans=0.125 2023-06-24 13:46:24,954 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:46:28,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-24 13:46:41,646 INFO [train.py:996] (1/4) Epoch 10, batch 18250, loss[loss=0.1861, simple_loss=0.2582, pruned_loss=0.05702, over 21693.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3013, pruned_loss=0.07634, over 4246128.55 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:46:57,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1756212.0, ans=0.125 2023-06-24 13:47:20,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-24 13:47:22,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1756332.0, ans=0.0 2023-06-24 13:47:23,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1756332.0, ans=0.125 2023-06-24 13:47:38,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1756332.0, ans=0.125 2023-06-24 13:48:12,426 INFO [train.py:996] (1/4) Epoch 10, batch 18300, loss[loss=0.2745, simple_loss=0.3809, pruned_loss=0.08404, over 21715.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3043, pruned_loss=0.07789, over 4259342.65 frames. 
], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:48:23,313 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.079e+02 7.788e+02 1.352e+03 4.344e+03, threshold=1.558e+03, percent-clipped=12.0 2023-06-24 13:48:30,751 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-24 13:49:42,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-24 13:49:49,395 INFO [train.py:996] (1/4) Epoch 10, batch 18350, loss[loss=0.2436, simple_loss=0.3523, pruned_loss=0.06746, over 20733.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3088, pruned_loss=0.07669, over 4257765.51 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:49:54,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1756812.0, ans=0.125 2023-06-24 13:50:22,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756872.0, ans=0.1 2023-06-24 13:50:55,804 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-24 13:51:27,724 INFO [train.py:996] (1/4) Epoch 10, batch 18400, loss[loss=0.1916, simple_loss=0.2631, pruned_loss=0.06008, over 21600.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3037, pruned_loss=0.07524, over 4245481.49 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:51:43,858 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 6.378e+02 8.856e+02 1.210e+03 2.743e+03, threshold=1.771e+03, percent-clipped=10.0 2023-06-24 13:51:52,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1757172.0, ans=0.0 2023-06-24 13:51:57,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1757172.0, ans=0.125 2023-06-24 13:52:30,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1757292.0, ans=0.125 2023-06-24 13:52:59,707 INFO [train.py:996] (1/4) Epoch 10, batch 18450, loss[loss=0.2157, simple_loss=0.2767, pruned_loss=0.07737, over 21093.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2983, pruned_loss=0.07186, over 4243551.78 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:53:23,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1757472.0, ans=0.05 2023-06-24 13:54:35,878 INFO [train.py:996] (1/4) Epoch 10, batch 18500, loss[loss=0.2103, simple_loss=0.2752, pruned_loss=0.07269, over 21608.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2931, pruned_loss=0.07087, over 4253854.46 frames. 
], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:54:56,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1757712.0, ans=0.5 2023-06-24 13:54:57,481 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.156e+02 5.768e+02 9.363e+02 1.391e+03 2.603e+03, threshold=1.873e+03, percent-clipped=9.0 2023-06-24 13:55:01,253 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:55:04,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.18 vs. limit=15.0 2023-06-24 13:55:10,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1757772.0, ans=0.04949747468305833 2023-06-24 13:55:24,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1757832.0, ans=0.0 2023-06-24 13:55:38,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1757892.0, ans=0.125 2023-06-24 13:56:12,640 INFO [train.py:996] (1/4) Epoch 10, batch 18550, loss[loss=0.2693, simple_loss=0.3157, pruned_loss=0.1114, over 21306.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2918, pruned_loss=0.07132, over 4240589.96 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:57:27,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1758252.0, ans=0.125 2023-06-24 13:57:27,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1758252.0, ans=0.125 2023-06-24 13:57:29,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1758252.0, ans=0.0 2023-06-24 13:57:34,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.58 vs. limit=6.0 2023-06-24 13:57:46,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1758252.0, ans=0.07 2023-06-24 13:57:49,295 INFO [train.py:996] (1/4) Epoch 10, batch 18600, loss[loss=0.2166, simple_loss=0.3003, pruned_loss=0.06647, over 21781.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2885, pruned_loss=0.07054, over 4241177.93 frames. 
], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:58:12,240 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 6.384e+02 9.662e+02 1.486e+03 4.666e+03, threshold=1.932e+03, percent-clipped=18.0 2023-06-24 13:58:18,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1758372.0, ans=0.09899494936611666 2023-06-24 13:58:38,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1758432.0, ans=0.0 2023-06-24 13:58:39,712 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1758432.0, ans=0.125 2023-06-24 13:59:06,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1758552.0, ans=0.125 2023-06-24 13:59:25,724 INFO [train.py:996] (1/4) Epoch 10, batch 18650, loss[loss=0.252, simple_loss=0.3359, pruned_loss=0.08408, over 21700.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2901, pruned_loss=0.0715, over 4242131.15 frames. ], batch size: 415, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:00:05,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1758732.0, ans=0.125 2023-06-24 14:00:55,939 INFO [train.py:996] (1/4) Epoch 10, batch 18700, loss[loss=0.2061, simple_loss=0.2818, pruned_loss=0.06526, over 21946.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2895, pruned_loss=0.07303, over 4243172.65 frames. ], batch size: 113, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:01:12,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.606e+02 6.882e+02 9.841e+02 1.662e+03 3.485e+03, threshold=1.968e+03, percent-clipped=16.0 2023-06-24 14:01:30,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1758972.0, ans=0.04949747468305833 2023-06-24 14:01:50,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1759032.0, ans=0.125 2023-06-24 14:02:28,420 INFO [train.py:996] (1/4) Epoch 10, batch 18750, loss[loss=0.2574, simple_loss=0.331, pruned_loss=0.09193, over 21860.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.292, pruned_loss=0.07538, over 4243735.91 frames. ], batch size: 118, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:02:43,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1759212.0, ans=0.0 2023-06-24 14:02:46,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759212.0, ans=0.125 2023-06-24 14:03:04,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1759272.0, ans=0.1 2023-06-24 14:03:09,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759332.0, ans=0.125 2023-06-24 14:03:13,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.68 vs. 
limit=15.0 2023-06-24 14:03:14,521 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:03:17,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1759332.0, ans=0.125 2023-06-24 14:03:19,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1759332.0, ans=0.125 2023-06-24 14:03:28,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1759392.0, ans=0.125 2023-06-24 14:03:42,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759392.0, ans=0.125 2023-06-24 14:04:00,676 INFO [train.py:996] (1/4) Epoch 10, batch 18800, loss[loss=0.1862, simple_loss=0.2622, pruned_loss=0.05513, over 21159.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2982, pruned_loss=0.0767, over 4248335.48 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:04:22,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 6.431e+02 9.700e+02 1.560e+03 3.348e+03, threshold=1.940e+03, percent-clipped=15.0 2023-06-24 14:04:24,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1759572.0, ans=0.125 2023-06-24 14:05:31,413 INFO [train.py:996] (1/4) Epoch 10, batch 18850, loss[loss=0.2002, simple_loss=0.2575, pruned_loss=0.07142, over 21820.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2943, pruned_loss=0.07312, over 4232472.86 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:06:05,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-24 14:06:16,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-24 14:06:44,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1759992.0, ans=0.125 2023-06-24 14:06:55,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1760052.0, ans=0.0 2023-06-24 14:07:07,597 INFO [train.py:996] (1/4) Epoch 10, batch 18900, loss[loss=0.2327, simple_loss=0.294, pruned_loss=0.08567, over 21814.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2913, pruned_loss=0.07372, over 4246768.40 frames. ], batch size: 371, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:07:12,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1760112.0, ans=0.0 2023-06-24 14:07:14,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760112.0, ans=0.1 2023-06-24 14:07:24,198 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.602e+02 8.408e+02 1.056e+03 2.556e+03, threshold=1.682e+03, percent-clipped=3.0 2023-06-24 14:07:41,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. 
limit=22.5 2023-06-24 14:07:48,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1760172.0, ans=0.2 2023-06-24 14:07:52,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1760232.0, ans=0.04949747468305833 2023-06-24 14:08:30,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1760352.0, ans=0.2 2023-06-24 14:08:44,280 INFO [train.py:996] (1/4) Epoch 10, batch 18950, loss[loss=0.2342, simple_loss=0.3091, pruned_loss=0.07966, over 21838.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2933, pruned_loss=0.07549, over 4244341.35 frames. ], batch size: 391, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:09:49,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1760592.0, ans=0.125 2023-06-24 14:10:26,159 INFO [train.py:996] (1/4) Epoch 10, batch 19000, loss[loss=0.311, simple_loss=0.3705, pruned_loss=0.1257, over 21353.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3045, pruned_loss=0.07818, over 4247284.22 frames. ], batch size: 507, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:10:31,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1760712.0, ans=0.0 2023-06-24 14:10:49,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 8.812e+02 1.283e+03 1.934e+03 4.893e+03, threshold=2.566e+03, percent-clipped=32.0 2023-06-24 14:11:00,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1760772.0, ans=0.1 2023-06-24 14:11:30,542 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-24 14:11:35,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1760892.0, ans=0.0 2023-06-24 14:12:02,063 INFO [train.py:996] (1/4) Epoch 10, batch 19050, loss[loss=0.2743, simple_loss=0.382, pruned_loss=0.08324, over 20770.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3104, pruned_loss=0.08201, over 4253900.65 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:12:21,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1761012.0, ans=0.125 2023-06-24 14:13:08,150 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=15.0 2023-06-24 14:13:16,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1761252.0, ans=0.0 2023-06-24 14:13:32,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761252.0, ans=0.125 2023-06-24 14:13:42,732 INFO [train.py:996] (1/4) Epoch 10, batch 19100, loss[loss=0.2225, simple_loss=0.2832, pruned_loss=0.08089, over 21892.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3072, pruned_loss=0.08058, over 4245352.51 frames. 
], batch size: 107, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:13:46,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1761312.0, ans=0.125 2023-06-24 14:14:01,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.691e+02 6.507e+02 8.205e+02 1.177e+03 2.298e+03, threshold=1.641e+03, percent-clipped=0.0 2023-06-24 14:14:28,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1761432.0, ans=0.125 2023-06-24 14:14:43,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1761492.0, ans=0.0 2023-06-24 14:15:25,868 INFO [train.py:996] (1/4) Epoch 10, batch 19150, loss[loss=0.2415, simple_loss=0.3303, pruned_loss=0.07635, over 21578.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3098, pruned_loss=0.08141, over 4253922.94 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:16:04,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1761732.0, ans=0.0 2023-06-24 14:16:33,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.12 vs. limit=22.5 2023-06-24 14:16:43,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1761792.0, ans=0.2 2023-06-24 14:17:06,222 INFO [train.py:996] (1/4) Epoch 10, batch 19200, loss[loss=0.2254, simple_loss=0.3258, pruned_loss=0.06246, over 21341.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3186, pruned_loss=0.08118, over 4255610.12 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:17:14,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-24 14:17:22,001 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.091e+02 1.026e+03 1.602e+03 3.229e+03, threshold=2.053e+03, percent-clipped=24.0 2023-06-24 14:17:36,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1761972.0, ans=0.0 2023-06-24 14:17:46,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1762032.0, ans=0.125 2023-06-24 14:18:09,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1762092.0, ans=0.0 2023-06-24 14:18:11,115 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:18:22,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-24 14:18:23,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1762152.0, ans=0.2 2023-06-24 14:18:44,962 INFO [train.py:996] (1/4) Epoch 10, batch 19250, loss[loss=0.176, simple_loss=0.2727, pruned_loss=0.03966, over 21723.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3161, pruned_loss=0.07576, over 4265140.22 frames. 
], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:18:48,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1762212.0, ans=0.1 2023-06-24 14:19:12,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1762272.0, ans=0.0 2023-06-24 14:19:29,729 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:20:03,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1762392.0, ans=0.2 2023-06-24 14:20:17,391 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-24 14:20:20,877 INFO [train.py:996] (1/4) Epoch 10, batch 19300, loss[loss=0.228, simple_loss=0.2932, pruned_loss=0.08142, over 21823.00 frames. ], tot_loss[loss=0.232, simple_loss=0.313, pruned_loss=0.07552, over 4277475.32 frames. ], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:20:36,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.942e+02 6.247e+02 8.864e+02 1.177e+03 3.202e+03, threshold=1.773e+03, percent-clipped=6.0 2023-06-24 14:20:55,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-24 14:21:35,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1762692.0, ans=0.0 2023-06-24 14:21:37,049 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1762692.0, ans=0.2 2023-06-24 14:21:38,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1762692.0, ans=0.125 2023-06-24 14:22:00,017 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-24 14:22:00,315 INFO [train.py:996] (1/4) Epoch 10, batch 19350, loss[loss=0.1901, simple_loss=0.2662, pruned_loss=0.05703, over 21239.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3082, pruned_loss=0.07303, over 4283749.98 frames. ], batch size: 159, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:22:26,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1762872.0, ans=0.025 2023-06-24 14:22:26,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-24 14:22:45,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1762932.0, ans=15.0 2023-06-24 14:22:50,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1762932.0, ans=0.125 2023-06-24 14:23:36,228 INFO [train.py:996] (1/4) Epoch 10, batch 19400, loss[loss=0.2855, simple_loss=0.3489, pruned_loss=0.1111, over 21609.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3069, pruned_loss=0.07299, over 4283338.73 frames. 
], batch size: 471, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:23:57,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763172.0, ans=0.1 2023-06-24 14:23:58,119 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.962e+02 7.006e+02 1.136e+03 1.736e+03 4.231e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 14:24:17,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763232.0, ans=0.1 2023-06-24 14:25:11,471 INFO [train.py:996] (1/4) Epoch 10, batch 19450, loss[loss=0.2233, simple_loss=0.2883, pruned_loss=0.0791, over 21359.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3037, pruned_loss=0.07451, over 4285053.98 frames. ], batch size: 143, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:25:19,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1763412.0, ans=0.125 2023-06-24 14:25:52,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1763532.0, ans=0.125 2023-06-24 14:26:02,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1763532.0, ans=0.125 2023-06-24 14:26:48,818 INFO [train.py:996] (1/4) Epoch 10, batch 19500, loss[loss=0.2324, simple_loss=0.309, pruned_loss=0.07791, over 21698.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.301, pruned_loss=0.07538, over 4280401.39 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:26:49,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1763712.0, ans=0.125 2023-06-24 14:27:11,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 6.380e+02 1.047e+03 1.511e+03 3.799e+03, threshold=2.095e+03, percent-clipped=7.0 2023-06-24 14:27:47,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1763832.0, ans=0.125 2023-06-24 14:28:10,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1763952.0, ans=0.125 2023-06-24 14:28:25,384 INFO [train.py:996] (1/4) Epoch 10, batch 19550, loss[loss=0.2246, simple_loss=0.3162, pruned_loss=0.06652, over 21787.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2966, pruned_loss=0.07404, over 4270635.53 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:29:01,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.38 vs. 
limit=15.0 2023-06-24 14:29:05,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1764132.0, ans=0.0 2023-06-24 14:29:45,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1764252.0, ans=0.125 2023-06-24 14:29:49,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1764252.0, ans=0.0 2023-06-24 14:29:55,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1764252.0, ans=0.2 2023-06-24 14:30:01,064 INFO [train.py:996] (1/4) Epoch 10, batch 19600, loss[loss=0.291, simple_loss=0.3446, pruned_loss=0.1187, over 21645.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2983, pruned_loss=0.07546, over 4273233.19 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:30:28,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.111e+02 6.531e+02 1.025e+03 1.412e+03 3.718e+03, threshold=2.049e+03, percent-clipped=12.0 2023-06-24 14:30:38,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-24 14:30:40,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1764432.0, ans=0.125 2023-06-24 14:30:56,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1764432.0, ans=0.2 2023-06-24 14:30:57,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1764432.0, ans=0.1 2023-06-24 14:31:11,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1764492.0, ans=0.0 2023-06-24 14:31:21,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1764552.0, ans=0.125 2023-06-24 14:31:27,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1764552.0, ans=0.125 2023-06-24 14:31:38,435 INFO [train.py:996] (1/4) Epoch 10, batch 19650, loss[loss=0.2221, simple_loss=0.2958, pruned_loss=0.07421, over 21427.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3017, pruned_loss=0.07831, over 4275383.36 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:31:40,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1764612.0, ans=0.125 2023-06-24 14:32:55,052 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.90 vs. limit=22.5 2023-06-24 14:33:04,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1764852.0, ans=0.035 2023-06-24 14:33:11,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-24 14:33:28,006 INFO [train.py:996] (1/4) Epoch 10, batch 19700, loss[loss=0.2638, simple_loss=0.3543, pruned_loss=0.08666, over 21308.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3059, pruned_loss=0.07888, over 4273795.38 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:33:54,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 9.383e+02 1.272e+03 2.018e+03 4.455e+03, threshold=2.544e+03, percent-clipped=24.0 2023-06-24 14:34:00,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-24 14:34:02,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1764972.0, ans=0.0 2023-06-24 14:34:24,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1765092.0, ans=0.0 2023-06-24 14:34:27,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1765092.0, ans=0.125 2023-06-24 14:34:40,793 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-24 14:35:08,176 INFO [train.py:996] (1/4) Epoch 10, batch 19750, loss[loss=0.2963, simple_loss=0.4072, pruned_loss=0.09263, over 21255.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3165, pruned_loss=0.08051, over 4277162.82 frames. ], batch size: 549, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:35:25,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1765212.0, ans=12.0 2023-06-24 14:36:28,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1765452.0, ans=15.0 2023-06-24 14:36:47,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1765512.0, ans=0.09899494936611666 2023-06-24 14:36:49,189 INFO [train.py:996] (1/4) Epoch 10, batch 19800, loss[loss=0.2251, simple_loss=0.3008, pruned_loss=0.07467, over 21910.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3157, pruned_loss=0.08085, over 4287994.80 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:36:52,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1765512.0, ans=0.0 2023-06-24 14:36:57,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=15.0 2023-06-24 14:37:10,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1765572.0, ans=0.0 2023-06-24 14:37:11,181 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.026e+02 9.250e+02 1.586e+03 2.402e+03 4.902e+03, threshold=3.172e+03, percent-clipped=21.0 2023-06-24 14:38:01,703 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-06-24 14:38:05,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1765752.0, ans=0.025 2023-06-24 14:38:07,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1765752.0, ans=0.125 2023-06-24 14:38:27,740 INFO [train.py:996] (1/4) Epoch 10, batch 19850, loss[loss=0.2234, simple_loss=0.3089, pruned_loss=0.06895, over 21699.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3069, pruned_loss=0.07587, over 4281613.61 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:38:34,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765812.0, ans=0.1 2023-06-24 14:38:35,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1765812.0, ans=0.125 2023-06-24 14:38:46,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765872.0, ans=0.1 2023-06-24 14:38:51,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765872.0, ans=0.1 2023-06-24 14:39:22,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1765992.0, ans=0.125 2023-06-24 14:39:49,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766052.0, ans=0.125 2023-06-24 14:40:03,690 INFO [train.py:996] (1/4) Epoch 10, batch 19900, loss[loss=0.2181, simple_loss=0.2879, pruned_loss=0.07412, over 21517.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3086, pruned_loss=0.07406, over 4286481.71 frames. ], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:40:16,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1766112.0, ans=0.0 2023-06-24 14:40:17,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-24 14:40:20,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.106e+02 8.584e+02 1.585e+03 3.373e+03, threshold=1.717e+03, percent-clipped=1.0 2023-06-24 14:40:23,145 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:40:30,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1766172.0, ans=0.125 2023-06-24 14:40:46,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1766232.0, ans=0.125 2023-06-24 14:41:36,322 INFO [train.py:996] (1/4) Epoch 10, batch 19950, loss[loss=0.1914, simple_loss=0.267, pruned_loss=0.05793, over 21507.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.304, pruned_loss=0.0736, over 4274746.48 frames. 
], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:41:38,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1766412.0, ans=0.0 2023-06-24 14:41:58,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1766472.0, ans=0.0 2023-06-24 14:42:52,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-24 14:43:02,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-24 14:43:09,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-24 14:43:12,278 INFO [train.py:996] (1/4) Epoch 10, batch 20000, loss[loss=0.2056, simple_loss=0.2725, pruned_loss=0.0694, over 20753.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3064, pruned_loss=0.07421, over 4270410.27 frames. ], batch size: 608, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:43:27,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1766772.0, ans=0.0 2023-06-24 14:43:29,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.593e+02 7.271e+02 1.092e+03 1.631e+03 3.154e+03, threshold=2.184e+03, percent-clipped=18.0 2023-06-24 14:44:02,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1766832.0, ans=0.125 2023-06-24 14:44:16,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1766892.0, ans=10.0 2023-06-24 14:44:30,489 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:44:47,357 INFO [train.py:996] (1/4) Epoch 10, batch 20050, loss[loss=0.2236, simple_loss=0.3003, pruned_loss=0.07344, over 21865.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3068, pruned_loss=0.07594, over 4269114.06 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:44:52,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.27 vs. limit=10.0 2023-06-24 14:45:00,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-24 14:45:02,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1767072.0, ans=0.5 2023-06-24 14:45:14,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1767072.0, ans=0.2 2023-06-24 14:45:27,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.84 vs. limit=10.0 2023-06-24 14:45:59,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-24 14:46:26,774 INFO [train.py:996] (1/4) Epoch 10, batch 20100, loss[loss=0.2871, simple_loss=0.3806, pruned_loss=0.0968, over 21685.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3098, pruned_loss=0.07832, over 4281211.10 frames. 
], batch size: 389, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:46:27,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1767312.0, ans=0.125 2023-06-24 14:46:34,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-24 14:46:51,357 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.584e+02 6.086e+02 7.806e+02 1.176e+03 2.985e+03, threshold=1.561e+03, percent-clipped=5.0 2023-06-24 14:46:53,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1767372.0, ans=0.2 2023-06-24 14:47:00,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1767372.0, ans=0.1 2023-06-24 14:47:09,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 14:47:11,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1767432.0, ans=0.0 2023-06-24 14:48:00,013 INFO [train.py:996] (1/4) Epoch 10, batch 20150, loss[loss=0.2698, simple_loss=0.339, pruned_loss=0.1003, over 21710.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3178, pruned_loss=0.08209, over 4280570.64 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:49:43,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-24 14:49:51,321 INFO [train.py:996] (1/4) Epoch 10, batch 20200, loss[loss=0.1817, simple_loss=0.2161, pruned_loss=0.0736, over 16118.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3245, pruned_loss=0.08516, over 4276634.44 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:50:10,614 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.305e+02 1.166e+03 1.860e+03 3.941e+03, threshold=2.331e+03, percent-clipped=33.0 2023-06-24 14:50:41,225 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-24 14:50:43,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1768032.0, ans=0.0 2023-06-24 14:50:49,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-24 14:50:50,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1768092.0, ans=0.2 2023-06-24 14:50:56,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1768092.0, ans=0.125 2023-06-24 14:51:29,551 INFO [train.py:996] (1/4) Epoch 10, batch 20250, loss[loss=0.228, simple_loss=0.3195, pruned_loss=0.0683, over 21692.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3263, pruned_loss=0.08452, over 4276001.62 frames. 
], batch size: 389, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:51:56,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768272.0, ans=0.1 2023-06-24 14:51:58,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 14:52:09,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-24 14:52:27,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1768392.0, ans=0.0 2023-06-24 14:52:36,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1768392.0, ans=0.125 2023-06-24 14:52:39,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1768452.0, ans=10.0 2023-06-24 14:52:39,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1768452.0, ans=0.0 2023-06-24 14:53:02,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1768452.0, ans=0.1 2023-06-24 14:53:05,423 INFO [train.py:996] (1/4) Epoch 10, batch 20300, loss[loss=0.2026, simple_loss=0.2774, pruned_loss=0.06391, over 21899.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3233, pruned_loss=0.0815, over 4280430.64 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:53:28,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 6.165e+02 8.569e+02 1.423e+03 2.886e+03, threshold=1.714e+03, percent-clipped=5.0 2023-06-24 14:54:28,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1768752.0, ans=0.125 2023-06-24 14:54:41,381 INFO [train.py:996] (1/4) Epoch 10, batch 20350, loss[loss=0.2264, simple_loss=0.2975, pruned_loss=0.07763, over 21887.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3239, pruned_loss=0.08175, over 4273771.97 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:56:19,307 INFO [train.py:996] (1/4) Epoch 10, batch 20400, loss[loss=0.3299, simple_loss=0.3906, pruned_loss=0.1346, over 21424.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3268, pruned_loss=0.08513, over 4279433.00 frames. ], batch size: 508, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:56:26,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1769112.0, ans=0.0 2023-06-24 14:56:42,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 7.746e+02 1.148e+03 1.668e+03 3.679e+03, threshold=2.297e+03, percent-clipped=22.0 2023-06-24 14:56:44,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1769172.0, ans=0.125 2023-06-24 14:57:07,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-06-24 14:57:08,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1769232.0, ans=0.0 2023-06-24 14:57:55,687 INFO [train.py:996] (1/4) Epoch 10, batch 20450, loss[loss=0.2829, simple_loss=0.3451, pruned_loss=0.1103, over 21823.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3273, pruned_loss=0.08716, over 4285406.93 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:58:05,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1769412.0, ans=10.0 2023-06-24 14:58:32,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1769472.0, ans=0.125 2023-06-24 14:58:32,648 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=12.0 2023-06-24 14:59:20,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1769652.0, ans=0.0 2023-06-24 14:59:32,138 INFO [train.py:996] (1/4) Epoch 10, batch 20500, loss[loss=0.2604, simple_loss=0.32, pruned_loss=0.1004, over 21715.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3228, pruned_loss=0.08654, over 4284055.61 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:59:34,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1769712.0, ans=0.015 2023-06-24 14:59:42,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-24 15:00:01,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 6.913e+02 8.863e+02 1.328e+03 2.262e+03, threshold=1.773e+03, percent-clipped=0.0 2023-06-24 15:00:03,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1769772.0, ans=0.1 2023-06-24 15:00:08,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1769772.0, ans=0.125 2023-06-24 15:00:22,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1769832.0, ans=0.125 2023-06-24 15:01:09,479 INFO [train.py:996] (1/4) Epoch 10, batch 20550, loss[loss=0.2382, simple_loss=0.3097, pruned_loss=0.0833, over 21572.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3153, pruned_loss=0.0852, over 4269982.97 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:01:28,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1770072.0, ans=0.125 2023-06-24 15:01:38,160 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:01:39,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1770072.0, ans=0.0 2023-06-24 15:02:37,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-24 15:02:46,094 INFO [train.py:996] (1/4) Epoch 10, batch 20600, loss[loss=0.2228, simple_loss=0.2925, pruned_loss=0.07653, over 21738.00 frames. 
], tot_loss[loss=0.2407, simple_loss=0.3155, pruned_loss=0.08293, over 4277545.36 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:02:46,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1770312.0, ans=0.0 2023-06-24 15:03:15,189 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 6.737e+02 1.120e+03 2.042e+03 4.837e+03, threshold=2.240e+03, percent-clipped=29.0 2023-06-24 15:03:59,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-24 15:04:15,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1770552.0, ans=0.125 2023-06-24 15:04:21,170 INFO [train.py:996] (1/4) Epoch 10, batch 20650, loss[loss=0.2639, simple_loss=0.3188, pruned_loss=0.1045, over 21472.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3111, pruned_loss=0.08252, over 4272269.84 frames. ], batch size: 508, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:04:36,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770612.0, ans=0.1 2023-06-24 15:05:58,603 INFO [train.py:996] (1/4) Epoch 10, batch 20700, loss[loss=0.1734, simple_loss=0.2569, pruned_loss=0.04497, over 21404.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3038, pruned_loss=0.07875, over 4258633.14 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:06:10,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1770912.0, ans=0.125 2023-06-24 15:06:23,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 6.389e+02 9.253e+02 1.399e+03 2.647e+03, threshold=1.851e+03, percent-clipped=4.0 2023-06-24 15:06:39,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=22.5 2023-06-24 15:06:54,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1771092.0, ans=0.125 2023-06-24 15:07:42,283 INFO [train.py:996] (1/4) Epoch 10, batch 20750, loss[loss=0.2702, simple_loss=0.3517, pruned_loss=0.09431, over 21403.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3068, pruned_loss=0.07864, over 4253025.76 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:07:44,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-06-24 15:08:04,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-24 15:08:09,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1771272.0, ans=0.05 2023-06-24 15:08:30,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1771332.0, ans=0.125 2023-06-24 15:08:32,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. 
limit=15.0 2023-06-24 15:09:20,756 INFO [train.py:996] (1/4) Epoch 10, batch 20800, loss[loss=0.1982, simple_loss=0.2751, pruned_loss=0.06063, over 21596.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3106, pruned_loss=0.07883, over 4253413.53 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 15:09:30,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1771512.0, ans=0.125 2023-06-24 15:09:47,346 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.552e+02 1.010e+03 1.567e+03 2.337e+03 4.966e+03, threshold=3.135e+03, percent-clipped=39.0 2023-06-24 15:10:46,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1771752.0, ans=0.125 2023-06-24 15:10:54,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1771752.0, ans=0.1 2023-06-24 15:10:57,059 INFO [train.py:996] (1/4) Epoch 10, batch 20850, loss[loss=0.248, simple_loss=0.3176, pruned_loss=0.08921, over 21753.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3054, pruned_loss=0.07742, over 4253266.44 frames. ], batch size: 112, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:11:18,238 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:11:39,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1771932.0, ans=0.125 2023-06-24 15:11:50,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1771932.0, ans=0.125 2023-06-24 15:12:23,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1772052.0, ans=0.125 2023-06-24 15:12:33,602 INFO [train.py:996] (1/4) Epoch 10, batch 20900, loss[loss=0.2417, simple_loss=0.3084, pruned_loss=0.08751, over 21849.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3065, pruned_loss=0.07859, over 4267925.42 frames. ], batch size: 391, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:12:51,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1772112.0, ans=0.125 2023-06-24 15:12:59,930 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 6.808e+02 1.169e+03 1.577e+03 3.825e+03, threshold=2.338e+03, percent-clipped=3.0 2023-06-24 15:13:00,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1772172.0, ans=0.125 2023-06-24 15:13:17,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1772232.0, ans=0.125 2023-06-24 15:13:29,064 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.48 vs. limit=15.0 2023-06-24 15:13:29,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1772232.0, ans=0.0 2023-06-24 15:14:09,075 INFO [train.py:996] (1/4) Epoch 10, batch 20950, loss[loss=0.1729, simple_loss=0.2502, pruned_loss=0.04781, over 21296.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3007, pruned_loss=0.07487, over 4265278.45 frames. 
], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:14:09,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1772412.0, ans=0.0 2023-06-24 15:15:16,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1772592.0, ans=0.125 2023-06-24 15:15:29,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1772652.0, ans=0.125 2023-06-24 15:15:44,305 INFO [train.py:996] (1/4) Epoch 10, batch 21000, loss[loss=0.23, simple_loss=0.2966, pruned_loss=0.08168, over 21440.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2999, pruned_loss=0.07499, over 4258204.79 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:15:44,306 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 15:16:03,225 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2634, simple_loss=0.3598, pruned_loss=0.08347, over 1796401.00 frames. 2023-06-24 15:16:03,226 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 15:16:24,562 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.763e+02 8.645e+02 1.170e+03 2.024e+03, threshold=1.729e+03, percent-clipped=0.0 2023-06-24 15:16:36,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1772772.0, ans=0.125 2023-06-24 15:16:47,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1772832.0, ans=0.0 2023-06-24 15:17:09,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1772892.0, ans=0.1 2023-06-24 15:17:20,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1772952.0, ans=0.0 2023-06-24 15:17:27,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1772952.0, ans=0.125 2023-06-24 15:17:33,530 INFO [train.py:996] (1/4) Epoch 10, batch 21050, loss[loss=0.2801, simple_loss=0.3212, pruned_loss=0.1195, over 21479.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2983, pruned_loss=0.0759, over 4263193.61 frames. ], batch size: 508, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:17:58,615 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:18:40,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1773192.0, ans=0.125 2023-06-24 15:18:46,641 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-24 15:19:03,205 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:19:08,750 INFO [train.py:996] (1/4) Epoch 10, batch 21100, loss[loss=0.198, simple_loss=0.2763, pruned_loss=0.05983, over 21681.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2944, pruned_loss=0.07556, over 4262992.60 frames. 
], batch size: 333, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:19:30,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.11 vs. limit=10.0 2023-06-24 15:19:36,719 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.103e+02 5.905e+02 7.931e+02 1.116e+03 2.788e+03, threshold=1.586e+03, percent-clipped=2.0 2023-06-24 15:19:48,624 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-24 15:20:33,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1773552.0, ans=0.125 2023-06-24 15:20:45,216 INFO [train.py:996] (1/4) Epoch 10, batch 21150, loss[loss=0.2304, simple_loss=0.2912, pruned_loss=0.08477, over 16057.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2896, pruned_loss=0.07579, over 4250342.03 frames. ], batch size: 61, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:21:34,187 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773732.0, ans=0.1 2023-06-24 15:21:50,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-24 15:21:54,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1773792.0, ans=0.2 2023-06-24 15:22:01,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1773852.0, ans=0.0 2023-06-24 15:22:15,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1773852.0, ans=0.0 2023-06-24 15:22:21,503 INFO [train.py:996] (1/4) Epoch 10, batch 21200, loss[loss=0.1864, simple_loss=0.2548, pruned_loss=0.05901, over 21246.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2854, pruned_loss=0.07459, over 4254253.66 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:22:49,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 6.353e+02 8.503e+02 1.111e+03 2.488e+03, threshold=1.701e+03, percent-clipped=3.0 2023-06-24 15:23:22,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774092.0, ans=0.1 2023-06-24 15:23:35,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1774092.0, ans=0.2 2023-06-24 15:23:57,720 INFO [train.py:996] (1/4) Epoch 10, batch 21250, loss[loss=0.2305, simple_loss=0.2973, pruned_loss=0.0818, over 21640.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2852, pruned_loss=0.07488, over 4255074.53 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:24:58,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1774392.0, ans=0.125 2023-06-24 15:25:25,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1774452.0, ans=0.125 2023-06-24 15:25:32,741 INFO [train.py:996] (1/4) Epoch 10, batch 21300, loss[loss=0.2767, simple_loss=0.3511, pruned_loss=0.1011, over 21820.00 frames. 
], tot_loss[loss=0.2212, simple_loss=0.2905, pruned_loss=0.07598, over 4263860.43 frames. ], batch size: 391, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:25:36,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1774512.0, ans=0.0 2023-06-24 15:25:59,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774572.0, ans=0.1 2023-06-24 15:26:02,204 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 7.124e+02 9.830e+02 1.357e+03 3.184e+03, threshold=1.966e+03, percent-clipped=15.0 2023-06-24 15:27:10,410 INFO [train.py:996] (1/4) Epoch 10, batch 21350, loss[loss=0.2109, simple_loss=0.3005, pruned_loss=0.06067, over 21768.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2944, pruned_loss=0.07668, over 4272290.39 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:27:10,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1774812.0, ans=0.2 2023-06-24 15:27:20,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1774812.0, ans=0.125 2023-06-24 15:27:21,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1774812.0, ans=0.125 2023-06-24 15:27:22,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-24 15:27:52,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1774932.0, ans=0.04949747468305833 2023-06-24 15:27:57,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774932.0, ans=0.1 2023-06-24 15:28:14,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1774992.0, ans=0.0 2023-06-24 15:28:30,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1774992.0, ans=0.0 2023-06-24 15:28:39,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1775052.0, ans=0.2 2023-06-24 15:28:45,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1775052.0, ans=0.125 2023-06-24 15:28:45,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1775052.0, ans=0.0 2023-06-24 15:28:48,143 INFO [train.py:996] (1/4) Epoch 10, batch 21400, loss[loss=0.2207, simple_loss=0.3201, pruned_loss=0.06068, over 20982.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2981, pruned_loss=0.07661, over 4275781.28 frames. 
], batch size: 607, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:29:23,246 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 5.834e+02 7.962e+02 1.308e+03 2.363e+03, threshold=1.592e+03, percent-clipped=6.0 2023-06-24 15:29:26,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775172.0, ans=0.1 2023-06-24 15:29:41,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1775232.0, ans=0.1 2023-06-24 15:29:45,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1775232.0, ans=0.0 2023-06-24 15:30:24,977 INFO [train.py:996] (1/4) Epoch 10, batch 21450, loss[loss=0.2686, simple_loss=0.3341, pruned_loss=0.1016, over 21870.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.303, pruned_loss=0.07898, over 4283689.70 frames. ], batch size: 371, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:30:39,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775412.0, ans=0.125 2023-06-24 15:30:47,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1775412.0, ans=0.0 2023-06-24 15:31:59,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.18 vs. limit=10.0 2023-06-24 15:32:06,988 INFO [train.py:996] (1/4) Epoch 10, batch 21500, loss[loss=0.2355, simple_loss=0.2972, pruned_loss=0.08689, over 15771.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3026, pruned_loss=0.07969, over 4271447.34 frames. ], batch size: 64, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:32:13,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1775712.0, ans=0.125 2023-06-24 15:32:36,251 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.584e+02 1.027e+03 1.446e+03 3.225e+03, threshold=2.054e+03, percent-clipped=19.0 2023-06-24 15:33:19,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1775892.0, ans=0.2 2023-06-24 15:33:33,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1775952.0, ans=0.125 2023-06-24 15:33:38,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1776012.0, ans=0.09899494936611666 2023-06-24 15:33:45,358 INFO [train.py:996] (1/4) Epoch 10, batch 21550, loss[loss=0.2418, simple_loss=0.2935, pruned_loss=0.09508, over 21373.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2957, pruned_loss=0.07724, over 4277599.08 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:35:03,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1776252.0, ans=0.2 2023-06-24 15:35:22,995 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-24 15:35:25,112 INFO [train.py:996] (1/4) Epoch 10, batch 21600, loss[loss=0.2103, simple_loss=0.2994, pruned_loss=0.06063, over 21613.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2934, pruned_loss=0.07683, over 4273412.90 frames. 
], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:35:39,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1776312.0, ans=0.2 2023-06-24 15:35:41,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1776312.0, ans=0.125 2023-06-24 15:36:01,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 8.052e+02 1.212e+03 1.997e+03 4.912e+03, threshold=2.424e+03, percent-clipped=21.0 2023-06-24 15:36:11,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1776432.0, ans=0.125 2023-06-24 15:36:11,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1776432.0, ans=0.0 2023-06-24 15:36:13,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-24 15:36:31,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1776492.0, ans=0.0 2023-06-24 15:36:59,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1776552.0, ans=0.0 2023-06-24 15:37:01,710 INFO [train.py:996] (1/4) Epoch 10, batch 21650, loss[loss=0.2294, simple_loss=0.331, pruned_loss=0.06389, over 19922.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2967, pruned_loss=0.07495, over 4267338.20 frames. ], batch size: 703, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:37:38,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-24 15:38:03,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1776792.0, ans=0.0 2023-06-24 15:38:08,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1776792.0, ans=0.125 2023-06-24 15:38:14,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1776792.0, ans=0.2 2023-06-24 15:38:27,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1776852.0, ans=0.125 2023-06-24 15:38:37,623 INFO [train.py:996] (1/4) Epoch 10, batch 21700, loss[loss=0.1981, simple_loss=0.2743, pruned_loss=0.06098, over 21956.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2937, pruned_loss=0.07159, over 4254691.70 frames. 
], batch size: 113, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:38:47,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1776912.0, ans=0.0 2023-06-24 15:38:56,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1776972.0, ans=0.125 2023-06-24 15:38:58,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1776972.0, ans=0.2 2023-06-24 15:39:12,929 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.660e+02 9.555e+02 1.550e+03 3.491e+03, threshold=1.911e+03, percent-clipped=7.0 2023-06-24 15:39:35,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777032.0, ans=0.1 2023-06-24 15:39:42,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777092.0, ans=0.1 2023-06-24 15:39:54,214 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-24 15:40:01,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1777152.0, ans=0.2 2023-06-24 15:40:12,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1777212.0, ans=0.125 2023-06-24 15:40:13,241 INFO [train.py:996] (1/4) Epoch 10, batch 21750, loss[loss=0.2819, simple_loss=0.314, pruned_loss=0.1249, over 21428.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.29, pruned_loss=0.07184, over 4257197.38 frames. ], batch size: 511, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:40:52,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1777332.0, ans=0.5 2023-06-24 15:41:12,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1777392.0, ans=0.0 2023-06-24 15:41:16,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-24 15:41:27,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1777392.0, ans=0.125 2023-06-24 15:41:50,266 INFO [train.py:996] (1/4) Epoch 10, batch 21800, loss[loss=0.2194, simple_loss=0.3389, pruned_loss=0.04996, over 20853.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2903, pruned_loss=0.07317, over 4258650.74 frames. 
], batch size: 607, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:41:50,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1777512.0, ans=0.0 2023-06-24 15:42:08,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1777512.0, ans=0.125 2023-06-24 15:42:25,933 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 6.460e+02 8.682e+02 1.144e+03 2.406e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 15:42:32,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1777632.0, ans=0.0 2023-06-24 15:42:59,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1777692.0, ans=0.125 2023-06-24 15:43:13,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1777752.0, ans=0.125 2023-06-24 15:43:25,427 INFO [train.py:996] (1/4) Epoch 10, batch 21850, loss[loss=0.2225, simple_loss=0.2811, pruned_loss=0.08197, over 21360.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2978, pruned_loss=0.07518, over 4263192.23 frames. ], batch size: 177, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:43:40,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1777812.0, ans=0.125 2023-06-24 15:44:11,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1777932.0, ans=0.125 2023-06-24 15:44:36,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1777992.0, ans=0.125 2023-06-24 15:45:02,995 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1778052.0, ans=0.1 2023-06-24 15:45:05,297 INFO [train.py:996] (1/4) Epoch 10, batch 21900, loss[loss=0.1765, simple_loss=0.2269, pruned_loss=0.06302, over 20770.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2967, pruned_loss=0.07591, over 4259732.38 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:45:36,506 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 8.112e+02 1.126e+03 1.862e+03 4.122e+03, threshold=2.252e+03, percent-clipped=27.0 2023-06-24 15:45:46,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-24 15:45:49,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1778232.0, ans=10.0 2023-06-24 15:46:00,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1778292.0, ans=10.0 2023-06-24 15:46:25,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-24 15:46:42,343 INFO [train.py:996] (1/4) Epoch 10, batch 21950, loss[loss=0.1733, simple_loss=0.241, pruned_loss=0.05284, over 21204.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2911, pruned_loss=0.07345, over 4249865.37 frames. 
], batch size: 159, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:46:58,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-24 15:48:18,387 INFO [train.py:996] (1/4) Epoch 10, batch 22000, loss[loss=0.2194, simple_loss=0.2821, pruned_loss=0.07838, over 21374.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2871, pruned_loss=0.07174, over 4254596.24 frames. ], batch size: 160, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:48:22,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1778712.0, ans=0.2 2023-06-24 15:48:36,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1778712.0, ans=0.0 2023-06-24 15:48:54,793 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 5.065e+02 6.961e+02 1.085e+03 3.109e+03, threshold=1.392e+03, percent-clipped=2.0 2023-06-24 15:49:55,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1778952.0, ans=0.125 2023-06-24 15:50:02,076 INFO [train.py:996] (1/4) Epoch 10, batch 22050, loss[loss=0.2124, simple_loss=0.2979, pruned_loss=0.06341, over 21694.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2928, pruned_loss=0.07353, over 4253093.50 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:50:04,678 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 15:50:09,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.63 vs. limit=22.5 2023-06-24 15:50:29,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1779072.0, ans=0.2 2023-06-24 15:51:20,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779252.0, ans=0.1 2023-06-24 15:51:38,376 INFO [train.py:996] (1/4) Epoch 10, batch 22100, loss[loss=0.2809, simple_loss=0.3566, pruned_loss=0.1026, over 21327.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.301, pruned_loss=0.07774, over 4261933.18 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:52:09,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.723e+02 7.259e+02 1.034e+03 1.568e+03 3.837e+03, threshold=2.069e+03, percent-clipped=34.0 2023-06-24 15:52:16,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-24 15:53:16,271 INFO [train.py:996] (1/4) Epoch 10, batch 22150, loss[loss=0.2942, simple_loss=0.3445, pruned_loss=0.1219, over 21746.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3042, pruned_loss=0.07978, over 4272432.20 frames. 
], batch size: 508, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:53:16,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1779612.0, ans=0.125 2023-06-24 15:53:20,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1779612.0, ans=0.0 2023-06-24 15:54:19,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1779792.0, ans=0.0 2023-06-24 15:54:54,167 INFO [train.py:996] (1/4) Epoch 10, batch 22200, loss[loss=0.3022, simple_loss=0.3816, pruned_loss=0.1115, over 21564.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3063, pruned_loss=0.08088, over 4276468.02 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:55:17,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1779972.0, ans=0.125 2023-06-24 15:55:24,792 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.933e+02 1.129e+03 1.517e+03 2.505e+03, threshold=2.259e+03, percent-clipped=10.0 2023-06-24 15:55:44,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1780032.0, ans=0.2 2023-06-24 15:56:31,915 INFO [train.py:996] (1/4) Epoch 10, batch 22250, loss[loss=0.2987, simple_loss=0.3618, pruned_loss=0.1178, over 21584.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3125, pruned_loss=0.08225, over 4284524.47 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:56:35,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1780212.0, ans=0.125 2023-06-24 15:56:52,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1780272.0, ans=0.125 2023-06-24 15:57:15,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-24 15:58:01,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1780452.0, ans=0.0 2023-06-24 15:58:05,617 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1780512.0, ans=0.125 2023-06-24 15:58:06,591 INFO [train.py:996] (1/4) Epoch 10, batch 22300, loss[loss=0.2071, simple_loss=0.2817, pruned_loss=0.0662, over 21916.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3125, pruned_loss=0.08287, over 4290791.42 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:58:37,334 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.918e+02 7.899e+02 1.099e+03 1.483e+03 2.745e+03, threshold=2.199e+03, percent-clipped=4.0 2023-06-24 15:59:43,738 INFO [train.py:996] (1/4) Epoch 10, batch 22350, loss[loss=0.2389, simple_loss=0.3062, pruned_loss=0.08581, over 21885.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3117, pruned_loss=0.084, over 4296596.30 frames. 
], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:00:26,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1780932.0, ans=0.125 2023-06-24 16:00:49,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1780992.0, ans=0.025 2023-06-24 16:01:08,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-24 16:01:21,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1781052.0, ans=0.05 2023-06-24 16:01:21,869 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-24 16:01:25,458 INFO [train.py:996] (1/4) Epoch 10, batch 22400, loss[loss=0.2341, simple_loss=0.3145, pruned_loss=0.07686, over 21745.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3098, pruned_loss=0.08146, over 4298067.04 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:01:44,795 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-24 16:01:51,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.474e+02 7.856e+02 9.897e+02 1.374e+03 2.984e+03, threshold=1.979e+03, percent-clipped=5.0 2023-06-24 16:02:09,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1781232.0, ans=0.0 2023-06-24 16:02:56,383 INFO [train.py:996] (1/4) Epoch 10, batch 22450, loss[loss=0.1841, simple_loss=0.2575, pruned_loss=0.05538, over 21725.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3035, pruned_loss=0.08042, over 4288090.01 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:02:56,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781412.0, ans=0.1 2023-06-24 16:03:24,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1781472.0, ans=0.125 2023-06-24 16:04:33,430 INFO [train.py:996] (1/4) Epoch 10, batch 22500, loss[loss=0.2156, simple_loss=0.3196, pruned_loss=0.05586, over 20808.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.2996, pruned_loss=0.07983, over 4276640.72 frames. 
], batch size: 607, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:04:42,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1781712.0, ans=0.125 2023-06-24 16:04:49,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1781772.0, ans=0.2 2023-06-24 16:05:06,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.137e+02 1.060e+03 1.856e+03 3.830e+03, threshold=2.121e+03, percent-clipped=17.0 2023-06-24 16:05:27,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781832.0, ans=0.1 2023-06-24 16:06:03,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1781952.0, ans=0.0 2023-06-24 16:06:10,752 INFO [train.py:996] (1/4) Epoch 10, batch 22550, loss[loss=0.2164, simple_loss=0.2889, pruned_loss=0.07197, over 21820.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3029, pruned_loss=0.07985, over 4277823.79 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:07:49,292 INFO [train.py:996] (1/4) Epoch 10, batch 22600, loss[loss=0.203, simple_loss=0.2783, pruned_loss=0.06388, over 21634.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3066, pruned_loss=0.0796, over 4275505.52 frames. ], batch size: 230, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:08:15,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1782372.0, ans=0.2 2023-06-24 16:08:27,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.566e+02 7.815e+02 1.188e+03 1.926e+03 4.524e+03, threshold=2.375e+03, percent-clipped=20.0 2023-06-24 16:08:27,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1782372.0, ans=0.125 2023-06-24 16:09:13,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1782552.0, ans=0.125 2023-06-24 16:09:25,463 INFO [train.py:996] (1/4) Epoch 10, batch 22650, loss[loss=0.2319, simple_loss=0.2874, pruned_loss=0.08817, over 21542.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3027, pruned_loss=0.0794, over 4272650.92 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:10:29,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782792.0, ans=0.1 2023-06-24 16:10:58,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1782852.0, ans=0.125 2023-06-24 16:11:01,273 INFO [train.py:996] (1/4) Epoch 10, batch 22700, loss[loss=0.2251, simple_loss=0.2881, pruned_loss=0.08106, over 21784.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2983, pruned_loss=0.0793, over 4271936.43 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:11:38,442 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 7.699e+02 1.029e+03 1.382e+03 2.516e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-24 16:11:49,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.84 vs. 
limit=15.0 2023-06-24 16:11:56,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1783032.0, ans=0.125 2023-06-24 16:12:37,728 INFO [train.py:996] (1/4) Epoch 10, batch 22750, loss[loss=0.2561, simple_loss=0.3221, pruned_loss=0.09499, over 21773.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2988, pruned_loss=0.08092, over 4278626.16 frames. ], batch size: 332, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:12:41,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1783212.0, ans=0.5 2023-06-24 16:12:46,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-24 16:12:55,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1783212.0, ans=0.125 2023-06-24 16:13:54,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1783392.0, ans=0.125 2023-06-24 16:13:57,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1783452.0, ans=0.0 2023-06-24 16:14:14,073 INFO [train.py:996] (1/4) Epoch 10, batch 22800, loss[loss=0.1969, simple_loss=0.2713, pruned_loss=0.0613, over 21852.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3037, pruned_loss=0.08283, over 4280538.51 frames. ], batch size: 247, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:14:48,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1783572.0, ans=0.125 2023-06-24 16:14:51,216 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.113e+02 9.771e+02 1.479e+03 3.289e+03, threshold=1.954e+03, percent-clipped=6.0 2023-06-24 16:14:56,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1783632.0, ans=0.2 2023-06-24 16:15:19,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1783692.0, ans=0.2 2023-06-24 16:15:49,977 INFO [train.py:996] (1/4) Epoch 10, batch 22850, loss[loss=0.2099, simple_loss=0.2739, pruned_loss=0.07297, over 21760.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3007, pruned_loss=0.08224, over 4265019.14 frames. ], batch size: 112, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:15:50,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1783812.0, ans=0.125 2023-06-24 16:15:58,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783812.0, ans=0.1 2023-06-24 16:16:01,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. 
limit=15.0 2023-06-24 16:16:05,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1783872.0, ans=0.125 2023-06-24 16:16:23,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1783872.0, ans=0.0 2023-06-24 16:16:43,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1783932.0, ans=0.0 2023-06-24 16:16:54,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1783992.0, ans=0.125 2023-06-24 16:16:58,590 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-24 16:17:19,566 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-24 16:17:27,984 INFO [train.py:996] (1/4) Epoch 10, batch 22900, loss[loss=0.246, simple_loss=0.3513, pruned_loss=0.07035, over 21704.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3001, pruned_loss=0.08094, over 4257986.71 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:18:11,497 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:18:12,794 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.158e+02 1.057e+03 1.638e+03 3.126e+03, threshold=2.114e+03, percent-clipped=14.0 2023-06-24 16:18:43,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1784292.0, ans=0.0 2023-06-24 16:18:49,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1784352.0, ans=0.07 2023-06-24 16:19:16,536 INFO [train.py:996] (1/4) Epoch 10, batch 22950, loss[loss=0.218, simple_loss=0.3242, pruned_loss=0.05592, over 21208.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3126, pruned_loss=0.07912, over 4254261.11 frames. ], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:20:20,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1784592.0, ans=0.125 2023-06-24 16:20:52,493 INFO [train.py:996] (1/4) Epoch 10, batch 23000, loss[loss=0.2355, simple_loss=0.3827, pruned_loss=0.04414, over 20832.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.313, pruned_loss=0.0773, over 4259299.40 frames. ], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:21:26,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1784772.0, ans=0.125 2023-06-24 16:21:30,666 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.479e+02 7.106e+02 9.780e+02 1.454e+03 3.933e+03, threshold=1.956e+03, percent-clipped=7.0 2023-06-24 16:21:38,050 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. 
limit=15.0 2023-06-24 16:21:42,231 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:21:43,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1784832.0, ans=0.95 2023-06-24 16:22:02,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1784952.0, ans=0.0 2023-06-24 16:22:12,970 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-24 16:22:26,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1784952.0, ans=0.125 2023-06-24 16:22:36,144 INFO [train.py:996] (1/4) Epoch 10, batch 23050, loss[loss=0.1988, simple_loss=0.2746, pruned_loss=0.06151, over 21142.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3152, pruned_loss=0.07966, over 4260887.50 frames. ], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:22:43,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1785012.0, ans=0.5 2023-06-24 16:22:48,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1785012.0, ans=10.0 2023-06-24 16:24:11,779 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-24 16:24:13,729 INFO [train.py:996] (1/4) Epoch 10, batch 23100, loss[loss=0.2199, simple_loss=0.2808, pruned_loss=0.07956, over 21610.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3113, pruned_loss=0.08045, over 4265821.45 frames. ], batch size: 415, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:24:14,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1785312.0, ans=0.05 2023-06-24 16:24:45,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785372.0, ans=0.1 2023-06-24 16:24:47,869 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.180e+02 6.884e+02 9.412e+02 1.257e+03 2.198e+03, threshold=1.882e+03, percent-clipped=3.0 2023-06-24 16:25:05,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1785492.0, ans=10.0 2023-06-24 16:25:13,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1785492.0, ans=0.125 2023-06-24 16:25:18,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1785492.0, ans=0.2 2023-06-24 16:25:41,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1785552.0, ans=0.125 2023-06-24 16:25:50,176 INFO [train.py:996] (1/4) Epoch 10, batch 23150, loss[loss=0.2508, simple_loss=0.3156, pruned_loss=0.09301, over 21713.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3054, pruned_loss=0.07997, over 4267221.98 frames. 
], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:25:58,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1785612.0, ans=0.0 2023-06-24 16:26:02,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1785612.0, ans=0.125 2023-06-24 16:26:33,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1785732.0, ans=0.125 2023-06-24 16:26:56,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-24 16:27:20,642 INFO [train.py:996] (1/4) Epoch 10, batch 23200, loss[loss=0.2208, simple_loss=0.2981, pruned_loss=0.07177, over 21473.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3048, pruned_loss=0.08082, over 4279315.53 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:27:30,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-24 16:27:39,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1785972.0, ans=0.125 2023-06-24 16:28:00,000 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.989e+02 9.310e+02 1.266e+03 2.936e+03, threshold=1.862e+03, percent-clipped=9.0 2023-06-24 16:28:23,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1786092.0, ans=0.1 2023-06-24 16:28:28,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1786092.0, ans=0.125 2023-06-24 16:28:49,720 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 16:28:51,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-24 16:29:01,251 INFO [train.py:996] (1/4) Epoch 10, batch 23250, loss[loss=0.1901, simple_loss=0.2585, pruned_loss=0.06086, over 21218.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.305, pruned_loss=0.08181, over 4283611.48 frames. ], batch size: 608, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:29:43,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1786332.0, ans=0.2 2023-06-24 16:29:51,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1786392.0, ans=0.2 2023-06-24 16:30:09,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1786392.0, ans=0.0 2023-06-24 16:30:37,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1786512.0, ans=0.0 2023-06-24 16:30:38,208 INFO [train.py:996] (1/4) Epoch 10, batch 23300, loss[loss=0.2305, simple_loss=0.327, pruned_loss=0.06703, over 21326.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3121, pruned_loss=0.08263, over 4287105.56 frames. 
], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:30:54,091 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:31:14,186 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.511e+02 8.047e+02 1.147e+03 1.597e+03 3.212e+03, threshold=2.293e+03, percent-clipped=17.0 2023-06-24 16:31:17,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1786632.0, ans=0.125 2023-06-24 16:31:29,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786692.0, ans=0.1 2023-06-24 16:32:05,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1786752.0, ans=0.0 2023-06-24 16:32:10,057 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:32:20,665 INFO [train.py:996] (1/4) Epoch 10, batch 23350, loss[loss=0.1522, simple_loss=0.2125, pruned_loss=0.046, over 17076.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3147, pruned_loss=0.08153, over 4271144.38 frames. ], batch size: 63, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:33:44,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1787052.0, ans=0.2 2023-06-24 16:33:53,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1787052.0, ans=0.125 2023-06-24 16:33:57,548 INFO [train.py:996] (1/4) Epoch 10, batch 23400, loss[loss=0.2417, simple_loss=0.3005, pruned_loss=0.09147, over 21439.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3093, pruned_loss=0.07805, over 4273771.53 frames. ], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:34:28,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.034e+02 7.282e+02 9.879e+02 1.387e+03 3.167e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-24 16:34:33,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-24 16:35:03,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-24 16:35:14,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1787352.0, ans=0.1 2023-06-24 16:35:34,826 INFO [train.py:996] (1/4) Epoch 10, batch 23450, loss[loss=0.28, simple_loss=0.3569, pruned_loss=0.1016, over 21804.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3114, pruned_loss=0.08161, over 4271824.22 frames. 
], batch size: 124, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:35:58,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1787472.0, ans=0.125 2023-06-24 16:36:20,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1787532.0, ans=0.125 2023-06-24 16:36:36,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1787592.0, ans=0.125 2023-06-24 16:36:49,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-24 16:37:04,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1787652.0, ans=0.125 2023-06-24 16:37:09,841 INFO [train.py:996] (1/4) Epoch 10, batch 23500, loss[loss=0.2366, simple_loss=0.2961, pruned_loss=0.08852, over 21390.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3109, pruned_loss=0.08245, over 4275834.68 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:37:25,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1787772.0, ans=0.125 2023-06-24 16:37:25,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1787772.0, ans=0.2 2023-06-24 16:37:45,944 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 6.777e+02 9.961e+02 1.518e+03 3.385e+03, threshold=1.992e+03, percent-clipped=9.0 2023-06-24 16:38:43,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1787952.0, ans=0.1 2023-06-24 16:38:46,204 INFO [train.py:996] (1/4) Epoch 10, batch 23550, loss[loss=0.2307, simple_loss=0.2872, pruned_loss=0.08713, over 21742.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3071, pruned_loss=0.08206, over 4260287.39 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:38:59,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1788012.0, ans=0.0 2023-06-24 16:39:00,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788072.0, ans=0.1 2023-06-24 16:39:30,832 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788132.0, ans=0.1 2023-06-24 16:40:18,355 INFO [train.py:996] (1/4) Epoch 10, batch 23600, loss[loss=0.2266, simple_loss=0.3108, pruned_loss=0.07122, over 16930.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3056, pruned_loss=0.0819, over 4253922.65 frames. 
], batch size: 60, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:40:47,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1788372.0, ans=0.04949747468305833 2023-06-24 16:41:05,498 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 8.088e+02 1.163e+03 1.528e+03 3.406e+03, threshold=2.327e+03, percent-clipped=14.0 2023-06-24 16:41:12,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1788432.0, ans=0.125 2023-06-24 16:41:55,369 INFO [train.py:996] (1/4) Epoch 10, batch 23650, loss[loss=0.2656, simple_loss=0.3473, pruned_loss=0.09189, over 21272.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3045, pruned_loss=0.07995, over 4257586.71 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:42:10,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788612.0, ans=0.1 2023-06-24 16:42:12,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1788612.0, ans=0.0 2023-06-24 16:43:06,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1788792.0, ans=0.04949747468305833 2023-06-24 16:43:15,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1788792.0, ans=0.04949747468305833 2023-06-24 16:43:17,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1788852.0, ans=0.0 2023-06-24 16:43:27,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1788852.0, ans=0.125 2023-06-24 16:43:38,449 INFO [train.py:996] (1/4) Epoch 10, batch 23700, loss[loss=0.2429, simple_loss=0.3084, pruned_loss=0.08866, over 21180.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3081, pruned_loss=0.079, over 4260753.51 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:44:20,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1788972.0, ans=0.0 2023-06-24 16:44:23,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1789032.0, ans=0.125 2023-06-24 16:44:26,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 6.427e+02 8.775e+02 1.177e+03 2.225e+03, threshold=1.755e+03, percent-clipped=0.0 2023-06-24 16:44:53,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789092.0, ans=0.1 2023-06-24 16:45:22,450 INFO [train.py:996] (1/4) Epoch 10, batch 23750, loss[loss=0.2644, simple_loss=0.3398, pruned_loss=0.0945, over 21200.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3107, pruned_loss=0.08002, over 4266147.08 frames. 
], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:45:34,048 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1789212.0, ans=0.0 2023-06-24 16:45:49,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1789272.0, ans=0.125 2023-06-24 16:46:07,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1789332.0, ans=0.0 2023-06-24 16:46:33,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789392.0, ans=0.1 2023-06-24 16:47:01,129 INFO [train.py:996] (1/4) Epoch 10, batch 23800, loss[loss=0.1949, simple_loss=0.277, pruned_loss=0.05638, over 20770.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3078, pruned_loss=0.07747, over 4257596.56 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:47:02,572 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-24 16:47:12,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789512.0, ans=0.1 2023-06-24 16:47:19,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1789512.0, ans=0.0 2023-06-24 16:47:31,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1789572.0, ans=0.125 2023-06-24 16:47:39,070 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 7.251e+02 1.165e+03 1.659e+03 4.396e+03, threshold=2.330e+03, percent-clipped=22.0 2023-06-24 16:47:56,489 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-24 16:48:44,370 INFO [train.py:996] (1/4) Epoch 10, batch 23850, loss[loss=0.2396, simple_loss=0.335, pruned_loss=0.07204, over 20717.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3186, pruned_loss=0.08024, over 4260831.24 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:48:51,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.64 vs. limit=15.0 2023-06-24 16:48:56,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1789812.0, ans=0.0 2023-06-24 16:49:10,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1789872.0, ans=0.125 2023-06-24 16:50:03,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1789992.0, ans=0.2 2023-06-24 16:50:17,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1790052.0, ans=0.0 2023-06-24 16:50:20,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-24 16:50:20,408 INFO [train.py:996] (1/4) Epoch 10, batch 23900, loss[loss=0.2068, simple_loss=0.2877, pruned_loss=0.06298, over 21638.00 frames. 
], tot_loss[loss=0.2451, simple_loss=0.3246, pruned_loss=0.08277, over 4265722.88 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:50:47,047 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-24 16:50:59,826 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 1.066e+03 1.472e+03 2.085e+03 4.372e+03, threshold=2.943e+03, percent-clipped=19.0 2023-06-24 16:51:58,867 INFO [train.py:996] (1/4) Epoch 10, batch 23950, loss[loss=0.2368, simple_loss=0.3152, pruned_loss=0.07919, over 20732.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3184, pruned_loss=0.08235, over 4271567.30 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:52:05,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1790412.0, ans=0.2 2023-06-24 16:52:42,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1790532.0, ans=0.95 2023-06-24 16:53:25,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790652.0, ans=0.1 2023-06-24 16:53:36,880 INFO [train.py:996] (1/4) Epoch 10, batch 24000, loss[loss=0.2356, simple_loss=0.3084, pruned_loss=0.08143, over 21400.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3191, pruned_loss=0.08461, over 4265807.30 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:53:36,881 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 16:53:52,743 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2655, simple_loss=0.3589, pruned_loss=0.08609, over 1796401.00 frames. 2023-06-24 16:53:52,744 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 16:53:54,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1790712.0, ans=0.125 2023-06-24 16:53:58,565 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-24 16:54:06,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1790712.0, ans=0.0 2023-06-24 16:54:12,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1790772.0, ans=0.125 2023-06-24 16:54:41,008 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.229e+02 9.460e+02 1.386e+03 2.838e+03, threshold=1.892e+03, percent-clipped=0.0 2023-06-24 16:55:11,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1790892.0, ans=0.125 2023-06-24 16:55:27,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1790952.0, ans=0.1 2023-06-24 16:55:31,748 INFO [train.py:996] (1/4) Epoch 10, batch 24050, loss[loss=0.2162, simple_loss=0.306, pruned_loss=0.06321, over 21622.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3209, pruned_loss=0.08506, over 4267599.97 frames. 
], batch size: 263, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:55:32,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1791012.0, ans=0.125 2023-06-24 16:55:51,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1791072.0, ans=0.125 2023-06-24 16:56:11,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1791072.0, ans=0.0 2023-06-24 16:56:11,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1791072.0, ans=0.125 2023-06-24 16:56:25,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1791132.0, ans=0.0 2023-06-24 16:56:33,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1791132.0, ans=0.0 2023-06-24 16:57:02,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1791252.0, ans=0.125 2023-06-24 16:57:13,076 INFO [train.py:996] (1/4) Epoch 10, batch 24100, loss[loss=0.2336, simple_loss=0.3067, pruned_loss=0.08026, over 21249.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3205, pruned_loss=0.08332, over 4270556.02 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:58:01,299 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.131e+02 9.675e+02 1.402e+03 3.208e+03, threshold=1.935e+03, percent-clipped=13.0 2023-06-24 16:58:06,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1791432.0, ans=0.125 2023-06-24 16:58:18,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1791492.0, ans=0.0 2023-06-24 16:58:22,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791492.0, ans=0.1 2023-06-24 16:58:38,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791552.0, ans=0.1 2023-06-24 16:58:52,138 INFO [train.py:996] (1/4) Epoch 10, batch 24150, loss[loss=0.2543, simple_loss=0.3147, pruned_loss=0.09691, over 21811.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3202, pruned_loss=0.08495, over 4274674.81 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:59:02,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-24 16:59:35,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. 
limit=15.0 2023-06-24 16:59:47,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1791732.0, ans=0.2 2023-06-24 17:00:14,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1791852.0, ans=0.125 2023-06-24 17:00:23,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-24 17:00:27,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1791852.0, ans=0.125 2023-06-24 17:00:30,318 INFO [train.py:996] (1/4) Epoch 10, batch 24200, loss[loss=0.295, simple_loss=0.3826, pruned_loss=0.1037, over 21180.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3225, pruned_loss=0.08574, over 4280225.74 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:00:42,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1791912.0, ans=0.125 2023-06-24 17:00:59,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-24 17:01:05,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791972.0, ans=0.1 2023-06-24 17:01:06,020 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-24 17:01:15,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.347e+02 1.079e+03 1.481e+03 2.381e+03, threshold=2.159e+03, percent-clipped=5.0 2023-06-24 17:01:27,955 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-24 17:01:55,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792152.0, ans=0.1 2023-06-24 17:02:14,143 INFO [train.py:996] (1/4) Epoch 10, batch 24250, loss[loss=0.1951, simple_loss=0.2731, pruned_loss=0.05853, over 21873.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3184, pruned_loss=0.07959, over 4273951.78 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:02:16,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1792212.0, ans=0.125 2023-06-24 17:03:38,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1792452.0, ans=0.125 2023-06-24 17:03:56,598 INFO [train.py:996] (1/4) Epoch 10, batch 24300, loss[loss=0.2146, simple_loss=0.2804, pruned_loss=0.07443, over 20236.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3117, pruned_loss=0.07379, over 4269646.16 frames. 
], batch size: 702, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:04:16,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1792572.0, ans=0.125 2023-06-24 17:04:32,609 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.370e+02 5.749e+02 8.681e+02 1.337e+03 2.668e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 17:05:11,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1792752.0, ans=0.0 2023-06-24 17:05:29,316 INFO [train.py:996] (1/4) Epoch 10, batch 24350, loss[loss=0.2373, simple_loss=0.3173, pruned_loss=0.07859, over 21437.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3058, pruned_loss=0.07266, over 4276944.02 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:05:34,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792812.0, ans=0.1 2023-06-24 17:05:47,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1792872.0, ans=0.125 2023-06-24 17:05:52,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1792872.0, ans=0.0 2023-06-24 17:06:16,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1792932.0, ans=0.2 2023-06-24 17:06:30,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1792992.0, ans=0.125 2023-06-24 17:06:56,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1793052.0, ans=0.2 2023-06-24 17:06:57,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-24 17:07:07,865 INFO [train.py:996] (1/4) Epoch 10, batch 24400, loss[loss=0.2526, simple_loss=0.3293, pruned_loss=0.08794, over 21680.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3093, pruned_loss=0.07602, over 4279017.81 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:07:26,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1793172.0, ans=0.09899494936611666 2023-06-24 17:07:29,645 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:07:29,650 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1793172.0, ans=0.0 2023-06-24 17:07:52,234 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 8.479e+02 1.173e+03 1.615e+03 2.996e+03, threshold=2.346e+03, percent-clipped=19.0 2023-06-24 17:08:17,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1793292.0, ans=0.2 2023-06-24 17:08:46,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1793352.0, ans=0.05 2023-06-24 17:08:48,995 INFO [train.py:996] (1/4) Epoch 10, batch 24450, loss[loss=0.2018, simple_loss=0.2788, pruned_loss=0.06235, over 21382.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3129, pruned_loss=0.0785, over 4282422.01 frames. 
], batch size: 131, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:10:26,076 INFO [train.py:996] (1/4) Epoch 10, batch 24500, loss[loss=0.2197, simple_loss=0.2935, pruned_loss=0.07293, over 21760.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3154, pruned_loss=0.07943, over 4288192.38 frames. ], batch size: 247, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:10:28,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1793712.0, ans=0.1 2023-06-24 17:11:07,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 6.645e+02 1.000e+03 1.722e+03 3.391e+03, threshold=2.001e+03, percent-clipped=7.0 2023-06-24 17:11:09,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.05 vs. limit=8.0 2023-06-24 17:11:30,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1793892.0, ans=0.0 2023-06-24 17:11:42,325 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-24 17:11:59,918 INFO [train.py:996] (1/4) Epoch 10, batch 24550, loss[loss=0.2223, simple_loss=0.309, pruned_loss=0.06779, over 21810.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3171, pruned_loss=0.08109, over 4291313.12 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:12:23,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794072.0, ans=0.1 2023-06-24 17:12:26,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1794072.0, ans=0.0 2023-06-24 17:13:21,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1794252.0, ans=0.125 2023-06-24 17:13:37,077 INFO [train.py:996] (1/4) Epoch 10, batch 24600, loss[loss=0.1951, simple_loss=0.2584, pruned_loss=0.06588, over 21178.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3147, pruned_loss=0.08075, over 4275974.23 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:14:02,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-24 17:14:28,281 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.632e+02 7.490e+02 1.116e+03 1.562e+03 6.451e+03, threshold=2.232e+03, percent-clipped=18.0 2023-06-24 17:14:36,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1794432.0, ans=0.125 2023-06-24 17:14:58,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-24 17:15:08,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1794552.0, ans=0.09899494936611666 2023-06-24 17:15:15,837 INFO [train.py:996] (1/4) Epoch 10, batch 24650, loss[loss=0.1871, simple_loss=0.2571, pruned_loss=0.05862, over 21874.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3096, pruned_loss=0.07988, over 4263617.35 frames. 
], batch size: 373, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:15:24,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1794612.0, ans=0.0 2023-06-24 17:15:38,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1794672.0, ans=0.125 2023-06-24 17:15:54,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1794672.0, ans=0.125 2023-06-24 17:16:25,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1794792.0, ans=0.1 2023-06-24 17:16:38,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1794852.0, ans=0.0 2023-06-24 17:16:52,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1794912.0, ans=0.1 2023-06-24 17:16:53,144 INFO [train.py:996] (1/4) Epoch 10, batch 24700, loss[loss=0.2192, simple_loss=0.2812, pruned_loss=0.07857, over 21729.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3067, pruned_loss=0.079, over 4260416.04 frames. ], batch size: 124, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:16:53,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1794912.0, ans=0.125 2023-06-24 17:17:09,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1794912.0, ans=0.125 2023-06-24 17:17:49,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.249e+02 8.587e+02 1.281e+03 3.151e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-24 17:17:59,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1795092.0, ans=0.2 2023-06-24 17:18:07,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1795092.0, ans=0.125 2023-06-24 17:18:13,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1795092.0, ans=0.0 2023-06-24 17:18:31,298 INFO [train.py:996] (1/4) Epoch 10, batch 24750, loss[loss=0.2064, simple_loss=0.2781, pruned_loss=0.06738, over 21809.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2993, pruned_loss=0.07685, over 4260010.20 frames. 
], batch size: 98, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:18:59,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1795272.0, ans=0.0 2023-06-24 17:19:09,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1795272.0, ans=0.2 2023-06-24 17:19:09,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1795272.0, ans=0.02 2023-06-24 17:19:30,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1795332.0, ans=0.0 2023-06-24 17:19:49,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1795392.0, ans=0.125 2023-06-24 17:20:07,377 INFO [train.py:996] (1/4) Epoch 10, batch 24800, loss[loss=0.2243, simple_loss=0.2856, pruned_loss=0.08152, over 21864.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2933, pruned_loss=0.07619, over 4252267.18 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 17:20:22,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1795572.0, ans=0.125 2023-06-24 17:20:49,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1795632.0, ans=0.0 2023-06-24 17:20:59,564 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.922e+02 8.431e+02 1.285e+03 2.453e+03, threshold=1.686e+03, percent-clipped=12.0 2023-06-24 17:21:30,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-24 17:21:45,151 INFO [train.py:996] (1/4) Epoch 10, batch 24850, loss[loss=0.2721, simple_loss=0.3531, pruned_loss=0.09551, over 21524.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2947, pruned_loss=0.078, over 4256575.25 frames. ], batch size: 471, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:22:07,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1795872.0, ans=0.1 2023-06-24 17:22:46,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1795932.0, ans=0.125 2023-06-24 17:22:51,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1795992.0, ans=0.05 2023-06-24 17:22:55,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1795992.0, ans=0.0 2023-06-24 17:22:57,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1795992.0, ans=0.1 2023-06-24 17:23:05,533 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:23:08,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1796052.0, ans=0.0 2023-06-24 17:23:22,414 INFO [train.py:996] (1/4) Epoch 10, batch 24900, loss[loss=0.2797, simple_loss=0.3443, pruned_loss=0.1075, over 21197.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2975, pruned_loss=0.079, over 4257661.24 frames. 
], batch size: 143, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:24:02,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1796172.0, ans=0.0 2023-06-24 17:24:20,585 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 8.107e+02 1.257e+03 1.989e+03 3.453e+03, threshold=2.514e+03, percent-clipped=33.0 2023-06-24 17:24:35,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1796292.0, ans=0.125 2023-06-24 17:24:43,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1796352.0, ans=0.125 2023-06-24 17:24:48,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1796352.0, ans=0.125 2023-06-24 17:25:00,300 INFO [train.py:996] (1/4) Epoch 10, batch 24950, loss[loss=0.2677, simple_loss=0.3655, pruned_loss=0.08499, over 17298.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3064, pruned_loss=0.08367, over 4260604.35 frames. ], batch size: 60, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:25:51,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1796532.0, ans=0.125 2023-06-24 17:26:02,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1796532.0, ans=0.2 2023-06-24 17:26:15,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796592.0, ans=0.1 2023-06-24 17:26:16,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1796592.0, ans=10.0 2023-06-24 17:26:29,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1796652.0, ans=0.125 2023-06-24 17:26:34,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1796652.0, ans=0.125 2023-06-24 17:26:43,363 INFO [train.py:996] (1/4) Epoch 10, batch 25000, loss[loss=0.2163, simple_loss=0.2845, pruned_loss=0.07409, over 21531.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3141, pruned_loss=0.0856, over 4270631.14 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:26:50,841 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=22.5 2023-06-24 17:27:07,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1796772.0, ans=0.125 2023-06-24 17:27:37,649 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.141e+02 9.275e+02 1.381e+03 2.945e+03, threshold=1.855e+03, percent-clipped=4.0 2023-06-24 17:27:41,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1796832.0, ans=0.2 2023-06-24 17:27:46,053 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:28:00,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1796952.0, ans=0.0 2023-06-24 17:28:31,899 INFO [train.py:996] (1/4) Epoch 10, batch 25050, loss[loss=0.2354, simple_loss=0.2991, pruned_loss=0.08584, over 21808.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3066, pruned_loss=0.08356, over 4259753.83 frames. ], batch size: 352, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:28:49,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797012.0, ans=0.1 2023-06-24 17:29:03,769 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-24 17:29:07,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1797132.0, ans=0.0 2023-06-24 17:29:28,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1797192.0, ans=0.0 2023-06-24 17:30:03,462 INFO [train.py:996] (1/4) Epoch 10, batch 25100, loss[loss=0.2181, simple_loss=0.3109, pruned_loss=0.06265, over 21737.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3015, pruned_loss=0.08186, over 4247577.99 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:30:34,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1797372.0, ans=0.125 2023-06-24 17:30:35,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 17:30:38,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1797372.0, ans=0.125 2023-06-24 17:30:51,808 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.255e+02 9.404e+02 1.549e+03 2.850e+03, threshold=1.881e+03, percent-clipped=12.0 2023-06-24 17:31:35,843 INFO [train.py:996] (1/4) Epoch 10, batch 25150, loss[loss=0.2258, simple_loss=0.3083, pruned_loss=0.07164, over 21866.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3039, pruned_loss=0.07949, over 4245035.50 frames. 
], batch size: 371, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:32:24,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1797732.0, ans=0.125 2023-06-24 17:32:38,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1797792.0, ans=0.05 2023-06-24 17:33:00,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1797852.0, ans=0.0 2023-06-24 17:33:12,422 INFO [train.py:996] (1/4) Epoch 10, batch 25200, loss[loss=0.2072, simple_loss=0.2885, pruned_loss=0.06298, over 21421.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3022, pruned_loss=0.07662, over 4250734.98 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:33:40,204 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-24 17:33:55,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1798032.0, ans=0.125 2023-06-24 17:34:00,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.948e+02 1.057e+03 1.508e+03 2.758e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-24 17:34:10,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1798092.0, ans=0.0 2023-06-24 17:34:20,736 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-24 17:34:25,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.92 vs. limit=15.0 2023-06-24 17:34:49,136 INFO [train.py:996] (1/4) Epoch 10, batch 25250, loss[loss=0.2283, simple_loss=0.2878, pruned_loss=0.08439, over 21262.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3005, pruned_loss=0.07554, over 4261124.22 frames. ], batch size: 160, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:34:52,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1798212.0, ans=0.125 2023-06-24 17:35:36,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1798332.0, ans=0.2 2023-06-24 17:35:53,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1798392.0, ans=0.0 2023-06-24 17:35:58,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798392.0, ans=0.1 2023-06-24 17:36:26,977 INFO [train.py:996] (1/4) Epoch 10, batch 25300, loss[loss=0.2387, simple_loss=0.3142, pruned_loss=0.0816, over 21737.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2984, pruned_loss=0.0751, over 4255645.26 frames. 
], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:37:15,752 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.891e+02 9.341e+02 1.406e+03 3.031e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-24 17:37:43,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1798752.0, ans=0.0 2023-06-24 17:37:44,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.61 vs. limit=12.0 2023-06-24 17:38:10,247 INFO [train.py:996] (1/4) Epoch 10, batch 25350, loss[loss=0.1951, simple_loss=0.2808, pruned_loss=0.05468, over 21780.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2998, pruned_loss=0.0738, over 4257587.46 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:38:21,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1798812.0, ans=0.125 2023-06-24 17:38:22,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-24 17:38:26,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1798812.0, ans=0.0 2023-06-24 17:39:26,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1799052.0, ans=0.125 2023-06-24 17:39:42,364 INFO [train.py:996] (1/4) Epoch 10, batch 25400, loss[loss=0.2295, simple_loss=0.3014, pruned_loss=0.07882, over 21789.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2967, pruned_loss=0.07333, over 4256378.20 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:39:54,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1799112.0, ans=0.2 2023-06-24 17:40:19,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1799172.0, ans=0.015 2023-06-24 17:40:31,213 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.306e+02 6.316e+02 8.896e+02 1.149e+03 2.761e+03, threshold=1.779e+03, percent-clipped=5.0 2023-06-24 17:41:10,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-06-24 17:41:24,899 INFO [train.py:996] (1/4) Epoch 10, batch 25450, loss[loss=0.1992, simple_loss=0.2979, pruned_loss=0.05023, over 21808.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2963, pruned_loss=0.07393, over 4258283.62 frames. ], batch size: 333, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:41:25,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1799412.0, ans=0.1 2023-06-24 17:41:39,657 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799412.0, ans=0.125 2023-06-24 17:42:08,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1799532.0, ans=0.125 2023-06-24 17:42:46,950 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:43:04,196 INFO [train.py:996] (1/4) Epoch 10, batch 25500, loss[loss=0.3002, simple_loss=0.3742, pruned_loss=0.1131, over 21360.00 frames. 
], tot_loss[loss=0.2208, simple_loss=0.2976, pruned_loss=0.07206, over 4250592.79 frames. ], batch size: 507, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:43:26,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1799772.0, ans=0.0 2023-06-24 17:43:28,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1799772.0, ans=0.125 2023-06-24 17:43:31,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-24 17:43:42,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1799832.0, ans=0.125 2023-06-24 17:43:43,848 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.327e+02 6.563e+02 1.058e+03 1.442e+03 3.790e+03, threshold=2.117e+03, percent-clipped=15.0 2023-06-24 17:44:39,402 INFO [train.py:996] (1/4) Epoch 10, batch 25550, loss[loss=0.2181, simple_loss=0.3224, pruned_loss=0.05691, over 21787.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3056, pruned_loss=0.07278, over 4256411.33 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:44:40,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1800012.0, ans=0.125 2023-06-24 17:45:10,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1800132.0, ans=0.125 2023-06-24 17:46:12,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1800252.0, ans=0.125 2023-06-24 17:46:17,027 INFO [train.py:996] (1/4) Epoch 10, batch 25600, loss[loss=0.2759, simple_loss=0.3682, pruned_loss=0.09184, over 21556.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3098, pruned_loss=0.07393, over 4254180.05 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:46:33,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800372.0, ans=0.1 2023-06-24 17:46:58,014 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 8.019e+02 1.314e+03 1.703e+03 3.186e+03, threshold=2.628e+03, percent-clipped=13.0 2023-06-24 17:47:53,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1800612.0, ans=0.2 2023-06-24 17:47:54,747 INFO [train.py:996] (1/4) Epoch 10, batch 25650, loss[loss=0.2193, simple_loss=0.2855, pruned_loss=0.07657, over 16274.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3117, pruned_loss=0.07722, over 4256547.05 frames. ], batch size: 65, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:48:22,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1800672.0, ans=0.125 2023-06-24 17:48:47,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1800792.0, ans=0.0 2023-06-24 17:49:28,890 INFO [train.py:996] (1/4) Epoch 10, batch 25700, loss[loss=0.2356, simple_loss=0.2981, pruned_loss=0.08657, over 21906.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.31, pruned_loss=0.07833, over 4251531.89 frames. 
], batch size: 107, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:49:33,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-24 17:49:39,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1800912.0, ans=0.125 2023-06-24 17:49:41,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1800912.0, ans=0.0 2023-06-24 17:49:52,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1800972.0, ans=0.125 2023-06-24 17:50:16,734 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.817e+02 1.358e+03 2.096e+03 4.463e+03, threshold=2.717e+03, percent-clipped=14.0 2023-06-24 17:50:49,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1801152.0, ans=0.5 2023-06-24 17:51:03,528 INFO [train.py:996] (1/4) Epoch 10, batch 25750, loss[loss=0.265, simple_loss=0.3449, pruned_loss=0.09259, over 21498.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3153, pruned_loss=0.08183, over 4261364.38 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:51:08,042 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-24 17:51:10,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1801212.0, ans=0.035 2023-06-24 17:51:42,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1801272.0, ans=0.015 2023-06-24 17:51:42,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1801272.0, ans=0.125 2023-06-24 17:52:10,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1801392.0, ans=0.0 2023-06-24 17:52:35,268 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1801452.0, ans=0.0 2023-06-24 17:52:38,277 INFO [train.py:996] (1/4) Epoch 10, batch 25800, loss[loss=0.3113, simple_loss=0.3804, pruned_loss=0.1211, over 21333.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3261, pruned_loss=0.08588, over 4270924.75 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:52:42,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1801512.0, ans=0.0 2023-06-24 17:53:13,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1801572.0, ans=0.05 2023-06-24 17:53:32,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1801632.0, ans=0.1 2023-06-24 17:53:37,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-24 17:53:40,512 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. 
limit=15.0 2023-06-24 17:53:40,821 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 7.699e+02 1.038e+03 1.788e+03 4.629e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-24 17:53:42,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801632.0, ans=0.1 2023-06-24 17:53:52,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1801692.0, ans=0.0 2023-06-24 17:54:04,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1801752.0, ans=0.125 2023-06-24 17:54:17,138 INFO [train.py:996] (1/4) Epoch 10, batch 25850, loss[loss=0.2525, simple_loss=0.315, pruned_loss=0.09501, over 21486.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3279, pruned_loss=0.08497, over 4273863.07 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:54:55,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1801872.0, ans=0.2 2023-06-24 17:54:59,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-24 17:55:24,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1801992.0, ans=0.05 2023-06-24 17:55:57,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1802052.0, ans=0.0 2023-06-24 17:56:01,203 INFO [train.py:996] (1/4) Epoch 10, batch 25900, loss[loss=0.2461, simple_loss=0.3352, pruned_loss=0.07853, over 21445.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3278, pruned_loss=0.08558, over 4282952.81 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:56:27,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1802172.0, ans=0.07 2023-06-24 17:56:51,836 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-24 17:56:53,713 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 7.173e+02 1.132e+03 1.463e+03 2.574e+03, threshold=2.264e+03, percent-clipped=5.0 2023-06-24 17:57:29,858 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-24 17:57:39,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-24 17:57:45,026 INFO [train.py:996] (1/4) Epoch 10, batch 25950, loss[loss=0.3032, simple_loss=0.3672, pruned_loss=0.1196, over 21357.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3341, pruned_loss=0.08898, over 4273770.15 frames. ], batch size: 507, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:58:28,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. 
limit=15.0 2023-06-24 17:58:43,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1802592.0, ans=10.0 2023-06-24 17:59:23,452 INFO [train.py:996] (1/4) Epoch 10, batch 26000, loss[loss=0.2646, simple_loss=0.3451, pruned_loss=0.09208, over 21916.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3352, pruned_loss=0.08865, over 4272777.55 frames. ], batch size: 372, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:59:35,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-24 18:00:06,703 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.493e+02 7.586e+02 1.074e+03 1.522e+03 3.008e+03, threshold=2.148e+03, percent-clipped=6.0 2023-06-24 18:00:56,237 INFO [train.py:996] (1/4) Epoch 10, batch 26050, loss[loss=0.2299, simple_loss=0.2967, pruned_loss=0.08156, over 21918.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3327, pruned_loss=0.08796, over 4276961.57 frames. ], batch size: 316, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:02:22,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1803252.0, ans=0.0 2023-06-24 18:02:32,725 INFO [train.py:996] (1/4) Epoch 10, batch 26100, loss[loss=0.2348, simple_loss=0.3027, pruned_loss=0.08348, over 21465.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3266, pruned_loss=0.0878, over 4285188.10 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:02:54,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1803372.0, ans=0.0 2023-06-24 18:02:58,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1803372.0, ans=0.07 2023-06-24 18:03:02,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1803432.0, ans=0.2 2023-06-24 18:03:15,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.148e+02 7.025e+02 9.795e+02 1.501e+03 3.322e+03, threshold=1.959e+03, percent-clipped=9.0 2023-06-24 18:03:46,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-24 18:04:05,732 INFO [train.py:996] (1/4) Epoch 10, batch 26150, loss[loss=0.2767, simple_loss=0.3548, pruned_loss=0.09929, over 21857.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3226, pruned_loss=0.08707, over 4290955.50 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:04:17,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1803612.0, ans=0.2 2023-06-24 18:04:27,929 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=22.5 2023-06-24 18:04:50,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1803732.0, ans=0.0 2023-06-24 18:05:27,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1803852.0, ans=0.0 2023-06-24 18:05:38,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1803852.0, ans=0.1 2023-06-24 18:05:44,638 INFO [train.py:996] (1/4) Epoch 10, batch 26200, loss[loss=0.2255, simple_loss=0.3146, pruned_loss=0.06822, over 21130.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3237, pruned_loss=0.08513, over 4285587.10 frames. ], batch size: 143, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:05:46,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1803912.0, ans=0.125 2023-06-24 18:05:49,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1803912.0, ans=0.125 2023-06-24 18:06:36,103 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.225e+02 8.349e+02 1.175e+03 2.397e+03, threshold=1.670e+03, percent-clipped=3.0 2023-06-24 18:06:43,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1804032.0, ans=0.125 2023-06-24 18:07:16,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1804212.0, ans=0.0 2023-06-24 18:07:17,740 INFO [train.py:996] (1/4) Epoch 10, batch 26250, loss[loss=0.2234, simple_loss=0.2957, pruned_loss=0.07558, over 21832.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3269, pruned_loss=0.08463, over 4287059.70 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:07:37,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 18:08:18,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804332.0, ans=0.125 2023-06-24 18:08:28,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1804392.0, ans=0.1 2023-06-24 18:08:39,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1804452.0, ans=0.125 2023-06-24 18:08:52,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1804512.0, ans=0.2 2023-06-24 18:08:53,831 INFO [train.py:996] (1/4) Epoch 10, batch 26300, loss[loss=0.2116, simple_loss=0.2905, pruned_loss=0.06632, over 17146.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.324, pruned_loss=0.08535, over 4287893.23 frames. ], batch size: 60, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:08:59,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1804512.0, ans=10.0 2023-06-24 18:09:03,084 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-06-24 18:09:51,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 7.232e+02 9.217e+02 1.300e+03 2.808e+03, threshold=1.843e+03, percent-clipped=15.0 2023-06-24 18:10:27,970 INFO [train.py:996] (1/4) Epoch 10, batch 26350, loss[loss=0.2828, simple_loss=0.3471, pruned_loss=0.1093, over 21424.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3222, pruned_loss=0.08561, over 4284987.21 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:10:28,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1804812.0, ans=0.125 2023-06-24 18:10:50,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1804872.0, ans=0.0 2023-06-24 18:11:11,829 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.93 vs. limit=15.0 2023-06-24 18:12:00,229 INFO [train.py:996] (1/4) Epoch 10, batch 26400, loss[loss=0.2708, simple_loss=0.3077, pruned_loss=0.1169, over 21464.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3167, pruned_loss=0.08594, over 4285479.43 frames. ], batch size: 510, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:12:43,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1805172.0, ans=0.0 2023-06-24 18:13:04,647 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.130e+02 9.166e+02 1.346e+03 2.893e+03, threshold=1.833e+03, percent-clipped=10.0 2023-06-24 18:13:23,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1805352.0, ans=0.125 2023-06-24 18:13:27,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1805352.0, ans=0.035 2023-06-24 18:13:36,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1805352.0, ans=0.125 2023-06-24 18:13:44,722 INFO [train.py:996] (1/4) Epoch 10, batch 26450, loss[loss=0.272, simple_loss=0.3759, pruned_loss=0.08406, over 21718.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3176, pruned_loss=0.08634, over 4279144.06 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:13:45,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1805412.0, ans=0.04949747468305833 2023-06-24 18:14:57,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1805592.0, ans=0.0 2023-06-24 18:15:14,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1805652.0, ans=0.125 2023-06-24 18:15:23,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805652.0, ans=0.1 2023-06-24 18:15:33,634 INFO [train.py:996] (1/4) Epoch 10, batch 26500, loss[loss=0.2112, simple_loss=0.2811, pruned_loss=0.07062, over 21623.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3206, pruned_loss=0.0846, over 4278183.89 frames. 
], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:15:54,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1805772.0, ans=0.2 2023-06-24 18:15:55,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1805772.0, ans=0.125 2023-06-24 18:16:19,144 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.291e+02 8.492e+02 1.713e+03 2.381e+03 4.815e+03, threshold=3.427e+03, percent-clipped=46.0 2023-06-24 18:16:22,887 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1805892.0, ans=0.125 2023-06-24 18:16:23,511 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-24 18:16:39,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1805892.0, ans=0.0 2023-06-24 18:17:00,298 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=22.5 2023-06-24 18:17:05,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=22.5 2023-06-24 18:17:13,489 INFO [train.py:996] (1/4) Epoch 10, batch 26550, loss[loss=0.2847, simple_loss=0.3843, pruned_loss=0.09254, over 19812.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3171, pruned_loss=0.08135, over 4265108.69 frames. ], batch size: 703, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:17:19,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-24 18:17:20,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1806012.0, ans=0.0 2023-06-24 18:17:47,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1806132.0, ans=0.125 2023-06-24 18:18:08,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806132.0, ans=0.1 2023-06-24 18:18:18,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1806192.0, ans=0.125 2023-06-24 18:18:51,162 INFO [train.py:996] (1/4) Epoch 10, batch 26600, loss[loss=0.2545, simple_loss=0.3092, pruned_loss=0.09991, over 20217.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3165, pruned_loss=0.07813, over 4265898.96 frames. ], batch size: 707, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:18:57,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1806312.0, ans=0.5 2023-06-24 18:18:58,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.80 vs. 
limit=15.0 2023-06-24 18:19:17,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1806372.0, ans=0.125 2023-06-24 18:19:50,176 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.735e+02 9.367e+02 1.418e+03 2.947e+03, threshold=1.873e+03, percent-clipped=0.0 2023-06-24 18:20:00,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1806492.0, ans=0.2 2023-06-24 18:20:02,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1806492.0, ans=0.125 2023-06-24 18:20:09,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806492.0, ans=0.1 2023-06-24 18:20:11,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1806552.0, ans=0.5 2023-06-24 18:20:18,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806552.0, ans=0.1 2023-06-24 18:20:25,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1806552.0, ans=0.125 2023-06-24 18:20:27,317 INFO [train.py:996] (1/4) Epoch 10, batch 26650, loss[loss=0.1745, simple_loss=0.2553, pruned_loss=0.04689, over 21558.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3084, pruned_loss=0.07661, over 4258382.11 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:20:31,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1806612.0, ans=0.125 2023-06-24 18:20:35,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806612.0, ans=0.1 2023-06-24 18:21:33,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1806792.0, ans=0.0 2023-06-24 18:22:04,431 INFO [train.py:996] (1/4) Epoch 10, batch 26700, loss[loss=0.2711, simple_loss=0.3214, pruned_loss=0.1104, over 21781.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3006, pruned_loss=0.07318, over 4256251.29 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:22:06,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1806912.0, ans=0.125 2023-06-24 18:22:19,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1806972.0, ans=0.125 2023-06-24 18:22:19,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1806972.0, ans=0.125 2023-06-24 18:22:40,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1807032.0, ans=0.125 2023-06-24 18:22:41,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. 
limit=15.0 2023-06-24 18:23:03,963 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.904e+02 8.895e+02 1.266e+03 2.611e+03, threshold=1.779e+03, percent-clipped=6.0 2023-06-24 18:23:17,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1807092.0, ans=0.05 2023-06-24 18:23:37,447 INFO [train.py:996] (1/4) Epoch 10, batch 26750, loss[loss=0.2192, simple_loss=0.3084, pruned_loss=0.06496, over 21796.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2999, pruned_loss=0.07195, over 4265703.42 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:24:32,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807332.0, ans=0.1 2023-06-24 18:25:17,208 INFO [train.py:996] (1/4) Epoch 10, batch 26800, loss[loss=0.2869, simple_loss=0.3585, pruned_loss=0.1076, over 21537.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.307, pruned_loss=0.07613, over 4272818.39 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:25:19,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807512.0, ans=0.1 2023-06-24 18:25:35,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1807512.0, ans=0.04949747468305833 2023-06-24 18:26:00,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1807572.0, ans=0.0 2023-06-24 18:26:16,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.518e+02 9.832e+02 1.417e+03 2.844e+03, threshold=1.966e+03, percent-clipped=8.0 2023-06-24 18:26:20,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1807692.0, ans=0.05 2023-06-24 18:26:59,149 INFO [train.py:996] (1/4) Epoch 10, batch 26850, loss[loss=0.1885, simple_loss=0.2525, pruned_loss=0.06221, over 21605.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3094, pruned_loss=0.07946, over 4264001.35 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:27:21,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-24 18:27:32,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1807872.0, ans=0.125 2023-06-24 18:27:40,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1807932.0, ans=0.0 2023-06-24 18:27:54,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807992.0, ans=0.1 2023-06-24 18:28:15,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.65 vs. limit=15.0 2023-06-24 18:28:30,945 INFO [train.py:996] (1/4) Epoch 10, batch 26900, loss[loss=0.1934, simple_loss=0.2638, pruned_loss=0.06151, over 21746.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3014, pruned_loss=0.07907, over 4256032.43 frames. 
], batch size: 112, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:28:46,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-24 18:29:21,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1808232.0, ans=0.09899494936611666 2023-06-24 18:29:30,463 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.494e+02 9.472e+02 1.508e+03 3.136e+03, threshold=1.894e+03, percent-clipped=8.0 2023-06-24 18:29:43,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1808292.0, ans=0.0 2023-06-24 18:29:49,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1808352.0, ans=0.125 2023-06-24 18:30:03,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1808352.0, ans=0.0 2023-06-24 18:30:07,730 INFO [train.py:996] (1/4) Epoch 10, batch 26950, loss[loss=0.2825, simple_loss=0.3745, pruned_loss=0.09523, over 21223.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3001, pruned_loss=0.07918, over 4261034.56 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:31:19,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-24 18:31:25,457 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-24 18:31:41,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1808652.0, ans=0.125 2023-06-24 18:31:54,854 INFO [train.py:996] (1/4) Epoch 10, batch 27000, loss[loss=0.1895, simple_loss=0.2698, pruned_loss=0.05461, over 21190.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3004, pruned_loss=0.07686, over 4254804.25 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:31:54,854 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 18:32:16,362 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2412, simple_loss=0.3374, pruned_loss=0.07247, over 1796401.00 frames. 2023-06-24 18:32:16,363 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 18:32:25,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-24 18:32:26,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1808712.0, ans=0.05 2023-06-24 18:32:26,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-24 18:32:28,482 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-06-24 18:32:48,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1808832.0, ans=0.125 2023-06-24 18:32:49,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-24 18:33:07,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808892.0, ans=0.1 2023-06-24 18:33:08,819 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 6.448e+02 9.390e+02 1.351e+03 2.937e+03, threshold=1.878e+03, percent-clipped=11.0 2023-06-24 18:33:18,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1808892.0, ans=0.0 2023-06-24 18:33:55,950 INFO [train.py:996] (1/4) Epoch 10, batch 27050, loss[loss=0.2257, simple_loss=0.3087, pruned_loss=0.0713, over 21829.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3028, pruned_loss=0.07319, over 4256922.88 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:34:21,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809072.0, ans=0.1 2023-06-24 18:34:24,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-24 18:34:39,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1809132.0, ans=0.2 2023-06-24 18:34:40,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1809132.0, ans=0.0 2023-06-24 18:35:32,516 INFO [train.py:996] (1/4) Epoch 10, batch 27100, loss[loss=0.2778, simple_loss=0.365, pruned_loss=0.09529, over 21717.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3058, pruned_loss=0.07513, over 4271008.61 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:36:03,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1809432.0, ans=0.125 2023-06-24 18:36:20,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1809432.0, ans=0.0 2023-06-24 18:36:24,777 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.272e+02 8.340e+02 1.180e+03 2.454e+03, threshold=1.668e+03, percent-clipped=3.0 2023-06-24 18:37:10,899 INFO [train.py:996] (1/4) Epoch 10, batch 27150, loss[loss=0.3243, simple_loss=0.4102, pruned_loss=0.1192, over 21662.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3192, pruned_loss=0.07927, over 4275683.16 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:37:19,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809612.0, ans=0.1 2023-06-24 18:37:25,977 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-24 18:37:28,566 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:37:55,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1809732.0, ans=0.07 2023-06-24 18:38:08,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809792.0, ans=0.1 2023-06-24 18:38:17,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1809792.0, ans=0.2 2023-06-24 18:38:33,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1809852.0, ans=0.0 2023-06-24 18:38:42,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809852.0, ans=0.1 2023-06-24 18:38:42,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1809852.0, ans=0.2 2023-06-24 18:38:46,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809852.0, ans=0.125 2023-06-24 18:38:49,511 INFO [train.py:996] (1/4) Epoch 10, batch 27200, loss[loss=0.2431, simple_loss=0.3346, pruned_loss=0.0758, over 20693.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3273, pruned_loss=0.08127, over 4274589.42 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:39:20,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1809972.0, ans=0.0 2023-06-24 18:39:49,064 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-24 18:39:56,026 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.616e+02 1.186e+03 1.800e+03 4.357e+03, threshold=2.372e+03, percent-clipped=30.0 2023-06-24 18:39:56,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1810092.0, ans=0.1 2023-06-24 18:40:13,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1810152.0, ans=0.0 2023-06-24 18:40:21,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1810152.0, ans=0.0 2023-06-24 18:40:27,554 INFO [train.py:996] (1/4) Epoch 10, batch 27250, loss[loss=0.2756, simple_loss=0.3365, pruned_loss=0.1073, over 21478.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3302, pruned_loss=0.08575, over 4271928.54 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:41:33,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-24 18:41:34,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1810392.0, ans=0.0 2023-06-24 18:42:12,508 INFO [train.py:996] (1/4) Epoch 10, batch 27300, loss[loss=0.2391, simple_loss=0.3379, pruned_loss=0.07016, over 21732.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.33, pruned_loss=0.08532, over 4275360.33 frames. 
], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:42:32,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1810512.0, ans=0.0 2023-06-24 18:43:18,109 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.724e+02 6.828e+02 8.478e+02 1.184e+03 2.294e+03, threshold=1.696e+03, percent-clipped=0.0 2023-06-24 18:43:41,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1810752.0, ans=0.0 2023-06-24 18:43:45,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1810752.0, ans=0.0 2023-06-24 18:43:54,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1810812.0, ans=0.0 2023-06-24 18:43:55,764 INFO [train.py:996] (1/4) Epoch 10, batch 27350, loss[loss=0.2391, simple_loss=0.3178, pruned_loss=0.08016, over 21790.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3326, pruned_loss=0.08605, over 4277122.96 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:45:02,263 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-24 18:45:48,115 INFO [train.py:996] (1/4) Epoch 10, batch 27400, loss[loss=0.2135, simple_loss=0.2812, pruned_loss=0.07289, over 21786.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3278, pruned_loss=0.08579, over 4274271.79 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:46:25,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1811172.0, ans=15.0 2023-06-24 18:46:49,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.744e+02 9.251e+02 1.244e+03 3.904e+03, threshold=1.850e+03, percent-clipped=13.0 2023-06-24 18:47:39,107 INFO [train.py:996] (1/4) Epoch 10, batch 27450, loss[loss=0.2816, simple_loss=0.3495, pruned_loss=0.1068, over 21563.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3221, pruned_loss=0.08382, over 4268468.73 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:47:54,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1811412.0, ans=0.0 2023-06-24 18:48:35,099 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0 2023-06-24 18:49:11,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811652.0, ans=0.1 2023-06-24 18:49:15,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1811652.0, ans=0.0 2023-06-24 18:49:19,535 INFO [train.py:996] (1/4) Epoch 10, batch 27500, loss[loss=0.2218, simple_loss=0.2896, pruned_loss=0.07696, over 21572.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3211, pruned_loss=0.0848, over 4273375.06 frames. 
], batch size: 212, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:50:26,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.516e+02 9.018e+02 1.642e+03 4.000e+03, threshold=1.804e+03, percent-clipped=22.0 2023-06-24 18:50:28,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-24 18:51:04,434 INFO [train.py:996] (1/4) Epoch 10, batch 27550, loss[loss=0.2003, simple_loss=0.285, pruned_loss=0.05785, over 21773.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3164, pruned_loss=0.08179, over 4273297.84 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 18:51:04,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1812012.0, ans=0.2 2023-06-24 18:51:25,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1812012.0, ans=0.0 2023-06-24 18:51:36,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1812072.0, ans=0.125 2023-06-24 18:51:40,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-24 18:51:55,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.38 vs. limit=22.5 2023-06-24 18:52:36,282 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=22.5 2023-06-24 18:52:41,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-24 18:52:52,120 INFO [train.py:996] (1/4) Epoch 10, batch 27600, loss[loss=0.2322, simple_loss=0.2975, pruned_loss=0.08341, over 21716.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3093, pruned_loss=0.08155, over 4277300.86 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:53:07,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1812312.0, ans=0.0 2023-06-24 18:53:23,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1812372.0, ans=0.125 2023-06-24 18:53:24,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 18:53:25,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1812372.0, ans=0.0 2023-06-24 18:53:38,812 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. 
limit=15.0 2023-06-24 18:53:51,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.022e+02 8.920e+02 1.196e+03 2.930e+03, threshold=1.784e+03, percent-clipped=9.0 2023-06-24 18:54:00,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1812492.0, ans=0.125 2023-06-24 18:54:29,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1812552.0, ans=0.125 2023-06-24 18:54:33,760 INFO [train.py:996] (1/4) Epoch 10, batch 27650, loss[loss=0.222, simple_loss=0.3072, pruned_loss=0.06845, over 21890.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3038, pruned_loss=0.08079, over 4267504.69 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:55:15,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1812732.0, ans=0.04949747468305833 2023-06-24 18:55:59,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-24 18:56:00,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1812852.0, ans=0.125 2023-06-24 18:56:23,914 INFO [train.py:996] (1/4) Epoch 10, batch 27700, loss[loss=0.3705, simple_loss=0.4292, pruned_loss=0.1559, over 21525.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3036, pruned_loss=0.07964, over 4267255.75 frames. ], batch size: 508, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:56:25,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.65 vs. limit=10.0 2023-06-24 18:56:49,829 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:56:54,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1812972.0, ans=0.0 2023-06-24 18:56:54,534 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:56:56,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1812972.0, ans=0.125 2023-06-24 18:57:05,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-24 18:57:16,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1813032.0, ans=0.2 2023-06-24 18:57:20,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.274e+02 1.072e+03 1.414e+03 3.667e+03, threshold=2.145e+03, percent-clipped=20.0 2023-06-24 18:58:08,688 INFO [train.py:996] (1/4) Epoch 10, batch 27750, loss[loss=0.208, simple_loss=0.2872, pruned_loss=0.06443, over 21701.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3086, pruned_loss=0.0793, over 4273710.34 frames. 
], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:58:39,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1813272.0, ans=0.1 2023-06-24 18:59:45,890 INFO [train.py:996] (1/4) Epoch 10, batch 27800, loss[loss=0.2677, simple_loss=0.324, pruned_loss=0.1057, over 21792.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3068, pruned_loss=0.07873, over 4277632.09 frames. ], batch size: 508, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:00:49,654 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.266e+02 8.786e+02 1.229e+03 3.001e+03, threshold=1.757e+03, percent-clipped=8.0 2023-06-24 19:00:50,738 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.77 vs. limit=10.0 2023-06-24 19:01:40,633 INFO [train.py:996] (1/4) Epoch 10, batch 27850, loss[loss=0.3201, simple_loss=0.3888, pruned_loss=0.1257, over 21707.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3055, pruned_loss=0.07956, over 4290435.72 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:02:05,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-24 19:02:08,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1813872.0, ans=0.125 2023-06-24 19:02:12,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1813872.0, ans=0.125 2023-06-24 19:02:29,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1813932.0, ans=0.07 2023-06-24 19:02:52,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1813992.0, ans=0.0 2023-06-24 19:03:01,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1813992.0, ans=0.2 2023-06-24 19:03:33,935 INFO [train.py:996] (1/4) Epoch 10, batch 27900, loss[loss=0.2296, simple_loss=0.3182, pruned_loss=0.07055, over 21404.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3134, pruned_loss=0.08033, over 4292027.96 frames. ], batch size: 194, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:03:36,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1814112.0, ans=0.125 2023-06-24 19:04:22,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1814232.0, ans=0.2 2023-06-24 19:04:24,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1814232.0, ans=0.05 2023-06-24 19:04:37,101 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.993e+02 8.012e+02 1.137e+03 1.760e+03 3.581e+03, threshold=2.273e+03, percent-clipped=25.0 2023-06-24 19:05:02,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1814352.0, ans=0.0 2023-06-24 19:05:21,101 INFO [train.py:996] (1/4) Epoch 10, batch 27950, loss[loss=0.1768, simple_loss=0.265, pruned_loss=0.04426, over 21612.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3138, pruned_loss=0.07707, over 4288807.72 frames. 
], batch size: 195, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:05:49,187 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-24 19:06:00,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1814472.0, ans=0.07 2023-06-24 19:07:06,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814712.0, ans=0.1 2023-06-24 19:07:07,525 INFO [train.py:996] (1/4) Epoch 10, batch 28000, loss[loss=0.2162, simple_loss=0.2833, pruned_loss=0.07458, over 21310.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3126, pruned_loss=0.07573, over 4286065.10 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:07:10,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.19 vs. limit=15.0 2023-06-24 19:08:13,047 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.030e+02 6.703e+02 9.141e+02 1.311e+03 2.491e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 19:08:45,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1814952.0, ans=0.0 2023-06-24 19:08:51,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814952.0, ans=0.1 2023-06-24 19:08:53,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814952.0, ans=0.1 2023-06-24 19:08:56,439 INFO [train.py:996] (1/4) Epoch 10, batch 28050, loss[loss=0.2018, simple_loss=0.2848, pruned_loss=0.05943, over 21777.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3106, pruned_loss=0.07727, over 4289701.52 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:08:56,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1815012.0, ans=0.0 2023-06-24 19:09:14,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=22.5 2023-06-24 19:09:41,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1815132.0, ans=0.125 2023-06-24 19:10:27,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1815252.0, ans=0.125 2023-06-24 19:10:37,054 INFO [train.py:996] (1/4) Epoch 10, batch 28100, loss[loss=0.2133, simple_loss=0.2778, pruned_loss=0.07441, over 21640.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3081, pruned_loss=0.07638, over 4291281.14 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:11:09,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1815372.0, ans=0.05 2023-06-24 19:11:18,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-06-24 19:11:53,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.563e+02 7.316e+02 9.478e+02 1.492e+03 2.732e+03, threshold=1.896e+03, percent-clipped=11.0 2023-06-24 19:11:55,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1815492.0, ans=0.125 2023-06-24 19:12:03,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1815492.0, ans=0.1 2023-06-24 19:12:28,319 INFO [train.py:996] (1/4) Epoch 10, batch 28150, loss[loss=0.1809, simple_loss=0.25, pruned_loss=0.05588, over 21560.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3001, pruned_loss=0.07609, over 4290469.69 frames. ], batch size: 231, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:13:01,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1815672.0, ans=0.125 2023-06-24 19:13:03,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-24 19:13:31,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1815732.0, ans=0.0 2023-06-24 19:13:43,400 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1815792.0, ans=0.125 2023-06-24 19:14:18,401 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:14:22,725 INFO [train.py:996] (1/4) Epoch 10, batch 28200, loss[loss=0.2096, simple_loss=0.2752, pruned_loss=0.07202, over 21337.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.298, pruned_loss=0.07708, over 4279448.56 frames. ], batch size: 177, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:14:47,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1815972.0, ans=0.2 2023-06-24 19:15:15,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1816032.0, ans=0.125 2023-06-24 19:15:19,022 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-24 19:15:29,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 7.583e+02 1.086e+03 1.711e+03 4.107e+03, threshold=2.171e+03, percent-clipped=18.0 2023-06-24 19:16:09,719 INFO [train.py:996] (1/4) Epoch 10, batch 28250, loss[loss=0.2242, simple_loss=0.2801, pruned_loss=0.08416, over 21379.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.303, pruned_loss=0.08059, over 4278687.31 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:17:00,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1816332.0, ans=0.125 2023-06-24 19:17:53,574 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-24 19:17:57,320 INFO [train.py:996] (1/4) Epoch 10, batch 28300, loss[loss=0.2379, simple_loss=0.3267, pruned_loss=0.07452, over 21491.00 frames. 
], tot_loss[loss=0.2291, simple_loss=0.3013, pruned_loss=0.07847, over 4275746.65 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:18:18,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1816572.0, ans=0.125 2023-06-24 19:18:20,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-24 19:18:29,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1816572.0, ans=0.0 2023-06-24 19:18:57,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1816692.0, ans=0.04949747468305833 2023-06-24 19:19:02,674 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.412e+02 7.987e+02 1.217e+03 2.083e+03 3.970e+03, threshold=2.435e+03, percent-clipped=20.0 2023-06-24 19:19:23,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816692.0, ans=0.1 2023-06-24 19:19:32,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1816752.0, ans=15.0 2023-06-24 19:19:43,806 INFO [train.py:996] (1/4) Epoch 10, batch 28350, loss[loss=0.2242, simple_loss=0.2961, pruned_loss=0.07613, over 21718.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2998, pruned_loss=0.07326, over 4265917.25 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:20:20,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1816872.0, ans=0.05 2023-06-24 19:20:29,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1816932.0, ans=0.125 2023-06-24 19:20:29,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1816932.0, ans=0.1 2023-06-24 19:21:15,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1817052.0, ans=0.0 2023-06-24 19:21:30,140 INFO [train.py:996] (1/4) Epoch 10, batch 28400, loss[loss=0.2279, simple_loss=0.3013, pruned_loss=0.07724, over 21320.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2967, pruned_loss=0.07306, over 4258301.67 frames. 
], batch size: 549, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:22:14,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1817232.0, ans=0.05 2023-06-24 19:22:31,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1817232.0, ans=0.125 2023-06-24 19:22:47,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1817292.0, ans=0.125 2023-06-24 19:22:48,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.250e+02 7.623e+02 1.000e+03 1.515e+03 3.438e+03, threshold=2.000e+03, percent-clipped=5.0 2023-06-24 19:22:59,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1817352.0, ans=0.125 2023-06-24 19:23:22,186 INFO [train.py:996] (1/4) Epoch 10, batch 28450, loss[loss=0.2558, simple_loss=0.3203, pruned_loss=0.0956, over 21803.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3005, pruned_loss=0.07552, over 4263731.21 frames. ], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:23:56,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1817472.0, ans=0.0 2023-06-24 19:25:19,430 INFO [train.py:996] (1/4) Epoch 10, batch 28500, loss[loss=0.241, simple_loss=0.315, pruned_loss=0.08346, over 21873.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3014, pruned_loss=0.07768, over 4271082.54 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:26:11,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1817832.0, ans=0.125 2023-06-24 19:26:18,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1817832.0, ans=0.125 2023-06-24 19:26:27,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.659e+02 7.035e+02 9.248e+02 1.262e+03 2.146e+03, threshold=1.850e+03, percent-clipped=2.0 2023-06-24 19:27:07,860 INFO [train.py:996] (1/4) Epoch 10, batch 28550, loss[loss=0.2561, simple_loss=0.323, pruned_loss=0.09462, over 21243.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3105, pruned_loss=0.08111, over 4276329.99 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:27:11,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1818012.0, ans=0.125 2023-06-24 19:28:11,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1818192.0, ans=0.125 2023-06-24 19:28:43,682 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1818252.0, ans=0.125 2023-06-24 19:28:55,089 INFO [train.py:996] (1/4) Epoch 10, batch 28600, loss[loss=0.2273, simple_loss=0.306, pruned_loss=0.07432, over 21338.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3164, pruned_loss=0.08289, over 4275978.06 frames. 
], batch size: 159, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:29:04,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1818312.0, ans=0.2 2023-06-24 19:29:36,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1818372.0, ans=0.125 2023-06-24 19:29:41,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1818432.0, ans=0.04949747468305833 2023-06-24 19:30:08,441 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 7.121e+02 1.044e+03 1.505e+03 3.342e+03, threshold=2.089e+03, percent-clipped=18.0 2023-06-24 19:30:31,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1818552.0, ans=0.1 2023-06-24 19:30:39,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1818552.0, ans=0.0 2023-06-24 19:30:42,112 INFO [train.py:996] (1/4) Epoch 10, batch 28650, loss[loss=0.1856, simple_loss=0.2457, pruned_loss=0.06274, over 21546.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.311, pruned_loss=0.08301, over 4259558.35 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:30:56,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1818612.0, ans=0.0 2023-06-24 19:30:56,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1818612.0, ans=0.125 2023-06-24 19:31:18,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-24 19:31:43,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1818732.0, ans=0.125 2023-06-24 19:32:01,055 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1818792.0, ans=0.1 2023-06-24 19:32:16,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1818852.0, ans=0.125 2023-06-24 19:32:34,083 INFO [train.py:996] (1/4) Epoch 10, batch 28700, loss[loss=0.2616, simple_loss=0.3373, pruned_loss=0.09292, over 21798.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.311, pruned_loss=0.08441, over 4255054.49 frames. ], batch size: 118, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:32:57,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1818972.0, ans=0.0 2023-06-24 19:33:47,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1819092.0, ans=0.125 2023-06-24 19:33:52,203 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.396e+02 6.260e+02 7.915e+02 1.085e+03 2.283e+03, threshold=1.583e+03, percent-clipped=3.0 2023-06-24 19:34:06,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1819152.0, ans=0.2 2023-06-24 19:34:23,870 INFO [train.py:996] (1/4) Epoch 10, batch 28750, loss[loss=0.252, simple_loss=0.3217, pruned_loss=0.09113, over 21774.00 frames. 
], tot_loss[loss=0.2386, simple_loss=0.3092, pruned_loss=0.08401, over 4258582.25 frames. ], batch size: 112, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:34:35,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1819212.0, ans=0.2 2023-06-24 19:35:34,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1819392.0, ans=0.2 2023-06-24 19:35:41,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1819392.0, ans=0.125 2023-06-24 19:36:21,345 INFO [train.py:996] (1/4) Epoch 10, batch 28800, loss[loss=0.3107, simple_loss=0.3708, pruned_loss=0.1253, over 21280.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3148, pruned_loss=0.08533, over 4256958.84 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:36:23,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1819512.0, ans=0.07 2023-06-24 19:36:53,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-24 19:37:27,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1819692.0, ans=0.04949747468305833 2023-06-24 19:37:29,875 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.492e+02 9.041e+02 1.378e+03 2.887e+03, threshold=1.808e+03, percent-clipped=17.0 2023-06-24 19:38:07,911 INFO [train.py:996] (1/4) Epoch 10, batch 28850, loss[loss=0.2574, simple_loss=0.3243, pruned_loss=0.09522, over 21770.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3154, pruned_loss=0.086, over 4266974.62 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:38:25,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1819812.0, ans=0.1 2023-06-24 19:39:04,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1819932.0, ans=0.125 2023-06-24 19:39:53,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1820052.0, ans=0.0 2023-06-24 19:39:57,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1820112.0, ans=0.125 2023-06-24 19:39:58,251 INFO [train.py:996] (1/4) Epoch 10, batch 28900, loss[loss=0.2924, simple_loss=0.3652, pruned_loss=0.1098, over 21545.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3189, pruned_loss=0.08827, over 4277652.74 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:40:00,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1820112.0, ans=0.0 2023-06-24 19:41:00,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1820232.0, ans=0.0 2023-06-24 19:41:21,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1820292.0, ans=0.0 2023-06-24 19:41:22,758 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.789e+02 7.717e+02 1.074e+03 1.480e+03 3.570e+03, threshold=2.148e+03, percent-clipped=12.0 2023-06-24 19:41:49,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1820352.0, ans=0.0 2023-06-24 19:41:56,152 INFO [train.py:996] (1/4) Epoch 10, batch 28950, loss[loss=0.1636, simple_loss=0.2055, pruned_loss=0.06082, over 16733.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3195, pruned_loss=0.08712, over 4270859.45 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:42:11,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1820412.0, ans=10.0 2023-06-24 19:43:27,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1820592.0, ans=0.0 2023-06-24 19:43:52,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1820712.0, ans=0.125 2023-06-24 19:43:53,416 INFO [train.py:996] (1/4) Epoch 10, batch 29000, loss[loss=0.2558, simple_loss=0.3182, pruned_loss=0.09669, over 21445.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3229, pruned_loss=0.08601, over 4271223.92 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:43:55,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1820712.0, ans=0.2 2023-06-24 19:44:21,186 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.04 vs. limit=15.0 2023-06-24 19:45:02,232 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.875e+02 6.963e+02 8.880e+02 1.390e+03 4.828e+03, threshold=1.776e+03, percent-clipped=11.0 2023-06-24 19:45:37,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1820952.0, ans=0.125 2023-06-24 19:45:41,343 INFO [train.py:996] (1/4) Epoch 10, batch 29050, loss[loss=0.2832, simple_loss=0.3427, pruned_loss=0.1118, over 21862.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3226, pruned_loss=0.08698, over 4270399.17 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:46:57,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1821192.0, ans=0.2 2023-06-24 19:47:27,539 INFO [train.py:996] (1/4) Epoch 10, batch 29100, loss[loss=0.1846, simple_loss=0.2586, pruned_loss=0.05536, over 15376.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3146, pruned_loss=0.08487, over 4267973.34 frames. 
], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:47:30,726 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-24 19:47:35,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1821312.0, ans=0.0 2023-06-24 19:47:40,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1821312.0, ans=0.2 2023-06-24 19:48:40,597 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 7.571e+02 1.020e+03 1.542e+03 3.510e+03, threshold=2.040e+03, percent-clipped=14.0 2023-06-24 19:49:15,525 INFO [train.py:996] (1/4) Epoch 10, batch 29150, loss[loss=0.2082, simple_loss=0.2896, pruned_loss=0.06343, over 21777.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3119, pruned_loss=0.08234, over 4271020.79 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:49:28,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1821612.0, ans=0.125 2023-06-24 19:49:33,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1821672.0, ans=0.125 2023-06-24 19:49:55,282 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1821672.0, ans=0.2 2023-06-24 19:50:12,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-24 19:50:25,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1821792.0, ans=0.125 2023-06-24 19:50:51,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.38 vs. limit=12.0 2023-06-24 19:51:04,118 INFO [train.py:996] (1/4) Epoch 10, batch 29200, loss[loss=0.24, simple_loss=0.3158, pruned_loss=0.08211, over 21601.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3081, pruned_loss=0.08178, over 4274650.04 frames. ], batch size: 442, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:51:10,671 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-06-24 19:51:23,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1821912.0, ans=0.0 2023-06-24 19:51:38,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1821972.0, ans=0.2 2023-06-24 19:52:08,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1822032.0, ans=0.1 2023-06-24 19:52:14,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1822092.0, ans=0.04949747468305833 2023-06-24 19:52:17,866 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 7.371e+02 1.140e+03 1.562e+03 2.800e+03, threshold=2.281e+03, percent-clipped=10.0 2023-06-24 19:52:38,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1822152.0, ans=0.07 2023-06-24 19:52:53,173 INFO [train.py:996] (1/4) Epoch 10, batch 29250, loss[loss=0.1972, simple_loss=0.2866, pruned_loss=0.05396, over 21112.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3055, pruned_loss=0.07938, over 4277761.66 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:52:55,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1822212.0, ans=0.09899494936611666 2023-06-24 19:53:31,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1822272.0, ans=0.0 2023-06-24 19:54:16,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1822392.0, ans=0.0 2023-06-24 19:54:36,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1822452.0, ans=0.0 2023-06-24 19:54:41,046 INFO [train.py:996] (1/4) Epoch 10, batch 29300, loss[loss=0.2088, simple_loss=0.2655, pruned_loss=0.07607, over 21200.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.306, pruned_loss=0.07785, over 4274091.90 frames. ], batch size: 144, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:55:14,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1822572.0, ans=0.2 2023-06-24 19:55:56,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1822692.0, ans=0.02 2023-06-24 19:56:02,860 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.304e+02 6.696e+02 9.053e+02 1.473e+03 3.162e+03, threshold=1.811e+03, percent-clipped=5.0 2023-06-24 19:56:05,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1822692.0, ans=0.05 2023-06-24 19:56:06,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-24 19:56:30,736 INFO [train.py:996] (1/4) Epoch 10, batch 29350, loss[loss=0.1785, simple_loss=0.2406, pruned_loss=0.0582, over 20748.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3029, pruned_loss=0.07782, over 4268696.20 frames. 
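
The Whitening entries (metric=... vs. limit=...) compare a statistic of a module's activation covariance against a scheduled limit, presumably so that a whitening penalty only engages when the metric exceeds the limit; most metrics logged in this stretch sit well below their limits. The sketch below shows one common way such a metric can be defined, as the eigenvalue spread of the per-group covariance, which equals 1.0 for perfectly white features; the exact formula used by scaling.py may differ.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """Assumed metric: n * sum(eig^2) / (sum(eig))^2 of the per-group covariance."""
        num_frames, num_channels = x.shape
        assert num_channels % num_groups == 0
        x = x.reshape(num_frames, num_groups, num_channels // num_groups)
        x = x - x.mean(dim=0, keepdim=True)
        metrics = []
        for g in range(num_groups):
            cov = x[:, g, :].t() @ x[:, g, :] / num_frames
            n = cov.shape[0]
            metrics.append(n * (cov @ cov).trace() / cov.trace().clamp(min=1e-20) ** 2)
        return torch.stack(metrics).mean().item()

    feats = torch.randn(1000, 256)                # roughly white activations
    print(whitening_metric(feats, num_groups=1))  # a little above 1.0, far below the logged limits
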
], batch size: 609, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:57:05,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-24 19:58:31,664 INFO [train.py:996] (1/4) Epoch 10, batch 29400, loss[loss=0.2393, simple_loss=0.3242, pruned_loss=0.07719, over 21547.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3029, pruned_loss=0.07598, over 4266596.84 frames. ], batch size: 473, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:58:33,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1823112.0, ans=0.05 2023-06-24 19:59:06,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1823172.0, ans=0.0 2023-06-24 19:59:08,114 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:59:39,420 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.217e+02 1.208e+03 1.784e+03 4.520e+03, threshold=2.416e+03, percent-clipped=24.0 2023-06-24 19:59:49,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1823352.0, ans=0.125 2023-06-24 20:00:08,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1823352.0, ans=0.125 2023-06-24 20:00:10,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-24 20:00:13,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1823352.0, ans=0.0 2023-06-24 20:00:18,191 INFO [train.py:996] (1/4) Epoch 10, batch 29450, loss[loss=0.2451, simple_loss=0.3142, pruned_loss=0.08795, over 21819.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3037, pruned_loss=0.07612, over 4262241.08 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 20:00:29,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 20:01:45,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.64 vs. limit=6.0 2023-06-24 20:01:58,228 INFO [train.py:996] (1/4) Epoch 10, batch 29500, loss[loss=0.2795, simple_loss=0.3314, pruned_loss=0.1138, over 21209.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3062, pruned_loss=0.07873, over 4271553.18 frames. 
], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:02:18,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1823772.0, ans=0.0 2023-06-24 20:03:03,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1823892.0, ans=0.0 2023-06-24 20:03:06,150 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.744e+02 7.201e+02 9.951e+02 1.281e+03 3.079e+03, threshold=1.990e+03, percent-clipped=2.0 2023-06-24 20:03:30,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1823952.0, ans=0.125 2023-06-24 20:03:45,400 INFO [train.py:996] (1/4) Epoch 10, batch 29550, loss[loss=0.2535, simple_loss=0.3206, pruned_loss=0.0932, over 21435.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3049, pruned_loss=0.08, over 4277476.09 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:04:23,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1824072.0, ans=0.0 2023-06-24 20:04:28,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1824072.0, ans=0.035 2023-06-24 20:05:38,092 INFO [train.py:996] (1/4) Epoch 10, batch 29600, loss[loss=0.24, simple_loss=0.3077, pruned_loss=0.08611, over 21266.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.311, pruned_loss=0.08214, over 4280634.69 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:05:47,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1824312.0, ans=0.125 2023-06-24 20:06:01,549 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 20:06:02,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1824372.0, ans=0.07 2023-06-24 20:06:34,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1824492.0, ans=0.2 2023-06-24 20:06:44,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1824492.0, ans=0.125 2023-06-24 20:06:54,895 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 8.080e+02 1.124e+03 1.418e+03 4.047e+03, threshold=2.247e+03, percent-clipped=9.0 2023-06-24 20:07:23,004 INFO [train.py:996] (1/4) Epoch 10, batch 29650, loss[loss=0.2047, simple_loss=0.2788, pruned_loss=0.06525, over 21678.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3133, pruned_loss=0.08054, over 4275886.66 frames. 
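
The grad_scale field attached to each batch (16.0, then 8.0, later 32.0, and back to 16.0 in the entries above) is consistent with a dynamic mixed-precision loss scale: it is halved when a step runs into overflow or similar trouble and cautiously doubled again after a run of clean steps. A schematic version of that bookkeeping, with made-up constants, is shown below.

    class DynamicGradScaleSketch:
        """Toy model of a dynamic fp16 loss scale; constants are illustrative only."""

        def __init__(self, init_scale=16.0, growth_interval=2000):
            self.scale = init_scale
            self.growth_interval = growth_interval
            self.good_steps = 0

        def update(self, found_inf: bool) -> float:
            if found_inf:
                self.scale = max(self.scale / 2.0, 1.0)  # back off after an overflowing step
                self.good_steps = 0
            else:
                self.good_steps += 1
                if self.good_steps >= self.growth_interval:
                    self.scale *= 2.0                    # grow back after enough clean steps
                    self.good_steps = 0
            return self.scale

    scaler = DynamicGradScaleSketch()
    print(scaler.update(found_inf=True))   # -> 8.0, like the drop from 16.0 to 8.0 seen earlier
    print(scaler.update(found_inf=False))  # -> stays 8.0 until enough clean steps accumulate
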
], batch size: 263, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:07:37,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1824612.0, ans=0.0 2023-06-24 20:08:07,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1824732.0, ans=0.125 2023-06-24 20:08:54,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1824852.0, ans=10.0 2023-06-24 20:08:54,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1824852.0, ans=0.0 2023-06-24 20:09:10,426 INFO [train.py:996] (1/4) Epoch 10, batch 29700, loss[loss=0.2385, simple_loss=0.3313, pruned_loss=0.07288, over 21513.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.312, pruned_loss=0.07975, over 4268669.23 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:09:20,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1824912.0, ans=0.125 2023-06-24 20:10:01,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1825032.0, ans=0.0 2023-06-24 20:10:29,384 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 7.202e+02 1.096e+03 1.682e+03 4.155e+03, threshold=2.193e+03, percent-clipped=13.0 2023-06-24 20:10:58,054 INFO [train.py:996] (1/4) Epoch 10, batch 29750, loss[loss=0.2199, simple_loss=0.321, pruned_loss=0.05936, over 21864.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3163, pruned_loss=0.0793, over 4273577.33 frames. ], batch size: 371, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:11:06,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1825212.0, ans=0.125 2023-06-24 20:11:52,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-24 20:12:01,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-24 20:12:19,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1825392.0, ans=0.0 2023-06-24 20:12:22,718 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1825452.0, ans=0.1 2023-06-24 20:12:44,355 INFO [train.py:996] (1/4) Epoch 10, batch 29800, loss[loss=0.2097, simple_loss=0.2861, pruned_loss=0.06663, over 21886.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.317, pruned_loss=0.0791, over 4272325.26 frames. ], batch size: 371, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:13:05,038 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-24 20:13:25,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1825632.0, ans=0.2 2023-06-24 20:13:36,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. 
limit=10.0 2023-06-24 20:13:42,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1825692.0, ans=0.125 2023-06-24 20:13:56,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1825692.0, ans=0.0 2023-06-24 20:14:03,889 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.156e+02 9.428e+02 1.298e+03 2.212e+03, threshold=1.886e+03, percent-clipped=2.0 2023-06-24 20:14:32,495 INFO [train.py:996] (1/4) Epoch 10, batch 29850, loss[loss=0.1983, simple_loss=0.2805, pruned_loss=0.05809, over 21787.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3128, pruned_loss=0.07727, over 4274606.18 frames. ], batch size: 298, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:15:11,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1825932.0, ans=0.125 2023-06-24 20:15:21,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1825932.0, ans=0.0 2023-06-24 20:16:13,643 INFO [train.py:996] (1/4) Epoch 10, batch 29900, loss[loss=0.2893, simple_loss=0.35, pruned_loss=0.1143, over 21402.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3112, pruned_loss=0.07887, over 4277750.45 frames. ], batch size: 548, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:16:14,336 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:16:37,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1826172.0, ans=0.0 2023-06-24 20:17:08,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1826232.0, ans=0.1 2023-06-24 20:17:15,810 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826232.0, ans=0.1 2023-06-24 20:17:37,200 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.094e+02 6.347e+02 7.852e+02 1.102e+03 2.302e+03, threshold=1.570e+03, percent-clipped=5.0 2023-06-24 20:17:44,788 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-24 20:18:07,368 INFO [train.py:996] (1/4) Epoch 10, batch 29950, loss[loss=0.2607, simple_loss=0.3243, pruned_loss=0.09854, over 21704.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3145, pruned_loss=0.08221, over 4277951.99 frames. 
], batch size: 298, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:18:10,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1826412.0, ans=0.1 2023-06-24 20:18:25,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1826472.0, ans=0.02 2023-06-24 20:18:25,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1826472.0, ans=0.0 2023-06-24 20:18:26,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1826472.0, ans=0.125 2023-06-24 20:18:46,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1826532.0, ans=0.0 2023-06-24 20:18:57,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-24 20:19:07,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1826532.0, ans=0.025 2023-06-24 20:19:28,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1826592.0, ans=0.125 2023-06-24 20:19:55,948 INFO [train.py:996] (1/4) Epoch 10, batch 30000, loss[loss=0.2403, simple_loss=0.3396, pruned_loss=0.07052, over 21495.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3169, pruned_loss=0.08271, over 4277022.44 frames. ], batch size: 471, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:19:55,949 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 20:20:14,321 INFO [train.py:1028] (1/4) Epoch 10, validation: loss=0.2483, simple_loss=0.3443, pruned_loss=0.07614, over 1796401.00 frames. 2023-06-24 20:20:14,322 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 20:20:27,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-24 20:20:49,542 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:21:40,539 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 6.464e+02 9.828e+02 1.489e+03 3.469e+03, threshold=1.966e+03, percent-clipped=22.0 2023-06-24 20:22:14,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1826952.0, ans=15.0 2023-06-24 20:22:15,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1827012.0, ans=0.125 2023-06-24 20:22:16,897 INFO [train.py:996] (1/4) Epoch 10, batch 30050, loss[loss=0.2467, simple_loss=0.3583, pruned_loss=0.06756, over 20740.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3189, pruned_loss=0.07884, over 4262562.59 frames. ], batch size: 607, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:22:38,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.63 vs. 
limit=6.0 2023-06-24 20:23:28,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1827192.0, ans=0.125 2023-06-24 20:23:56,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-24 20:24:02,328 INFO [train.py:996] (1/4) Epoch 10, batch 30100, loss[loss=0.2213, simple_loss=0.2759, pruned_loss=0.08335, over 21363.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3192, pruned_loss=0.07894, over 4265090.84 frames. ], batch size: 160, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:24:04,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1827312.0, ans=0.2 2023-06-24 20:24:05,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-24 20:24:06,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1827312.0, ans=0.2 2023-06-24 20:24:16,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1827312.0, ans=0.05 2023-06-24 20:24:32,006 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-24 20:24:32,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-24 20:24:34,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=22.5 2023-06-24 20:25:20,871 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.739e+02 1.138e+03 1.850e+03 3.841e+03, threshold=2.275e+03, percent-clipped=20.0 2023-06-24 20:25:44,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1827552.0, ans=0.125 2023-06-24 20:25:54,246 INFO [train.py:996] (1/4) Epoch 10, batch 30150, loss[loss=0.2637, simple_loss=0.3301, pruned_loss=0.09863, over 21735.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3153, pruned_loss=0.08065, over 4256593.62 frames. ], batch size: 231, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:26:13,117 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-24 20:26:14,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1827672.0, ans=0.07 2023-06-24 20:26:26,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1827672.0, ans=0.125 2023-06-24 20:26:26,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827672.0, ans=0.1 2023-06-24 20:27:43,919 INFO [train.py:996] (1/4) Epoch 10, batch 30200, loss[loss=0.2231, simple_loss=0.2924, pruned_loss=0.07692, over 20052.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3173, pruned_loss=0.07981, over 4261393.40 frames. 
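
Just above, at Epoch 10, batch 30000, the log switches into "Computing validation loss", reports a validation loss over 1796401.00 frames, and prints the peak CUDA memory. A compact sketch of what such a periodic validation pass can look like is given below; compute_loss, the loader, and the per-frame normalization are placeholders rather than the training script's real interfaces.

    import logging
    import torch

    def run_validation(model, valid_loader, compute_loss, device):
        """Evaluate on the dev set and log loss per frame plus peak memory (sketch)."""
        logging.info("Computing validation loss")
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = compute_loss(model, batch)  # summed loss, frames in batch
                tot_loss += float(loss)
                tot_frames += num_frames
        model.train()
        logging.info(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
        max_mb = torch.cuda.max_memory_allocated(device) // 2**20
        logging.info(f"Maximum memory allocated so far is {max_mb}MB")
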
], batch size: 703, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:27:46,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1827912.0, ans=0.125 2023-06-24 20:29:09,512 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.062e+02 7.107e+02 1.003e+03 1.591e+03 3.654e+03, threshold=2.006e+03, percent-clipped=8.0 2023-06-24 20:29:38,187 INFO [train.py:996] (1/4) Epoch 10, batch 30250, loss[loss=0.2489, simple_loss=0.3337, pruned_loss=0.08204, over 17144.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3261, pruned_loss=0.08303, over 4263923.38 frames. ], batch size: 60, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:29:42,367 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1828212.0, ans=0.125 2023-06-24 20:30:29,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1828332.0, ans=0.05 2023-06-24 20:30:45,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1828392.0, ans=0.125 2023-06-24 20:30:48,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1828392.0, ans=0.125 2023-06-24 20:31:02,355 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-24 20:31:07,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1828452.0, ans=0.125 2023-06-24 20:31:14,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-24 20:31:24,969 INFO [train.py:996] (1/4) Epoch 10, batch 30300, loss[loss=0.2058, simple_loss=0.2729, pruned_loss=0.06934, over 21933.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3255, pruned_loss=0.08398, over 4264360.65 frames. ], batch size: 103, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:31:27,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1828512.0, ans=0.2 2023-06-24 20:32:42,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1828692.0, ans=0.125 2023-06-24 20:32:47,063 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.345e+02 9.964e+02 1.414e+03 3.448e+03, threshold=1.993e+03, percent-clipped=9.0 2023-06-24 20:33:03,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1828752.0, ans=0.1 2023-06-24 20:33:13,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1828752.0, ans=0.0 2023-06-24 20:33:21,531 INFO [train.py:996] (1/4) Epoch 10, batch 30350, loss[loss=0.2532, simple_loss=0.3368, pruned_loss=0.08479, over 21742.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3241, pruned_loss=0.08472, over 4263692.18 frames. ], batch size: 332, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:33:24,880 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.70 vs. 
limit=10.0 2023-06-24 20:33:43,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=15.0 2023-06-24 20:33:44,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1828872.0, ans=0.125 2023-06-24 20:33:58,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1828932.0, ans=0.2 2023-06-24 20:34:02,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1828932.0, ans=0.125 2023-06-24 20:34:44,640 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:34:50,287 INFO [train.py:996] (1/4) Epoch 10, batch 30400, loss[loss=0.2405, simple_loss=0.2936, pruned_loss=0.09369, over 20318.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3168, pruned_loss=0.08203, over 4245602.96 frames. ], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:35:57,901 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.475e+02 9.226e+02 1.502e+03 2.385e+03 9.827e+03, threshold=3.004e+03, percent-clipped=34.0 2023-06-24 20:36:04,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1829352.0, ans=0.125 2023-06-24 20:36:18,193 INFO [train.py:996] (1/4) Epoch 10, batch 30450, loss[loss=0.3207, simple_loss=0.4426, pruned_loss=0.09944, over 19804.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3182, pruned_loss=0.08099, over 4189198.02 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:36:18,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1829412.0, ans=0.125 2023-06-24 20:36:36,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1829472.0, ans=0.0 2023-06-24 20:36:51,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1829532.0, ans=0.04949747468305833 2023-06-24 20:37:15,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1829592.0, ans=0.125 2023-06-24 20:37:19,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1829592.0, ans=0.125 2023-06-24 20:39:22,098 INFO [train.py:996] (1/4) Epoch 11, batch 0, loss[loss=0.2216, simple_loss=0.2805, pruned_loss=0.0814, over 20734.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2805, pruned_loss=0.0814, over 20734.00 frames. ], batch size: 609, lr: 2.72e-03, grad_scale: 32.0 2023-06-24 20:39:22,099 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 20:39:38,857 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2455, simple_loss=0.3504, pruned_loss=0.07029, over 1796401.00 frames. 2023-06-24 20:39:38,858 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 20:40:09,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=12.0 2023-06-24 20:40:43,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. 
limit=15.0 2023-06-24 20:40:49,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1829856.0, ans=0.0 2023-06-24 20:40:54,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=15.0 2023-06-24 20:41:04,938 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 1.420e+03 2.190e+03 4.363e+03 1.061e+04, threshold=4.380e+03, percent-clipped=34.0 2023-06-24 20:41:20,418 INFO [train.py:996] (1/4) Epoch 11, batch 50, loss[loss=0.2694, simple_loss=0.3848, pruned_loss=0.07698, over 21646.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3222, pruned_loss=0.0839, over 966163.25 frames. ], batch size: 263, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:41:22,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1829976.0, ans=0.125 2023-06-24 20:41:25,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1829976.0, ans=0.0 2023-06-24 20:41:25,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-24 20:41:42,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-24 20:41:48,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1830036.0, ans=0.025 2023-06-24 20:42:00,589 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=15.0 2023-06-24 20:42:14,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1830096.0, ans=0.125 2023-06-24 20:42:18,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1830156.0, ans=0.0 2023-06-24 20:42:35,298 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:42:56,723 INFO [train.py:996] (1/4) Epoch 11, batch 100, loss[loss=0.2862, simple_loss=0.3581, pruned_loss=0.1071, over 21822.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3337, pruned_loss=0.08329, over 1701872.22 frames. 
], batch size: 118, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:43:46,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1830396.0, ans=0.2 2023-06-24 20:43:46,673 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1830396.0, ans=0.0 2023-06-24 20:44:22,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1830516.0, ans=0.2 2023-06-24 20:44:30,181 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.668e+02 7.799e+02 1.011e+03 1.345e+03 2.704e+03, threshold=2.023e+03, percent-clipped=0.0 2023-06-24 20:44:47,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1830516.0, ans=0.0 2023-06-24 20:44:50,622 INFO [train.py:996] (1/4) Epoch 11, batch 150, loss[loss=0.256, simple_loss=0.3562, pruned_loss=0.07793, over 21610.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3388, pruned_loss=0.08504, over 2268880.08 frames. ], batch size: 389, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:45:20,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1830636.0, ans=0.0 2023-06-24 20:46:00,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-24 20:46:01,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-24 20:46:05,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-24 20:46:15,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1830816.0, ans=0.125 2023-06-24 20:46:19,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-24 20:46:31,261 INFO [train.py:996] (1/4) Epoch 11, batch 200, loss[loss=0.2226, simple_loss=0.3192, pruned_loss=0.06298, over 21705.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3334, pruned_loss=0.0836, over 2720152.99 frames. ], batch size: 351, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:46:45,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1830876.0, ans=0.1 2023-06-24 20:47:55,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 7.230e+02 1.005e+03 1.517e+03 6.245e+03, threshold=2.009e+03, percent-clipped=15.0 2023-06-24 20:48:15,635 INFO [train.py:996] (1/4) Epoch 11, batch 250, loss[loss=0.3353, simple_loss=0.3915, pruned_loss=0.1396, over 21410.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3282, pruned_loss=0.08305, over 3069737.28 frames. 
], batch size: 471, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:48:20,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831176.0, ans=0.1 2023-06-24 20:48:35,586 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:48:47,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831236.0, ans=0.1 2023-06-24 20:50:01,294 INFO [train.py:996] (1/4) Epoch 11, batch 300, loss[loss=0.2332, simple_loss=0.2949, pruned_loss=0.08581, over 21211.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3237, pruned_loss=0.08414, over 3332668.94 frames. ], batch size: 608, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:50:19,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1831536.0, ans=0.125 2023-06-24 20:50:55,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-24 20:51:00,136 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-24 20:51:05,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-24 20:51:20,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1831716.0, ans=0.125 2023-06-24 20:51:30,486 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 7.499e+02 1.164e+03 1.692e+03 3.059e+03, threshold=2.329e+03, percent-clipped=16.0 2023-06-24 20:51:50,462 INFO [train.py:996] (1/4) Epoch 11, batch 350, loss[loss=0.2261, simple_loss=0.3031, pruned_loss=0.07457, over 21635.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3179, pruned_loss=0.08273, over 3541562.29 frames. ], batch size: 230, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:52:28,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1831836.0, ans=0.07 2023-06-24 20:52:32,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1831896.0, ans=0.125 2023-06-24 20:53:30,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1832076.0, ans=0.125 2023-06-24 20:53:31,096 INFO [train.py:996] (1/4) Epoch 11, batch 400, loss[loss=0.2909, simple_loss=0.3654, pruned_loss=0.1082, over 21505.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3115, pruned_loss=0.07969, over 3700545.41 frames. ], batch size: 508, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:53:42,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1832076.0, ans=0.0 2023-06-24 20:53:44,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.08 vs. 
limit=6.0 2023-06-24 20:53:49,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1832076.0, ans=0.2 2023-06-24 20:53:49,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1832076.0, ans=0.125 2023-06-24 20:53:52,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1832136.0, ans=0.125 2023-06-24 20:54:02,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832136.0, ans=0.125 2023-06-24 20:54:13,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1832136.0, ans=0.0 2023-06-24 20:55:11,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.125e+02 8.397e+02 1.523e+03 1.983e+03 4.862e+03, threshold=3.046e+03, percent-clipped=16.0 2023-06-24 20:55:15,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1832316.0, ans=0.125 2023-06-24 20:55:15,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1832316.0, ans=0.1 2023-06-24 20:55:18,138 INFO [train.py:996] (1/4) Epoch 11, batch 450, loss[loss=0.2236, simple_loss=0.288, pruned_loss=0.07964, over 21524.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3089, pruned_loss=0.07922, over 3830737.84 frames. ], batch size: 391, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:55:40,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1832436.0, ans=0.0 2023-06-24 20:57:05,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1832616.0, ans=0.125 2023-06-24 20:57:08,329 INFO [train.py:996] (1/4) Epoch 11, batch 500, loss[loss=0.1953, simple_loss=0.2756, pruned_loss=0.05747, over 21858.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3047, pruned_loss=0.07807, over 3935625.24 frames. ], batch size: 373, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:57:16,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1832676.0, ans=0.125 2023-06-24 20:57:54,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1832796.0, ans=0.0 2023-06-24 20:58:12,295 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1832856.0, ans=0.125 2023-06-24 20:58:32,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1832916.0, ans=0.125 2023-06-24 20:58:40,209 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 9.876e+02 1.724e+03 2.578e+03 4.436e+03, threshold=3.448e+03, percent-clipped=13.0 2023-06-24 20:58:53,054 INFO [train.py:996] (1/4) Epoch 11, batch 550, loss[loss=0.2626, simple_loss=0.3572, pruned_loss=0.08395, over 21275.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3097, pruned_loss=0.07695, over 4011137.46 frames. 
], batch size: 176, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:59:45,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1833096.0, ans=0.0 2023-06-24 21:00:04,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1833156.0, ans=0.125 2023-06-24 21:00:12,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1833156.0, ans=0.2 2023-06-24 21:00:23,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1833216.0, ans=0.125 2023-06-24 21:00:38,860 INFO [train.py:996] (1/4) Epoch 11, batch 600, loss[loss=0.2182, simple_loss=0.3132, pruned_loss=0.06164, over 21320.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3142, pruned_loss=0.07717, over 4065737.78 frames. ], batch size: 176, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:00:45,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1833276.0, ans=0.125 2023-06-24 21:00:46,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-24 21:01:22,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1833336.0, ans=0.125 2023-06-24 21:01:48,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1833456.0, ans=0.125 2023-06-24 21:02:04,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1833516.0, ans=0.1 2023-06-24 21:02:13,832 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.057e+02 1.048e+03 1.641e+03 3.624e+03, threshold=2.096e+03, percent-clipped=2.0 2023-06-24 21:02:26,674 INFO [train.py:996] (1/4) Epoch 11, batch 650, loss[loss=0.2276, simple_loss=0.2921, pruned_loss=0.08153, over 21774.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3145, pruned_loss=0.07728, over 4114495.40 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:02:36,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-24 21:03:19,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833696.0, ans=0.1 2023-06-24 21:03:52,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1833816.0, ans=0.0 2023-06-24 21:04:04,926 INFO [train.py:996] (1/4) Epoch 11, batch 700, loss[loss=0.2262, simple_loss=0.2935, pruned_loss=0.07941, over 21474.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3151, pruned_loss=0.07791, over 4159130.39 frames. 
], batch size: 194, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:04:21,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1833876.0, ans=0.125 2023-06-24 21:05:20,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1834056.0, ans=0.125 2023-06-24 21:05:39,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1834116.0, ans=0.0 2023-06-24 21:05:44,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.745e+02 9.389e+02 1.418e+03 2.158e+03 4.228e+03, threshold=2.836e+03, percent-clipped=28.0 2023-06-24 21:05:47,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1834116.0, ans=0.07 2023-06-24 21:05:51,303 INFO [train.py:996] (1/4) Epoch 11, batch 750, loss[loss=0.2371, simple_loss=0.307, pruned_loss=0.08355, over 21845.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3135, pruned_loss=0.07879, over 4184948.04 frames. ], batch size: 371, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:06:03,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.49 vs. limit=22.5 2023-06-24 21:07:04,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-24 21:07:32,585 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=15.0 2023-06-24 21:07:36,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1834416.0, ans=0.0 2023-06-24 21:07:40,978 INFO [train.py:996] (1/4) Epoch 11, batch 800, loss[loss=0.2336, simple_loss=0.2991, pruned_loss=0.08402, over 21302.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3123, pruned_loss=0.07927, over 4203732.54 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:08:19,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1834536.0, ans=0.2 2023-06-24 21:09:21,227 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 7.606e+02 1.363e+03 2.031e+03 4.976e+03, threshold=2.727e+03, percent-clipped=7.0 2023-06-24 21:09:32,190 INFO [train.py:996] (1/4) Epoch 11, batch 850, loss[loss=0.2689, simple_loss=0.3213, pruned_loss=0.1083, over 21659.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3106, pruned_loss=0.07964, over 4227013.21 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:09:58,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1834836.0, ans=0.04949747468305833 2023-06-24 21:10:09,160 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-24 21:11:11,571 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 21:11:20,181 INFO [train.py:996] (1/4) Epoch 11, batch 900, loss[loss=0.2167, simple_loss=0.3053, pruned_loss=0.06405, over 21818.00 frames. 
], tot_loss[loss=0.2325, simple_loss=0.3067, pruned_loss=0.07916, over 4239639.80 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:11:20,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1835076.0, ans=0.0 2023-06-24 21:11:22,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1835076.0, ans=0.95 2023-06-24 21:12:08,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1835196.0, ans=0.125 2023-06-24 21:12:55,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835316.0, ans=0.125 2023-06-24 21:13:04,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.581e+02 9.805e+02 1.489e+03 3.191e+03, threshold=1.961e+03, percent-clipped=4.0 2023-06-24 21:13:08,483 INFO [train.py:996] (1/4) Epoch 11, batch 950, loss[loss=0.2223, simple_loss=0.3061, pruned_loss=0.06929, over 21745.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3046, pruned_loss=0.07852, over 4252760.99 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:13:43,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1835436.0, ans=0.125 2023-06-24 21:14:10,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1835556.0, ans=0.0 2023-06-24 21:14:49,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1835616.0, ans=0.2 2023-06-24 21:14:57,568 INFO [train.py:996] (1/4) Epoch 11, batch 1000, loss[loss=0.2523, simple_loss=0.3275, pruned_loss=0.0886, over 21784.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3035, pruned_loss=0.07788, over 4266847.01 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:15:10,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1835676.0, ans=0.125 2023-06-24 21:15:12,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-24 21:15:34,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1835736.0, ans=0.125 2023-06-24 21:15:34,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1835736.0, ans=0.2 2023-06-24 21:16:12,809 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=22.5 2023-06-24 21:16:21,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835856.0, ans=0.125 2023-06-24 21:16:49,680 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 6.726e+02 9.371e+02 1.402e+03 3.411e+03, threshold=1.874e+03, percent-clipped=8.0 2023-06-24 21:16:53,256 INFO [train.py:996] (1/4) Epoch 11, batch 1050, loss[loss=0.2388, simple_loss=0.3065, pruned_loss=0.08552, over 21843.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3026, pruned_loss=0.07737, over 4276762.08 frames. 
], batch size: 118, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:16:55,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1835976.0, ans=0.125 2023-06-24 21:17:15,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1836036.0, ans=0.0 2023-06-24 21:17:59,221 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:17:59,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1836156.0, ans=0.125 2023-06-24 21:18:14,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1836156.0, ans=0.0 2023-06-24 21:18:43,417 INFO [train.py:996] (1/4) Epoch 11, batch 1100, loss[loss=0.3038, simple_loss=0.3668, pruned_loss=0.1204, over 21730.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3035, pruned_loss=0.07701, over 4285327.70 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:18:43,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1836276.0, ans=0.125 2023-06-24 21:18:45,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1836276.0, ans=0.0 2023-06-24 21:18:48,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=26.03 vs. limit=22.5 2023-06-24 21:19:08,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1836336.0, ans=0.125 2023-06-24 21:20:26,813 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.856e+02 8.203e+02 1.251e+03 2.125e+03 4.416e+03, threshold=2.502e+03, percent-clipped=31.0 2023-06-24 21:20:36,487 INFO [train.py:996] (1/4) Epoch 11, batch 1150, loss[loss=0.2258, simple_loss=0.3203, pruned_loss=0.06568, over 21636.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3044, pruned_loss=0.07649, over 4287357.75 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:21:22,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836696.0, ans=0.125 2023-06-24 21:21:41,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1836756.0, ans=0.0 2023-06-24 21:21:54,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1836756.0, ans=0.0 2023-06-24 21:22:01,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1836816.0, ans=0.125 2023-06-24 21:22:22,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1836816.0, ans=0.0 2023-06-24 21:22:25,724 INFO [train.py:996] (1/4) Epoch 11, batch 1200, loss[loss=0.2202, simple_loss=0.2984, pruned_loss=0.07097, over 21472.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3065, pruned_loss=0.07698, over 4288675.72 frames. 
], batch size: 194, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:22:36,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836876.0, ans=0.125 2023-06-24 21:23:10,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-24 21:23:28,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1837056.0, ans=0.125 2023-06-24 21:23:28,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1837056.0, ans=0.2 2023-06-24 21:23:30,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1837056.0, ans=0.1 2023-06-24 21:24:05,092 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 7.723e+02 1.059e+03 1.468e+03 2.676e+03, threshold=2.118e+03, percent-clipped=4.0 2023-06-24 21:24:14,384 INFO [train.py:996] (1/4) Epoch 11, batch 1250, loss[loss=0.2967, simple_loss=0.3608, pruned_loss=0.1163, over 21528.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3104, pruned_loss=0.07867, over 4290492.30 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:24:50,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-24 21:25:14,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1837296.0, ans=0.0 2023-06-24 21:26:04,435 INFO [train.py:996] (1/4) Epoch 11, batch 1300, loss[loss=0.211, simple_loss=0.3118, pruned_loss=0.0551, over 21714.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.311, pruned_loss=0.07799, over 4287046.67 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:27:52,127 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.439e+02 7.719e+02 9.846e+02 1.503e+03 2.792e+03, threshold=1.969e+03, percent-clipped=4.0 2023-06-24 21:27:53,901 INFO [train.py:996] (1/4) Epoch 11, batch 1350, loss[loss=0.2512, simple_loss=0.3296, pruned_loss=0.08645, over 21751.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3122, pruned_loss=0.07814, over 4289775.38 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:29:43,478 INFO [train.py:996] (1/4) Epoch 11, batch 1400, loss[loss=0.2249, simple_loss=0.2952, pruned_loss=0.07735, over 21852.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3112, pruned_loss=0.07921, over 4294095.82 frames. 
], batch size: 107, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:30:08,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1838136.0, ans=0.5 2023-06-24 21:30:12,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1838136.0, ans=0.1 2023-06-24 21:30:54,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1838256.0, ans=0.0 2023-06-24 21:31:04,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1838256.0, ans=0.0 2023-06-24 21:31:31,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-24 21:31:31,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.111e+02 8.921e+02 1.291e+03 1.879e+03 3.355e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-24 21:31:33,600 INFO [train.py:996] (1/4) Epoch 11, batch 1450, loss[loss=0.2418, simple_loss=0.3111, pruned_loss=0.08619, over 21632.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3107, pruned_loss=0.07942, over 4291255.41 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:31:37,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1838376.0, ans=0.2 2023-06-24 21:31:45,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1838376.0, ans=0.0 2023-06-24 21:33:03,739 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.38 vs. limit=22.5 2023-06-24 21:33:21,338 INFO [train.py:996] (1/4) Epoch 11, batch 1500, loss[loss=0.259, simple_loss=0.3529, pruned_loss=0.08255, over 21700.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3131, pruned_loss=0.08047, over 4296920.96 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:33:25,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1838676.0, ans=0.2 2023-06-24 21:33:29,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-24 21:33:35,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1838676.0, ans=0.0 2023-06-24 21:35:03,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1838916.0, ans=10.0 2023-06-24 21:35:08,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.229e+02 8.037e+02 1.041e+03 1.486e+03 3.371e+03, threshold=2.081e+03, percent-clipped=9.0 2023-06-24 21:35:10,380 INFO [train.py:996] (1/4) Epoch 11, batch 1550, loss[loss=0.1991, simple_loss=0.2677, pruned_loss=0.06521, over 21310.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3112, pruned_loss=0.08105, over 4299282.51 frames. 
], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:35:17,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838976.0, ans=0.125 2023-06-24 21:36:03,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1839096.0, ans=0.125 2023-06-24 21:36:48,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1839216.0, ans=0.0 2023-06-24 21:37:01,966 INFO [train.py:996] (1/4) Epoch 11, batch 1600, loss[loss=0.2795, simple_loss=0.3553, pruned_loss=0.1018, over 21830.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3103, pruned_loss=0.08066, over 4298178.48 frames. ], batch size: 372, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:37:02,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839276.0, ans=0.1 2023-06-24 21:38:12,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839396.0, ans=0.1 2023-06-24 21:38:23,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-24 21:38:59,348 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 8.023e+02 1.190e+03 1.791e+03 3.601e+03, threshold=2.379e+03, percent-clipped=18.0 2023-06-24 21:38:59,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839576.0, ans=0.1 2023-06-24 21:39:01,028 INFO [train.py:996] (1/4) Epoch 11, batch 1650, loss[loss=0.1885, simple_loss=0.2798, pruned_loss=0.04855, over 21584.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3091, pruned_loss=0.08004, over 4296442.27 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:39:05,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839576.0, ans=0.1 2023-06-24 21:39:24,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1839636.0, ans=0.0 2023-06-24 21:39:57,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1839696.0, ans=0.125 2023-06-24 21:40:46,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-24 21:40:50,807 INFO [train.py:996] (1/4) Epoch 11, batch 1700, loss[loss=0.2434, simple_loss=0.3202, pruned_loss=0.08328, over 21467.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3126, pruned_loss=0.0815, over 4285957.03 frames. 
], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:41:42,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1839936.0, ans=0.125 2023-06-24 21:41:56,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1839996.0, ans=0.2 2023-06-24 21:42:08,468 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:42:47,747 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.610e+02 1.287e+03 1.983e+03 3.488e+03, threshold=2.574e+03, percent-clipped=18.0 2023-06-24 21:42:49,514 INFO [train.py:996] (1/4) Epoch 11, batch 1750, loss[loss=0.238, simple_loss=0.3277, pruned_loss=0.07416, over 21456.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3145, pruned_loss=0.08074, over 4283813.18 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:43:26,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1840236.0, ans=0.04949747468305833 2023-06-24 21:44:50,807 INFO [train.py:996] (1/4) Epoch 11, batch 1800, loss[loss=0.2239, simple_loss=0.3148, pruned_loss=0.06653, over 21616.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3157, pruned_loss=0.07943, over 4285317.32 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:45:29,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840536.0, ans=0.1 2023-06-24 21:45:45,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1840596.0, ans=0.0 2023-06-24 21:46:16,265 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=15.0 2023-06-24 21:46:40,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.892e+02 1.019e+03 1.831e+03 4.064e+03, threshold=2.037e+03, percent-clipped=9.0 2023-06-24 21:46:48,473 INFO [train.py:996] (1/4) Epoch 11, batch 1850, loss[loss=0.234, simple_loss=0.3181, pruned_loss=0.07501, over 21804.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3141, pruned_loss=0.07618, over 4287537.36 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:46:49,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1840776.0, ans=0.0 2023-06-24 21:48:32,749 INFO [train.py:996] (1/4) Epoch 11, batch 1900, loss[loss=0.2303, simple_loss=0.2949, pruned_loss=0.08286, over 21296.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3146, pruned_loss=0.07546, over 4287131.72 frames. 
], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:48:36,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1841076.0, ans=0.125 2023-06-24 21:48:41,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841076.0, ans=0.1 2023-06-24 21:49:12,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1841196.0, ans=0.125 2023-06-24 21:49:36,658 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:49:57,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1841316.0, ans=0.0 2023-06-24 21:50:02,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-24 21:50:20,934 INFO [train.py:996] (1/4) Epoch 11, batch 1950, loss[loss=0.2099, simple_loss=0.271, pruned_loss=0.07437, over 21682.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3113, pruned_loss=0.07556, over 4276905.09 frames. ], batch size: 333, lr: 2.71e-03, grad_scale: 4.0 2023-06-24 21:50:22,727 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 9.605e+02 1.769e+03 2.616e+03 5.034e+03, threshold=3.539e+03, percent-clipped=42.0 2023-06-24 21:50:43,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1841436.0, ans=0.025 2023-06-24 21:51:03,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1841496.0, ans=0.0 2023-06-24 21:51:31,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-24 21:51:45,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-24 21:51:58,208 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1841616.0, ans=0.125 2023-06-24 21:52:05,557 INFO [train.py:996] (1/4) Epoch 11, batch 2000, loss[loss=0.2438, simple_loss=0.3382, pruned_loss=0.07474, over 19925.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3045, pruned_loss=0.07365, over 4265538.52 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:52:10,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. 
limit=15.0 2023-06-24 21:52:31,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841736.0, ans=0.1 2023-06-24 21:52:43,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1841736.0, ans=0.0 2023-06-24 21:53:16,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1841856.0, ans=0.125 2023-06-24 21:53:34,805 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:53:53,426 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:53:55,947 INFO [train.py:996] (1/4) Epoch 11, batch 2050, loss[loss=0.2098, simple_loss=0.2934, pruned_loss=0.06307, over 21353.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3057, pruned_loss=0.07322, over 4267299.63 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:53:57,629 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.961e+02 9.295e+02 1.430e+03 2.343e+03 5.111e+03, threshold=2.860e+03, percent-clipped=7.0 2023-06-24 21:54:34,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1842096.0, ans=0.0 2023-06-24 21:54:34,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1842096.0, ans=0.125 2023-06-24 21:54:43,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1842096.0, ans=0.125 2023-06-24 21:55:02,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1842156.0, ans=0.0 2023-06-24 21:55:05,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1842156.0, ans=0.07 2023-06-24 21:55:25,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1842216.0, ans=0.125 2023-06-24 21:55:47,684 INFO [train.py:996] (1/4) Epoch 11, batch 2100, loss[loss=0.2928, simple_loss=0.3458, pruned_loss=0.1199, over 21728.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3088, pruned_loss=0.07566, over 4268923.00 frames. ], batch size: 507, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:56:01,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1842276.0, ans=0.0 2023-06-24 21:56:30,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.30 vs. limit=15.0 2023-06-24 21:56:44,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.14 vs. 
limit=10.0 2023-06-24 21:56:58,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1842456.0, ans=0.0 2023-06-24 21:57:09,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1842456.0, ans=0.125 2023-06-24 21:57:37,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1842576.0, ans=0.0 2023-06-24 21:57:38,447 INFO [train.py:996] (1/4) Epoch 11, batch 2150, loss[loss=0.2089, simple_loss=0.2721, pruned_loss=0.07289, over 21473.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.311, pruned_loss=0.07712, over 4265295.92 frames. ], batch size: 195, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:57:39,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.865e+02 8.663e+02 1.127e+03 1.659e+03 3.855e+03, threshold=2.253e+03, percent-clipped=2.0 2023-06-24 21:58:08,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1842636.0, ans=0.125 2023-06-24 21:58:36,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1842696.0, ans=0.125 2023-06-24 21:58:53,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1842756.0, ans=0.125 2023-06-24 21:59:18,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1842816.0, ans=0.0 2023-06-24 21:59:26,335 INFO [train.py:996] (1/4) Epoch 11, batch 2200, loss[loss=0.2933, simple_loss=0.3479, pruned_loss=0.1193, over 21672.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3132, pruned_loss=0.07809, over 4275325.48 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:00:31,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1842996.0, ans=0.125 2023-06-24 22:00:31,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1842996.0, ans=0.125 2023-06-24 22:00:40,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1843056.0, ans=0.0 2023-06-24 22:00:41,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1843056.0, ans=0.0 2023-06-24 22:01:14,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1843116.0, ans=0.125 2023-06-24 22:01:16,571 INFO [train.py:996] (1/4) Epoch 11, batch 2250, loss[loss=0.2218, simple_loss=0.3083, pruned_loss=0.06759, over 21681.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3112, pruned_loss=0.07642, over 4277056.92 frames. 
], batch size: 263, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:01:18,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.027e+02 1.396e+03 1.956e+03 3.592e+03, threshold=2.793e+03, percent-clipped=17.0 2023-06-24 22:01:51,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1843236.0, ans=0.2 2023-06-24 22:02:44,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1843356.0, ans=0.0 2023-06-24 22:03:04,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1843476.0, ans=0.125 2023-06-24 22:03:05,570 INFO [train.py:996] (1/4) Epoch 11, batch 2300, loss[loss=0.2036, simple_loss=0.2665, pruned_loss=0.07035, over 21507.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3059, pruned_loss=0.07459, over 4280194.47 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:01,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1843596.0, ans=0.125 2023-06-24 22:04:27,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1843656.0, ans=0.125 2023-06-24 22:04:57,282 INFO [train.py:996] (1/4) Epoch 11, batch 2350, loss[loss=0.2222, simple_loss=0.2787, pruned_loss=0.08279, over 21559.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.302, pruned_loss=0.07461, over 4276481.02 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:58,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 8.353e+02 1.301e+03 1.765e+03 5.491e+03, threshold=2.603e+03, percent-clipped=6.0 2023-06-24 22:05:23,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1843836.0, ans=0.125 2023-06-24 22:05:36,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-24 22:05:37,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1843836.0, ans=0.05 2023-06-24 22:05:57,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1843896.0, ans=0.0 2023-06-24 22:06:17,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1843956.0, ans=10.0 2023-06-24 22:06:37,400 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-24 22:06:47,704 INFO [train.py:996] (1/4) Epoch 11, batch 2400, loss[loss=0.2458, simple_loss=0.31, pruned_loss=0.09084, over 21640.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3012, pruned_loss=0.07659, over 4271471.97 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:07:24,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1844136.0, ans=0.0 2023-06-24 22:07:56,492 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. 
limit=15.0 2023-06-24 22:08:28,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-24 22:08:29,475 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:08:44,059 INFO [train.py:996] (1/4) Epoch 11, batch 2450, loss[loss=0.2147, simple_loss=0.2866, pruned_loss=0.07142, over 21742.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3031, pruned_loss=0.07832, over 4268621.59 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:08:45,762 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.999e+02 9.036e+02 1.390e+03 1.907e+03 3.347e+03, threshold=2.779e+03, percent-clipped=7.0 2023-06-24 22:09:12,871 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1844436.0, ans=0.2 2023-06-24 22:10:05,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1844556.0, ans=0.125 2023-06-24 22:10:24,542 INFO [train.py:996] (1/4) Epoch 11, batch 2500, loss[loss=0.1988, simple_loss=0.2635, pruned_loss=0.06701, over 21337.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3024, pruned_loss=0.07855, over 4276119.22 frames. ], batch size: 160, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:11:45,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844856.0, ans=0.0 2023-06-24 22:12:21,382 INFO [train.py:996] (1/4) Epoch 11, batch 2550, loss[loss=0.2406, simple_loss=0.3098, pruned_loss=0.08563, over 15913.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3045, pruned_loss=0.07787, over 4260279.79 frames. ], batch size: 64, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:12:22,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 8.761e+02 1.237e+03 1.691e+03 3.223e+03, threshold=2.475e+03, percent-clipped=6.0 2023-06-24 22:12:49,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1845036.0, ans=0.125 2023-06-24 22:13:22,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1845096.0, ans=6.0 2023-06-24 22:13:26,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1845096.0, ans=0.125 2023-06-24 22:13:30,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1845156.0, ans=0.04949747468305833 2023-06-24 22:13:32,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-24 22:14:11,204 INFO [train.py:996] (1/4) Epoch 11, batch 2600, loss[loss=0.2853, simple_loss=0.3444, pruned_loss=0.1131, over 21332.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3091, pruned_loss=0.0799, over 4258663.08 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:14:19,495 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.27 vs. 
limit=15.0 2023-06-24 22:14:59,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845396.0, ans=0.125 2023-06-24 22:16:00,063 INFO [train.py:996] (1/4) Epoch 11, batch 2650, loss[loss=0.2537, simple_loss=0.3218, pruned_loss=0.09277, over 21380.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3111, pruned_loss=0.08174, over 4269707.43 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:16:01,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.066e+03 1.667e+03 2.223e+03 5.089e+03, threshold=3.334e+03, percent-clipped=18.0 2023-06-24 22:16:08,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-24 22:17:03,608 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-24 22:17:07,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1845756.0, ans=0.035 2023-06-24 22:17:17,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1845756.0, ans=0.125 2023-06-24 22:17:38,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845816.0, ans=0.1 2023-06-24 22:17:46,146 INFO [train.py:996] (1/4) Epoch 11, batch 2700, loss[loss=0.2235, simple_loss=0.2991, pruned_loss=0.07389, over 21739.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3093, pruned_loss=0.08144, over 4267481.03 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:17:51,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=8.0 2023-06-24 22:18:11,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1845936.0, ans=0.2 2023-06-24 22:19:36,547 INFO [train.py:996] (1/4) Epoch 11, batch 2750, loss[loss=0.2649, simple_loss=0.3469, pruned_loss=0.09146, over 21607.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3103, pruned_loss=0.08203, over 4265577.58 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:19:38,341 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.314e+02 7.487e+02 1.146e+03 1.660e+03 3.901e+03, threshold=2.292e+03, percent-clipped=2.0 2023-06-24 22:20:21,176 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:20:23,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.67 vs. limit=8.0 2023-06-24 22:21:13,596 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:21:18,829 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1846476.0, ans=0.0 2023-06-24 22:21:19,903 INFO [train.py:996] (1/4) Epoch 11, batch 2800, loss[loss=0.2604, simple_loss=0.3484, pruned_loss=0.08621, over 21751.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3124, pruned_loss=0.08144, over 4273798.24 frames. 
], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:21:52,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.58 vs. limit=6.0 2023-06-24 22:23:06,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1846716.0, ans=0.125 2023-06-24 22:23:10,958 INFO [train.py:996] (1/4) Epoch 11, batch 2850, loss[loss=0.2502, simple_loss=0.339, pruned_loss=0.08071, over 21654.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3158, pruned_loss=0.08311, over 4275247.81 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:23:11,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1846776.0, ans=0.04949747468305833 2023-06-24 22:23:13,660 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.24 vs. limit=15.0 2023-06-24 22:23:19,733 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.356e+02 9.385e+02 1.588e+03 2.448e+03 5.122e+03, threshold=3.175e+03, percent-clipped=28.0 2023-06-24 22:24:59,824 INFO [train.py:996] (1/4) Epoch 11, batch 2900, loss[loss=0.1684, simple_loss=0.2306, pruned_loss=0.05314, over 21277.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.312, pruned_loss=0.08241, over 4278309.36 frames. ], batch size: 176, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:25:10,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1847076.0, ans=0.09899494936611666 2023-06-24 22:25:10,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1847076.0, ans=0.125 2023-06-24 22:25:47,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847196.0, ans=0.1 2023-06-24 22:25:58,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1847196.0, ans=0.0 2023-06-24 22:26:19,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1847256.0, ans=0.125 2023-06-24 22:26:26,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1847316.0, ans=0.0 2023-06-24 22:26:48,166 INFO [train.py:996] (1/4) Epoch 11, batch 2950, loss[loss=0.2504, simple_loss=0.3251, pruned_loss=0.08786, over 21557.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3131, pruned_loss=0.08287, over 4282279.26 frames. 
], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:26:51,473 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.849e+02 1.003e+03 1.596e+03 3.041e+03, threshold=2.006e+03, percent-clipped=1.0 2023-06-24 22:27:26,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1847436.0, ans=0.125 2023-06-24 22:27:55,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1847496.0, ans=0.125 2023-06-24 22:28:10,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1847556.0, ans=0.125 2023-06-24 22:28:12,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1847556.0, ans=0.0 2023-06-24 22:28:38,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1847676.0, ans=0.125 2023-06-24 22:28:39,719 INFO [train.py:996] (1/4) Epoch 11, batch 3000, loss[loss=0.2367, simple_loss=0.3146, pruned_loss=0.07937, over 21835.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3152, pruned_loss=0.08225, over 4285901.64 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:28:39,720 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-24 22:29:02,929 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2533, simple_loss=0.3467, pruned_loss=0.07995, over 1796401.00 frames. 2023-06-24 22:29:02,930 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-24 22:29:24,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1847736.0, ans=0.02 2023-06-24 22:29:47,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 22:29:49,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. limit=10.0 2023-06-24 22:30:50,711 INFO [train.py:996] (1/4) Epoch 11, batch 3050, loss[loss=0.1543, simple_loss=0.2244, pruned_loss=0.04208, over 16921.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3157, pruned_loss=0.08095, over 4284815.78 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:30:56,012 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 9.249e+02 1.451e+03 2.091e+03 4.098e+03, threshold=2.902e+03, percent-clipped=32.0 2023-06-24 22:31:45,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848096.0, ans=0.1 2023-06-24 22:32:37,706 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-24 22:32:40,010 INFO [train.py:996] (1/4) Epoch 11, batch 3100, loss[loss=0.2141, simple_loss=0.3018, pruned_loss=0.06324, over 21465.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3148, pruned_loss=0.07992, over 4287275.17 frames. 
], batch size: 211, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:32:47,526 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1848276.0, ans=0.1 2023-06-24 22:33:12,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1848336.0, ans=0.0 2023-06-24 22:33:20,252 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-24 22:33:53,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1848456.0, ans=0.125 2023-06-24 22:33:55,092 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1848456.0, ans=0.0 2023-06-24 22:34:05,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1848456.0, ans=0.04949747468305833 2023-06-24 22:34:10,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-24 22:34:17,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1848516.0, ans=0.125 2023-06-24 22:34:30,837 INFO [train.py:996] (1/4) Epoch 11, batch 3150, loss[loss=0.2296, simple_loss=0.3062, pruned_loss=0.07653, over 21682.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3156, pruned_loss=0.08027, over 4283571.54 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:34:40,513 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:34:41,516 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.651e+02 8.004e+02 1.417e+03 1.894e+03 2.816e+03, threshold=2.834e+03, percent-clipped=0.0 2023-06-24 22:34:45,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1848576.0, ans=0.2 2023-06-24 22:35:00,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1848636.0, ans=0.125 2023-06-24 22:35:34,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1848696.0, ans=0.035 2023-06-24 22:36:27,232 INFO [train.py:996] (1/4) Epoch 11, batch 3200, loss[loss=0.3118, simple_loss=0.3889, pruned_loss=0.1173, over 21452.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3195, pruned_loss=0.08145, over 4278813.67 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:36:28,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1848876.0, ans=0.07 2023-06-24 22:36:57,771 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-24 22:37:15,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1848996.0, ans=0.125 2023-06-24 22:37:23,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. 
limit=15.0 2023-06-24 22:38:00,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1849116.0, ans=0.125 2023-06-24 22:38:00,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1849116.0, ans=0.125 2023-06-24 22:38:03,786 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1849116.0, ans=0.0 2023-06-24 22:38:10,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1849116.0, ans=0.0 2023-06-24 22:38:15,097 INFO [train.py:996] (1/4) Epoch 11, batch 3250, loss[loss=0.2688, simple_loss=0.3313, pruned_loss=0.1032, over 21439.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.319, pruned_loss=0.08134, over 4277387.43 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:38:20,060 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 9.282e+02 1.304e+03 1.953e+03 5.530e+03, threshold=2.608e+03, percent-clipped=11.0 2023-06-24 22:40:04,424 INFO [train.py:996] (1/4) Epoch 11, batch 3300, loss[loss=0.215, simple_loss=0.2901, pruned_loss=0.06992, over 21790.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3169, pruned_loss=0.08198, over 4274886.58 frames. ], batch size: 372, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:40:23,349 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-24 22:40:34,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1849536.0, ans=0.05 2023-06-24 22:40:56,066 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-24 22:41:05,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1849596.0, ans=0.0 2023-06-24 22:41:22,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1849656.0, ans=0.125 2023-06-24 22:41:24,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849656.0, ans=0.1 2023-06-24 22:41:54,793 INFO [train.py:996] (1/4) Epoch 11, batch 3350, loss[loss=0.1951, simple_loss=0.2547, pruned_loss=0.06776, over 20031.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3189, pruned_loss=0.08196, over 4281980.20 frames. 
], batch size: 703, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:42:01,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.009e+02 8.020e+02 1.184e+03 1.979e+03 5.260e+03, threshold=2.368e+03, percent-clipped=15.0 2023-06-24 22:42:09,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1849776.0, ans=0.0 2023-06-24 22:42:12,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1849776.0, ans=0.125 2023-06-24 22:43:08,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1849896.0, ans=0.125 2023-06-24 22:43:34,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1850016.0, ans=0.0 2023-06-24 22:43:35,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-24 22:43:36,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1850016.0, ans=0.0 2023-06-24 22:43:50,821 INFO [train.py:996] (1/4) Epoch 11, batch 3400, loss[loss=0.2298, simple_loss=0.3096, pruned_loss=0.075, over 21688.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3201, pruned_loss=0.08315, over 4280298.31 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:44:52,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1850196.0, ans=0.2 2023-06-24 22:45:01,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1850256.0, ans=0.125 2023-06-24 22:45:13,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.34 vs. limit=15.0 2023-06-24 22:45:40,247 INFO [train.py:996] (1/4) Epoch 11, batch 3450, loss[loss=0.2207, simple_loss=0.2797, pruned_loss=0.08086, over 21180.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3142, pruned_loss=0.08154, over 4279128.33 frames. ], batch size: 176, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:45:52,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.862e+02 7.621e+02 1.155e+03 1.643e+03 3.444e+03, threshold=2.310e+03, percent-clipped=7.0 2023-06-24 22:45:53,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1850376.0, ans=0.0 2023-06-24 22:45:57,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=10.0 2023-06-24 22:46:15,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1850436.0, ans=0.125 2023-06-24 22:46:34,232 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1850496.0, ans=0.0 2023-06-24 22:46:36,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1850496.0, ans=0.2 2023-06-24 22:47:21,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1850616.0, ans=0.125 2023-06-24 22:47:35,913 INFO [train.py:996] (1/4) Epoch 11, batch 3500, loss[loss=0.299, simple_loss=0.3749, pruned_loss=0.1116, over 21571.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3239, pruned_loss=0.08591, over 4280580.52 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:48:12,900 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1850736.0, ans=0.125 2023-06-24 22:48:18,149 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-24 22:48:51,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1850856.0, ans=0.0 2023-06-24 22:49:06,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1850916.0, ans=0.125 2023-06-24 22:49:22,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1850916.0, ans=0.125 2023-06-24 22:49:22,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1850916.0, ans=0.0 2023-06-24 22:49:32,548 INFO [train.py:996] (1/4) Epoch 11, batch 3550, loss[loss=0.2306, simple_loss=0.3085, pruned_loss=0.07638, over 21326.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3271, pruned_loss=0.08839, over 4285019.21 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:49:39,365 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.955e+02 9.872e+02 1.548e+03 2.414e+03 6.693e+03, threshold=3.097e+03, percent-clipped=26.0 2023-06-24 22:49:56,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1851036.0, ans=0.2 2023-06-24 22:50:14,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1851096.0, ans=0.125 2023-06-24 22:50:27,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1851156.0, ans=0.0 2023-06-24 22:51:22,534 INFO [train.py:996] (1/4) Epoch 11, batch 3600, loss[loss=0.2826, simple_loss=0.3399, pruned_loss=0.1127, over 21273.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3203, pruned_loss=0.08742, over 4268783.52 frames. 
], batch size: 143, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:51:56,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1851336.0, ans=0.125 2023-06-24 22:53:13,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1851576.0, ans=0.2 2023-06-24 22:53:14,652 INFO [train.py:996] (1/4) Epoch 11, batch 3650, loss[loss=0.2123, simple_loss=0.3038, pruned_loss=0.06041, over 21686.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3199, pruned_loss=0.08789, over 4267509.96 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:53:21,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.978e+02 1.076e+03 1.568e+03 3.181e+03, threshold=2.152e+03, percent-clipped=1.0 2023-06-24 22:53:58,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1851696.0, ans=0.125 2023-06-24 22:55:01,513 INFO [train.py:996] (1/4) Epoch 11, batch 3700, loss[loss=0.2397, simple_loss=0.3137, pruned_loss=0.08284, over 21915.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3175, pruned_loss=0.0857, over 4278648.88 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:55:33,183 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=12.0 2023-06-24 22:55:37,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1851936.0, ans=0.0 2023-06-24 22:56:55,543 INFO [train.py:996] (1/4) Epoch 11, batch 3750, loss[loss=0.2092, simple_loss=0.2791, pruned_loss=0.06967, over 21484.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3161, pruned_loss=0.08539, over 4281264.32 frames. ], batch size: 212, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:57:02,982 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 7.373e+02 1.096e+03 1.771e+03 3.259e+03, threshold=2.192e+03, percent-clipped=16.0 2023-06-24 22:57:30,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1852296.0, ans=0.0 2023-06-24 22:57:47,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1852296.0, ans=0.125 2023-06-24 22:57:54,465 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-24 22:58:10,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1852356.0, ans=0.0 2023-06-24 22:58:24,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-24 22:58:39,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1852416.0, ans=0.125 2023-06-24 22:58:44,883 INFO [train.py:996] (1/4) Epoch 11, batch 3800, loss[loss=0.2736, simple_loss=0.3551, pruned_loss=0.0961, over 21521.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3123, pruned_loss=0.08385, over 4272024.66 frames. 
], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:58:45,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1852476.0, ans=0.125 2023-06-24 22:58:45,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1852476.0, ans=0.125 2023-06-24 22:59:08,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1852536.0, ans=0.0 2023-06-24 22:59:14,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852536.0, ans=0.1 2023-06-24 22:59:28,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1852596.0, ans=0.125 2023-06-24 22:59:46,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1852656.0, ans=0.035 2023-06-24 23:00:11,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1852716.0, ans=0.2 2023-06-24 23:00:12,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-24 23:00:32,158 INFO [train.py:996] (1/4) Epoch 11, batch 3850, loss[loss=0.2551, simple_loss=0.3799, pruned_loss=0.06512, over 20838.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3116, pruned_loss=0.08363, over 4272206.06 frames. ], batch size: 607, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:00:39,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.734e+02 8.310e+02 1.331e+03 1.906e+03 3.711e+03, threshold=2.662e+03, percent-clipped=19.0 2023-06-24 23:00:44,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1852776.0, ans=0.1 2023-06-24 23:01:17,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852896.0, ans=0.1 2023-06-24 23:01:38,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1852956.0, ans=0.0 2023-06-24 23:01:48,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1852956.0, ans=0.125 2023-06-24 23:02:16,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1853016.0, ans=0.125 2023-06-24 23:02:19,632 INFO [train.py:996] (1/4) Epoch 11, batch 3900, loss[loss=0.2296, simple_loss=0.2995, pruned_loss=0.07983, over 21894.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3066, pruned_loss=0.08256, over 4264199.09 frames. ], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:02:21,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1853076.0, ans=0.125 2023-06-24 23:02:25,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=22.5 2023-06-24 23:02:45,817 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.62 vs. 
limit=12.0 2023-06-24 23:03:25,139 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1853256.0, ans=0.125 2023-06-24 23:04:11,745 INFO [train.py:996] (1/4) Epoch 11, batch 3950, loss[loss=0.2018, simple_loss=0.2607, pruned_loss=0.07146, over 20620.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3076, pruned_loss=0.08133, over 4270982.47 frames. ], batch size: 607, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:04:18,264 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 6.470e+02 9.111e+02 1.353e+03 4.725e+03, threshold=1.822e+03, percent-clipped=4.0 2023-06-24 23:05:12,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1853496.0, ans=10.0 2023-06-24 23:05:32,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1853556.0, ans=0.125 2023-06-24 23:05:59,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-24 23:06:01,593 INFO [train.py:996] (1/4) Epoch 11, batch 4000, loss[loss=0.2354, simple_loss=0.3037, pruned_loss=0.08355, over 21619.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3038, pruned_loss=0.07834, over 4270700.41 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:06:29,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1853736.0, ans=0.125 2023-06-24 23:06:41,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1853796.0, ans=0.125 2023-06-24 23:06:42,789 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1853796.0, ans=0.125 2023-06-24 23:07:49,324 INFO [train.py:996] (1/4) Epoch 11, batch 4050, loss[loss=0.1938, simple_loss=0.2601, pruned_loss=0.06371, over 20752.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3031, pruned_loss=0.07602, over 4271322.05 frames. ], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:07:54,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1853976.0, ans=0.125 2023-06-24 23:07:57,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.864e+02 8.238e+02 1.474e+03 2.566e+03 6.233e+03, threshold=2.948e+03, percent-clipped=38.0 2023-06-24 23:08:14,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-24 23:08:15,275 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:08:42,346 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-24 23:09:37,381 INFO [train.py:996] (1/4) Epoch 11, batch 4100, loss[loss=0.2289, simple_loss=0.3024, pruned_loss=0.07773, over 21901.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3039, pruned_loss=0.07644, over 4277346.52 frames. 
], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:10:08,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1854336.0, ans=0.0 2023-06-24 23:10:57,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1854456.0, ans=0.0 2023-06-24 23:10:59,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1854456.0, ans=0.025 2023-06-24 23:11:13,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-24 23:11:14,916 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1854516.0, ans=0.125 2023-06-24 23:11:20,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1854516.0, ans=0.125 2023-06-24 23:11:27,866 INFO [train.py:996] (1/4) Epoch 11, batch 4150, loss[loss=0.254, simple_loss=0.3275, pruned_loss=0.09022, over 21723.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3036, pruned_loss=0.07404, over 4276162.17 frames. ], batch size: 333, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:11:39,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1854576.0, ans=0.125 2023-06-24 23:11:44,224 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.391e+02 9.658e+02 1.367e+03 3.515e+03, threshold=1.932e+03, percent-clipped=2.0 2023-06-24 23:12:36,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-24 23:12:49,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1854756.0, ans=0.125 2023-06-24 23:12:56,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1854756.0, ans=0.0 2023-06-24 23:13:00,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1854756.0, ans=0.2 2023-06-24 23:13:12,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1854816.0, ans=0.1 2023-06-24 23:13:27,408 INFO [train.py:996] (1/4) Epoch 11, batch 4200, loss[loss=0.1974, simple_loss=0.2791, pruned_loss=0.05786, over 21668.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3029, pruned_loss=0.07346, over 4255304.42 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:14:59,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1855116.0, ans=0.05 2023-06-24 23:15:24,112 INFO [train.py:996] (1/4) Epoch 11, batch 4250, loss[loss=0.3343, simple_loss=0.4207, pruned_loss=0.1239, over 21429.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.312, pruned_loss=0.07707, over 4254798.95 frames. 
], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:15:32,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 8.743e+02 1.334e+03 2.102e+03 4.812e+03, threshold=2.669e+03, percent-clipped=26.0 2023-06-24 23:15:43,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1855176.0, ans=0.125 2023-06-24 23:16:22,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1855296.0, ans=0.0 2023-06-24 23:16:27,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1855356.0, ans=0.125 2023-06-24 23:16:31,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1855356.0, ans=0.025 2023-06-24 23:17:15,284 INFO [train.py:996] (1/4) Epoch 11, batch 4300, loss[loss=0.1645, simple_loss=0.2156, pruned_loss=0.05669, over 16892.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3162, pruned_loss=0.07828, over 4258827.53 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:17:48,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1855536.0, ans=0.125 2023-06-24 23:17:50,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1855536.0, ans=0.125 2023-06-24 23:17:52,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855536.0, ans=0.1 2023-06-24 23:18:32,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1855656.0, ans=0.125 2023-06-24 23:18:42,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.59 vs. limit=10.0 2023-06-24 23:19:05,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1855716.0, ans=0.125 2023-06-24 23:19:09,586 INFO [train.py:996] (1/4) Epoch 11, batch 4350, loss[loss=0.2304, simple_loss=0.2952, pruned_loss=0.08285, over 21902.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3146, pruned_loss=0.07739, over 4257565.32 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 23:19:25,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 8.181e+02 1.022e+03 1.627e+03 5.028e+03, threshold=2.045e+03, percent-clipped=6.0 2023-06-24 23:19:25,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1855776.0, ans=0.125 2023-06-24 23:19:28,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=22.5 2023-06-24 23:19:34,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1855836.0, ans=0.125 2023-06-24 23:19:35,678 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:19:38,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. 
limit=22.5 2023-06-24 23:19:44,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1855836.0, ans=0.09899494936611666 2023-06-24 23:20:09,617 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:21:02,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1856016.0, ans=0.2 2023-06-24 23:21:06,803 INFO [train.py:996] (1/4) Epoch 11, batch 4400, loss[loss=0.2052, simple_loss=0.2852, pruned_loss=0.06263, over 21143.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3113, pruned_loss=0.07649, over 4253661.97 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:21:15,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1856076.0, ans=0.0 2023-06-24 23:21:17,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1856076.0, ans=0.125 2023-06-24 23:22:42,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1856316.0, ans=0.1 2023-06-24 23:22:57,259 INFO [train.py:996] (1/4) Epoch 11, batch 4450, loss[loss=0.2702, simple_loss=0.3716, pruned_loss=0.08443, over 21877.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3196, pruned_loss=0.07804, over 4259931.20 frames. ], batch size: 317, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:23:00,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1856376.0, ans=0.125 2023-06-24 23:23:07,960 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 9.671e+02 1.476e+03 2.549e+03 6.148e+03, threshold=2.952e+03, percent-clipped=35.0 2023-06-24 23:23:32,559 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-24 23:23:57,313 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-24 23:24:33,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1856616.0, ans=0.2 2023-06-24 23:24:36,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1856616.0, ans=0.125 2023-06-24 23:24:36,871 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-24 23:24:47,363 INFO [train.py:996] (1/4) Epoch 11, batch 4500, loss[loss=0.2487, simple_loss=0.3405, pruned_loss=0.07848, over 21857.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3203, pruned_loss=0.07931, over 4272026.52 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:24:49,914 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. 
limit=6.0 2023-06-24 23:24:54,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1856676.0, ans=0.125 2023-06-24 23:26:07,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1856856.0, ans=0.125 2023-06-24 23:26:13,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1856856.0, ans=0.0 2023-06-24 23:26:34,504 INFO [train.py:996] (1/4) Epoch 11, batch 4550, loss[loss=0.2857, simple_loss=0.3609, pruned_loss=0.1053, over 21800.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3211, pruned_loss=0.07908, over 4276601.68 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:26:44,488 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 1.037e+03 1.526e+03 2.248e+03 5.276e+03, threshold=3.053e+03, percent-clipped=11.0 2023-06-24 23:27:16,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1857036.0, ans=0.0 2023-06-24 23:27:32,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1857096.0, ans=0.0 2023-06-24 23:27:50,810 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:27:58,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1857156.0, ans=0.125 2023-06-24 23:28:23,411 INFO [train.py:996] (1/4) Epoch 11, batch 4600, loss[loss=0.2252, simple_loss=0.3053, pruned_loss=0.07261, over 21824.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.323, pruned_loss=0.08021, over 4280561.95 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:29:24,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-24 23:29:54,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-24 23:30:02,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1857516.0, ans=0.1 2023-06-24 23:30:12,146 INFO [train.py:996] (1/4) Epoch 11, batch 4650, loss[loss=0.1685, simple_loss=0.2399, pruned_loss=0.04854, over 21578.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3174, pruned_loss=0.07949, over 4287084.39 frames. ], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:30:17,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1857576.0, ans=0.0 2023-06-24 23:30:29,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.698e+02 1.029e+03 1.673e+03 3.855e+03, threshold=2.058e+03, percent-clipped=3.0 2023-06-24 23:30:35,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1857636.0, ans=0.0 2023-06-24 23:30:49,643 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. 
limit=15.0 2023-06-24 23:31:35,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1857756.0, ans=0.2 2023-06-24 23:32:07,114 INFO [train.py:996] (1/4) Epoch 11, batch 4700, loss[loss=0.2201, simple_loss=0.2787, pruned_loss=0.08073, over 21400.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3122, pruned_loss=0.07799, over 4274011.54 frames. ], batch size: 473, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:33:27,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1858056.0, ans=0.125 2023-06-24 23:33:48,526 INFO [train.py:996] (1/4) Epoch 11, batch 4750, loss[loss=0.2414, simple_loss=0.3054, pruned_loss=0.08871, over 21561.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3062, pruned_loss=0.07774, over 4283524.53 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:34:05,964 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.320e+02 1.239e+03 2.079e+03 4.364e+03, threshold=2.479e+03, percent-clipped=25.0 2023-06-24 23:35:13,856 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-24 23:35:42,520 INFO [train.py:996] (1/4) Epoch 11, batch 4800, loss[loss=0.2046, simple_loss=0.2968, pruned_loss=0.05623, over 21787.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3068, pruned_loss=0.07951, over 4280228.21 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:36:29,201 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:36:51,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858656.0, ans=0.1 2023-06-24 23:36:56,759 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1858656.0, ans=0.125 2023-06-24 23:36:58,445 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:37:05,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1858656.0, ans=0.0 2023-06-24 23:37:23,169 INFO [train.py:996] (1/4) Epoch 11, batch 4850, loss[loss=0.2181, simple_loss=0.2941, pruned_loss=0.07108, over 21625.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3063, pruned_loss=0.07864, over 4273463.30 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:37:41,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 1.130e+03 1.666e+03 2.337e+03 4.462e+03, threshold=3.333e+03, percent-clipped=23.0 2023-06-24 23:37:52,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.24 vs. limit=10.0 2023-06-24 23:37:57,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.96 vs. 
limit=10.0 2023-06-24 23:38:43,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1858956.0, ans=0.125 2023-06-24 23:38:48,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1858956.0, ans=0.0 2023-06-24 23:39:15,328 INFO [train.py:996] (1/4) Epoch 11, batch 4900, loss[loss=0.2506, simple_loss=0.3182, pruned_loss=0.0915, over 21860.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3068, pruned_loss=0.07895, over 4279248.85 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:39:55,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1859136.0, ans=0.125 2023-06-24 23:40:30,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-24 23:40:35,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-24 23:40:55,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1859316.0, ans=0.125 2023-06-24 23:41:05,223 INFO [train.py:996] (1/4) Epoch 11, batch 4950, loss[loss=0.1882, simple_loss=0.2903, pruned_loss=0.043, over 21765.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3098, pruned_loss=0.07708, over 4278223.20 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:41:08,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-24 23:41:23,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.980e+02 1.108e+03 1.676e+03 3.345e+03, threshold=2.216e+03, percent-clipped=1.0 2023-06-24 23:41:25,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1859376.0, ans=0.0 2023-06-24 23:42:23,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1859556.0, ans=0.125 2023-06-24 23:42:53,741 INFO [train.py:996] (1/4) Epoch 11, batch 5000, loss[loss=0.2476, simple_loss=0.3185, pruned_loss=0.08838, over 21845.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3078, pruned_loss=0.07417, over 4274028.33 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:42:54,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1859676.0, ans=0.125 2023-06-24 23:43:17,405 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. limit=10.0 2023-06-24 23:43:55,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1859796.0, ans=0.2 2023-06-24 23:44:16,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-24 23:44:39,767 INFO [train.py:996] (1/4) Epoch 11, batch 5050, loss[loss=0.276, simple_loss=0.3439, pruned_loss=0.1041, over 21835.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3081, pruned_loss=0.07551, over 4284457.74 frames. 
], batch size: 118, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:44:42,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1859976.0, ans=0.0 2023-06-24 23:44:57,991 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.469e+02 1.066e+03 1.616e+03 3.471e+03, threshold=2.133e+03, percent-clipped=8.0 2023-06-24 23:45:26,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860096.0, ans=0.1 2023-06-24 23:45:38,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1860096.0, ans=0.125 2023-06-24 23:45:50,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860156.0, ans=0.1 2023-06-24 23:45:55,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1860156.0, ans=0.125 2023-06-24 23:46:08,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1860216.0, ans=0.2 2023-06-24 23:46:12,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1860216.0, ans=0.125 2023-06-24 23:46:26,246 INFO [train.py:996] (1/4) Epoch 11, batch 5100, loss[loss=0.2093, simple_loss=0.2758, pruned_loss=0.07144, over 16986.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3074, pruned_loss=0.07669, over 4283756.12 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:46:34,393 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.62 vs. limit=22.5 2023-06-24 23:46:48,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1860336.0, ans=0.0 2023-06-24 23:48:21,816 INFO [train.py:996] (1/4) Epoch 11, batch 5150, loss[loss=0.2612, simple_loss=0.3172, pruned_loss=0.1026, over 21287.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3059, pruned_loss=0.07813, over 4286302.67 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:48:34,362 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 7.764e+02 1.031e+03 1.609e+03 3.475e+03, threshold=2.061e+03, percent-clipped=12.0 2023-06-24 23:49:38,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1860756.0, ans=0.0 2023-06-24 23:49:40,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1860816.0, ans=0.035 2023-06-24 23:50:11,474 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 23:50:11,786 INFO [train.py:996] (1/4) Epoch 11, batch 5200, loss[loss=0.2303, simple_loss=0.3234, pruned_loss=0.06862, over 21590.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3071, pruned_loss=0.07824, over 4275362.62 frames. 
], batch size: 230, lr: 2.69e-03, grad_scale: 32.0 2023-06-24 23:50:46,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1860936.0, ans=0.0 2023-06-24 23:51:10,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1860996.0, ans=0.125 2023-06-24 23:51:18,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1861056.0, ans=0.04949747468305833 2023-06-24 23:51:54,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-24 23:51:58,066 INFO [train.py:996] (1/4) Epoch 11, batch 5250, loss[loss=0.2023, simple_loss=0.287, pruned_loss=0.05879, over 21749.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3106, pruned_loss=0.07695, over 4273566.08 frames. ], batch size: 124, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:52:18,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.518e+02 1.553e+03 2.129e+03 4.596e+03, threshold=3.106e+03, percent-clipped=26.0 2023-06-24 23:52:52,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1861296.0, ans=0.125 2023-06-24 23:53:07,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1861356.0, ans=0.125 2023-06-24 23:53:31,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1861416.0, ans=0.125 2023-06-24 23:53:32,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1861416.0, ans=0.125 2023-06-24 23:53:38,141 INFO [train.py:996] (1/4) Epoch 11, batch 5300, loss[loss=0.2419, simple_loss=0.3106, pruned_loss=0.08657, over 21869.00 frames. ], tot_loss[loss=0.233, simple_loss=0.31, pruned_loss=0.07794, over 4287635.34 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:53:47,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1861476.0, ans=0.0 2023-06-24 23:54:19,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1861596.0, ans=0.0 2023-06-24 23:54:41,127 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-24 23:55:10,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-24 23:55:22,254 INFO [train.py:996] (1/4) Epoch 11, batch 5350, loss[loss=0.2319, simple_loss=0.2988, pruned_loss=0.0825, over 21786.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3091, pruned_loss=0.07928, over 4299154.61 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:55:35,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 7.558e+02 1.125e+03 1.569e+03 2.899e+03, threshold=2.250e+03, percent-clipped=0.0 2023-06-24 23:57:01,682 INFO [train.py:996] (1/4) Epoch 11, batch 5400, loss[loss=0.261, simple_loss=0.3123, pruned_loss=0.1048, over 21782.00 frames. 
], tot_loss[loss=0.2343, simple_loss=0.3082, pruned_loss=0.08025, over 4304979.77 frames. ], batch size: 508, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:57:46,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1862196.0, ans=0.1 2023-06-24 23:57:54,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1862196.0, ans=0.0 2023-06-24 23:57:59,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1862196.0, ans=0.0 2023-06-24 23:58:30,481 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.59 vs. limit=12.0 2023-06-24 23:58:36,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1862316.0, ans=0.2 2023-06-24 23:58:48,757 INFO [train.py:996] (1/4) Epoch 11, batch 5450, loss[loss=0.1985, simple_loss=0.279, pruned_loss=0.05899, over 21324.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3081, pruned_loss=0.07812, over 4296199.92 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:59:10,606 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.635e+02 1.460e+03 2.379e+03 5.903e+03, threshold=2.920e+03, percent-clipped=27.0 2023-06-24 23:59:23,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-24 23:59:47,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1862496.0, ans=0.125 2023-06-25 00:00:45,465 INFO [train.py:996] (1/4) Epoch 11, batch 5500, loss[loss=0.2436, simple_loss=0.3461, pruned_loss=0.07053, over 21277.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3136, pruned_loss=0.07552, over 4290298.37 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:00:59,546 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:01:12,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1862736.0, ans=0.07 2023-06-25 00:01:56,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-25 00:01:56,336 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-25 00:02:05,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1862856.0, ans=0.0 2023-06-25 00:02:16,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1862916.0, ans=0.125 2023-06-25 00:02:26,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1862916.0, ans=0.125 2023-06-25 00:02:33,320 INFO [train.py:996] (1/4) Epoch 11, batch 5550, loss[loss=0.1896, simple_loss=0.2922, pruned_loss=0.04346, over 21610.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.314, pruned_loss=0.07319, over 4291035.50 frames. 
], batch size: 389, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:02:48,987 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.684e+02 8.321e+02 1.311e+03 1.956e+03 3.720e+03, threshold=2.623e+03, percent-clipped=7.0 2023-06-25 00:03:58,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1863156.0, ans=0.125 2023-06-25 00:04:00,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1863156.0, ans=0.125 2023-06-25 00:04:11,277 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:04:21,201 INFO [train.py:996] (1/4) Epoch 11, batch 5600, loss[loss=0.2321, simple_loss=0.3083, pruned_loss=0.078, over 21279.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3127, pruned_loss=0.07063, over 4282435.46 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:04:57,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1863336.0, ans=0.125 2023-06-25 00:05:47,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1863456.0, ans=0.125 2023-06-25 00:06:06,181 INFO [train.py:996] (1/4) Epoch 11, batch 5650, loss[loss=0.2648, simple_loss=0.3358, pruned_loss=0.0969, over 21796.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3166, pruned_loss=0.07313, over 4284369.71 frames. ], batch size: 298, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:06:08,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1863576.0, ans=0.0 2023-06-25 00:06:32,021 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 8.541e+02 1.292e+03 2.009e+03 3.827e+03, threshold=2.583e+03, percent-clipped=13.0 2023-06-25 00:07:22,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-25 00:07:57,688 INFO [train.py:996] (1/4) Epoch 11, batch 5700, loss[loss=0.2018, simple_loss=0.3015, pruned_loss=0.05109, over 21735.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3145, pruned_loss=0.07488, over 4290813.05 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:08:22,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1863936.0, ans=0.125 2023-06-25 00:08:29,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1863936.0, ans=0.125 2023-06-25 00:08:39,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1863936.0, ans=0.0 2023-06-25 00:08:42,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-25 00:08:51,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1863996.0, ans=0.0 2023-06-25 00:09:29,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-25 00:09:53,508 INFO [train.py:996] (1/4) Epoch 11, batch 5750, loss[loss=0.2142, simple_loss=0.3199, pruned_loss=0.05424, over 21206.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3103, pruned_loss=0.07165, over 4283716.07 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:10:02,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1864176.0, ans=0.125 2023-06-25 00:10:08,445 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.527e+02 8.365e+02 1.283e+03 1.865e+03 4.523e+03, threshold=2.566e+03, percent-clipped=10.0 2023-06-25 00:10:29,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1864236.0, ans=0.125 2023-06-25 00:11:31,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1864416.0, ans=0.0 2023-06-25 00:11:39,493 INFO [train.py:996] (1/4) Epoch 11, batch 5800, loss[loss=0.2481, simple_loss=0.3576, pruned_loss=0.0693, over 21242.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3119, pruned_loss=0.07052, over 4279169.08 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:11:41,849 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:12:26,762 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 00:13:10,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864716.0, ans=0.1 2023-06-25 00:13:21,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1864716.0, ans=0.125 2023-06-25 00:13:23,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1864716.0, ans=0.125 2023-06-25 00:13:32,908 INFO [train.py:996] (1/4) Epoch 11, batch 5850, loss[loss=0.1698, simple_loss=0.2705, pruned_loss=0.03454, over 21716.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3089, pruned_loss=0.06625, over 4273437.54 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:13:46,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1864776.0, ans=0.0 2023-06-25 00:13:46,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1864776.0, ans=0.125 2023-06-25 00:13:53,435 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 6.927e+02 1.116e+03 1.995e+03 4.965e+03, threshold=2.231e+03, percent-clipped=19.0 2023-06-25 00:14:07,488 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-25 00:14:51,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864956.0, ans=0.1 2023-06-25 00:15:17,037 INFO [train.py:996] (1/4) Epoch 11, batch 5900, loss[loss=0.2584, simple_loss=0.3426, pruned_loss=0.08713, over 19927.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.3017, pruned_loss=0.06148, over 4272353.11 frames. 
], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:15:31,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865076.0, ans=0.1 2023-06-25 00:15:35,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1865076.0, ans=0.0 2023-06-25 00:15:40,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865136.0, ans=0.1 2023-06-25 00:16:59,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1865376.0, ans=0.2 2023-06-25 00:17:06,611 INFO [train.py:996] (1/4) Epoch 11, batch 5950, loss[loss=0.2102, simple_loss=0.2732, pruned_loss=0.07355, over 21343.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.3009, pruned_loss=0.06537, over 4282295.55 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:17:21,627 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 6.600e+02 8.461e+02 1.275e+03 2.602e+03, threshold=1.692e+03, percent-clipped=3.0 2023-06-25 00:18:51,546 INFO [train.py:996] (1/4) Epoch 11, batch 6000, loss[loss=0.1699, simple_loss=0.2261, pruned_loss=0.05688, over 19953.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2961, pruned_loss=0.06849, over 4265517.13 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 00:18:51,547 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 00:19:02,309 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.9577, 3.3489, 3.4703, 3.6528], device='cuda:1') 2023-06-25 00:19:08,581 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2642, simple_loss=0.3568, pruned_loss=0.08578, over 1796401.00 frames. 2023-06-25 00:19:08,583 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 00:19:14,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1865676.0, ans=0.07 2023-06-25 00:20:53,372 INFO [train.py:996] (1/4) Epoch 11, batch 6050, loss[loss=0.2088, simple_loss=0.2843, pruned_loss=0.06668, over 21437.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2912, pruned_loss=0.06993, over 4275209.29 frames. ], batch size: 473, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:20:56,118 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-06-25 00:21:18,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.947e+02 8.062e+02 1.043e+03 1.359e+03 2.248e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-25 00:21:40,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1866096.0, ans=0.1 2023-06-25 00:21:48,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1866096.0, ans=0.035 2023-06-25 00:21:50,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1866096.0, ans=0.0 2023-06-25 00:22:24,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1866216.0, ans=0.125 2023-06-25 00:22:39,224 INFO [train.py:996] (1/4) Epoch 11, batch 6100, loss[loss=0.2372, simple_loss=0.3045, pruned_loss=0.08493, over 21790.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2909, pruned_loss=0.06851, over 4272914.19 frames. ], batch size: 112, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:23:19,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1866336.0, ans=0.1 2023-06-25 00:23:44,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1866396.0, ans=0.125 2023-06-25 00:23:46,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1866456.0, ans=0.125 2023-06-25 00:24:27,321 INFO [train.py:996] (1/4) Epoch 11, batch 6150, loss[loss=0.2058, simple_loss=0.2833, pruned_loss=0.06416, over 21517.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2948, pruned_loss=0.07036, over 4270641.63 frames. ], batch size: 212, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:24:58,592 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.679e+02 1.290e+03 1.928e+03 3.741e+03, threshold=2.581e+03, percent-clipped=18.0 2023-06-25 00:25:00,421 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:25:20,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1866696.0, ans=0.125 2023-06-25 00:25:48,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1866756.0, ans=0.0 2023-06-25 00:25:57,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1866816.0, ans=0.1 2023-06-25 00:26:19,972 INFO [train.py:996] (1/4) Epoch 11, batch 6200, loss[loss=0.2877, simple_loss=0.3631, pruned_loss=0.1061, over 21546.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2982, pruned_loss=0.07138, over 4266637.22 frames. 
], batch size: 471, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:26:51,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1866936.0, ans=0.1 2023-06-25 00:27:08,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1866996.0, ans=0.125 2023-06-25 00:27:17,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1866996.0, ans=0.125 2023-06-25 00:27:30,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1867056.0, ans=0.125 2023-06-25 00:27:46,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1867116.0, ans=0.2 2023-06-25 00:28:05,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-25 00:28:06,354 INFO [train.py:996] (1/4) Epoch 11, batch 6250, loss[loss=0.2045, simple_loss=0.3076, pruned_loss=0.0507, over 21785.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3043, pruned_loss=0.0715, over 4264878.95 frames. ], batch size: 282, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:28:21,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1867176.0, ans=15.0 2023-06-25 00:28:31,522 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.849e+02 8.847e+02 1.490e+03 2.226e+03 5.467e+03, threshold=2.981e+03, percent-clipped=18.0 2023-06-25 00:29:00,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1867296.0, ans=0.1 2023-06-25 00:29:24,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-25 00:29:43,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1867416.0, ans=0.0 2023-06-25 00:29:52,772 INFO [train.py:996] (1/4) Epoch 11, batch 6300, loss[loss=0.2543, simple_loss=0.3234, pruned_loss=0.09264, over 21944.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3072, pruned_loss=0.07026, over 4260256.89 frames. ], batch size: 113, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:30:35,866 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-25 00:31:07,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1867656.0, ans=0.125 2023-06-25 00:31:44,192 INFO [train.py:996] (1/4) Epoch 11, batch 6350, loss[loss=0.2341, simple_loss=0.3042, pruned_loss=0.08205, over 21901.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3102, pruned_loss=0.07433, over 4275155.32 frames. 
], batch size: 316, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:32:08,040 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.095e+02 6.705e+02 8.360e+02 1.250e+03 2.332e+03, threshold=1.672e+03, percent-clipped=0.0 2023-06-25 00:32:32,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1867896.0, ans=0.0 2023-06-25 00:33:37,595 INFO [train.py:996] (1/4) Epoch 11, batch 6400, loss[loss=0.2819, simple_loss=0.3499, pruned_loss=0.1069, over 21810.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.315, pruned_loss=0.07887, over 4280305.11 frames. ], batch size: 118, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:34:58,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1868256.0, ans=0.125 2023-06-25 00:35:00,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.74 vs. limit=10.0 2023-06-25 00:35:06,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1868316.0, ans=0.125 2023-06-25 00:35:26,314 INFO [train.py:996] (1/4) Epoch 11, batch 6450, loss[loss=0.2094, simple_loss=0.2934, pruned_loss=0.06266, over 21154.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3184, pruned_loss=0.07887, over 4273128.48 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:35:31,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1868376.0, ans=0.0 2023-06-25 00:35:51,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 9.176e+02 1.134e+03 1.706e+03 4.418e+03, threshold=2.268e+03, percent-clipped=27.0 2023-06-25 00:36:54,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1868616.0, ans=0.2 2023-06-25 00:37:13,889 INFO [train.py:996] (1/4) Epoch 11, batch 6500, loss[loss=0.2504, simple_loss=0.3459, pruned_loss=0.07741, over 21557.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3122, pruned_loss=0.07732, over 4259400.52 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:38:47,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-25 00:38:59,844 INFO [train.py:996] (1/4) Epoch 11, batch 6550, loss[loss=0.2063, simple_loss=0.2777, pruned_loss=0.06747, over 20115.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3093, pruned_loss=0.07575, over 4253557.95 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:39:24,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.229e+02 1.425e+03 2.181e+03 3.625e+03, threshold=2.850e+03, percent-clipped=21.0 2023-06-25 00:39:52,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1869096.0, ans=0.2 2023-06-25 00:40:31,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=12.0 2023-06-25 00:40:37,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1869216.0, ans=0.125 2023-06-25 00:40:47,115 INFO [train.py:996] (1/4) Epoch 11, batch 6600, loss[loss=0.2038, simple_loss=0.2694, pruned_loss=0.06913, over 21763.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3038, pruned_loss=0.07578, over 4258825.11 frames. ], batch size: 371, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:41:07,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1869276.0, ans=0.125 2023-06-25 00:41:26,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1869336.0, ans=0.2 2023-06-25 00:41:31,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1869336.0, ans=0.1 2023-06-25 00:41:36,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1869396.0, ans=0.2 2023-06-25 00:41:43,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1869396.0, ans=0.0 2023-06-25 00:42:36,108 INFO [train.py:996] (1/4) Epoch 11, batch 6650, loss[loss=0.1927, simple_loss=0.2701, pruned_loss=0.05767, over 21813.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2963, pruned_loss=0.0735, over 4265985.42 frames. ], batch size: 352, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:42:37,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=12.0 2023-06-25 00:43:06,939 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 5.753e+02 7.174e+02 1.040e+03 2.181e+03, threshold=1.435e+03, percent-clipped=0.0 2023-06-25 00:43:07,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1869636.0, ans=0.0 2023-06-25 00:43:11,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-25 00:43:18,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=1869636.0, ans=12.0 2023-06-25 00:44:32,447 INFO [train.py:996] (1/4) Epoch 11, batch 6700, loss[loss=0.2411, simple_loss=0.3567, pruned_loss=0.06275, over 19829.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2919, pruned_loss=0.07343, over 4259736.43 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:44:32,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1869876.0, ans=0.0 2023-06-25 00:44:33,577 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-25 00:44:35,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-25 00:44:36,713 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:45:03,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-25 00:45:19,172 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1869996.0, ans=0.125 2023-06-25 00:45:44,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1870056.0, ans=0.0 2023-06-25 00:45:48,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1870056.0, ans=0.125 2023-06-25 00:46:14,667 INFO [train.py:996] (1/4) Epoch 11, batch 6750, loss[loss=0.2312, simple_loss=0.3495, pruned_loss=0.05645, over 19818.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2905, pruned_loss=0.07394, over 4257142.63 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:46:43,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1870236.0, ans=0.0 2023-06-25 00:46:46,801 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.208e+02 1.148e+03 1.600e+03 3.333e+03, threshold=2.296e+03, percent-clipped=33.0 2023-06-25 00:46:57,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1870236.0, ans=0.125 2023-06-25 00:47:42,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1870416.0, ans=0.2 2023-06-25 00:47:50,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1870416.0, ans=0.125 2023-06-25 00:47:51,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-25 00:47:59,233 INFO [train.py:996] (1/4) Epoch 11, batch 6800, loss[loss=0.2357, simple_loss=0.2978, pruned_loss=0.08682, over 22027.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2931, pruned_loss=0.0758, over 4263845.68 frames. ], batch size: 103, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:48:02,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1870476.0, ans=0.0 2023-06-25 00:48:12,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1870476.0, ans=0.125 2023-06-25 00:49:14,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-25 00:49:32,168 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1870716.0, ans=0.0 2023-06-25 00:49:44,496 INFO [train.py:996] (1/4) Epoch 11, batch 6850, loss[loss=0.2444, simple_loss=0.2966, pruned_loss=0.0961, over 21555.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2926, pruned_loss=0.07617, over 4269151.08 frames. 
], batch size: 508, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:50:16,085 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 8.303e+02 1.235e+03 2.153e+03 3.729e+03, threshold=2.471e+03, percent-clipped=22.0 2023-06-25 00:50:17,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-25 00:50:20,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1870836.0, ans=0.125 2023-06-25 00:51:01,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1870956.0, ans=0.2 2023-06-25 00:51:23,156 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1871016.0, ans=0.125 2023-06-25 00:51:31,203 INFO [train.py:996] (1/4) Epoch 11, batch 6900, loss[loss=0.2169, simple_loss=0.2835, pruned_loss=0.07515, over 21554.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2924, pruned_loss=0.07634, over 4279805.22 frames. ], batch size: 212, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:51:49,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1871076.0, ans=0.0 2023-06-25 00:52:33,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1871196.0, ans=0.0 2023-06-25 00:53:27,072 INFO [train.py:996] (1/4) Epoch 11, batch 6950, loss[loss=0.2596, simple_loss=0.3332, pruned_loss=0.09302, over 21927.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.297, pruned_loss=0.07391, over 4270947.05 frames. ], batch size: 316, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:53:42,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1871376.0, ans=0.125 2023-06-25 00:53:53,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 7.235e+02 1.015e+03 1.522e+03 6.325e+03, threshold=2.030e+03, percent-clipped=9.0 2023-06-25 00:55:15,825 INFO [train.py:996] (1/4) Epoch 11, batch 7000, loss[loss=0.2321, simple_loss=0.2857, pruned_loss=0.08922, over 21583.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2997, pruned_loss=0.07546, over 4280683.47 frames. ], batch size: 231, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:56:02,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 00:56:34,960 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-25 00:56:54,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871916.0, ans=0.1 2023-06-25 00:57:10,157 INFO [train.py:996] (1/4) Epoch 11, batch 7050, loss[loss=0.202, simple_loss=0.2872, pruned_loss=0.05841, over 21608.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2965, pruned_loss=0.07504, over 4279106.24 frames. 
], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:57:37,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 8.822e+02 1.310e+03 1.745e+03 4.662e+03, threshold=2.619e+03, percent-clipped=19.0 2023-06-25 00:59:02,567 INFO [train.py:996] (1/4) Epoch 11, batch 7100, loss[loss=0.2138, simple_loss=0.2992, pruned_loss=0.06416, over 21741.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3022, pruned_loss=0.07676, over 4281899.09 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:59:37,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1872336.0, ans=0.0 2023-06-25 01:00:45,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1872516.0, ans=0.95 2023-06-25 01:00:53,319 INFO [train.py:996] (1/4) Epoch 11, batch 7150, loss[loss=0.1425, simple_loss=0.2097, pruned_loss=0.0376, over 21853.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2999, pruned_loss=0.07474, over 4271680.63 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:01:25,402 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 7.662e+02 1.147e+03 1.671e+03 2.803e+03, threshold=2.294e+03, percent-clipped=2.0 2023-06-25 01:01:29,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1872636.0, ans=0.2 2023-06-25 01:02:08,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1872756.0, ans=0.0 2023-06-25 01:02:10,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1872756.0, ans=0.0 2023-06-25 01:02:21,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1872756.0, ans=0.125 2023-06-25 01:02:51,335 INFO [train.py:996] (1/4) Epoch 11, batch 7200, loss[loss=0.2225, simple_loss=0.291, pruned_loss=0.07701, over 21236.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3029, pruned_loss=0.07628, over 4269446.97 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 01:03:03,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1872876.0, ans=0.125 2023-06-25 01:03:22,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872936.0, ans=0.1 2023-06-25 01:03:25,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1872936.0, ans=0.125 2023-06-25 01:03:48,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1872996.0, ans=0.1 2023-06-25 01:04:30,459 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-25 01:04:37,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1873116.0, ans=0.0 2023-06-25 01:04:40,395 INFO [train.py:996] (1/4) Epoch 11, batch 7250, loss[loss=0.2079, simple_loss=0.2793, pruned_loss=0.06825, over 21859.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2999, pruned_loss=0.07659, over 4264147.08 frames. 
], batch size: 107, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:05:06,932 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.525e+02 1.021e+03 1.447e+03 2.035e+03 4.041e+03, threshold=2.893e+03, percent-clipped=18.0 2023-06-25 01:05:07,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1873236.0, ans=0.0 2023-06-25 01:05:10,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1873236.0, ans=0.125 2023-06-25 01:06:19,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1873416.0, ans=0.0 2023-06-25 01:06:22,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1873416.0, ans=0.0 2023-06-25 01:06:27,166 INFO [train.py:996] (1/4) Epoch 11, batch 7300, loss[loss=0.1916, simple_loss=0.2583, pruned_loss=0.06242, over 21302.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2934, pruned_loss=0.07546, over 4260020.83 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:06:29,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873476.0, ans=0.1 2023-06-25 01:06:56,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1873536.0, ans=0.2 2023-06-25 01:07:13,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1873596.0, ans=0.0 2023-06-25 01:07:18,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1873596.0, ans=0.0 2023-06-25 01:07:18,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873596.0, ans=0.1 2023-06-25 01:07:20,554 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-25 01:07:40,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873656.0, ans=0.1 2023-06-25 01:08:02,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873716.0, ans=0.1 2023-06-25 01:08:16,342 INFO [train.py:996] (1/4) Epoch 11, batch 7350, loss[loss=0.2565, simple_loss=0.3282, pruned_loss=0.09235, over 21464.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2918, pruned_loss=0.07664, over 4263205.42 frames. 
], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:08:43,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.949e+02 8.143e+02 1.181e+03 1.694e+03 4.027e+03, threshold=2.361e+03, percent-clipped=4.0 2023-06-25 01:09:01,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1873896.0, ans=0.0 2023-06-25 01:09:04,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1873896.0, ans=0.2 2023-06-25 01:09:43,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874016.0, ans=0.1 2023-06-25 01:09:55,943 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-25 01:10:11,701 INFO [train.py:996] (1/4) Epoch 11, batch 7400, loss[loss=0.2413, simple_loss=0.3147, pruned_loss=0.08397, over 20877.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2961, pruned_loss=0.07762, over 4261497.26 frames. ], batch size: 609, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:10:18,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1874076.0, ans=0.0 2023-06-25 01:10:25,976 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. limit=15.0 2023-06-25 01:10:51,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-25 01:11:41,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1874316.0, ans=0.2 2023-06-25 01:12:03,186 INFO [train.py:996] (1/4) Epoch 11, batch 7450, loss[loss=0.2099, simple_loss=0.2789, pruned_loss=0.07039, over 21581.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2931, pruned_loss=0.07619, over 4263056.89 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:12:07,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1874376.0, ans=0.125 2023-06-25 01:12:33,139 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 7.768e+02 1.010e+03 1.629e+03 4.953e+03, threshold=2.020e+03, percent-clipped=6.0 2023-06-25 01:12:46,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1874436.0, ans=0.125 2023-06-25 01:12:50,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874496.0, ans=0.1 2023-06-25 01:13:07,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1874556.0, ans=0.125 2023-06-25 01:13:54,440 INFO [train.py:996] (1/4) Epoch 11, batch 7500, loss[loss=0.2223, simple_loss=0.3196, pruned_loss=0.06247, over 21276.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2986, pruned_loss=0.07817, over 4271238.49 frames. 
], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:14:03,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1874676.0, ans=0.125 2023-06-25 01:14:05,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-25 01:14:29,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1874736.0, ans=0.0 2023-06-25 01:14:43,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874796.0, ans=0.1 2023-06-25 01:15:11,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.50 vs. limit=6.0 2023-06-25 01:15:34,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1874916.0, ans=0.0 2023-06-25 01:15:43,452 INFO [train.py:996] (1/4) Epoch 11, batch 7550, loss[loss=0.226, simple_loss=0.3269, pruned_loss=0.0625, over 21779.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3065, pruned_loss=0.07693, over 4274274.63 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:15:55,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1874976.0, ans=0.125 2023-06-25 01:16:17,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 9.851e+02 1.650e+03 2.404e+03 5.031e+03, threshold=3.301e+03, percent-clipped=35.0 2023-06-25 01:16:46,190 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-25 01:16:52,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1875156.0, ans=0.1 2023-06-25 01:17:15,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1875216.0, ans=0.125 2023-06-25 01:17:29,912 INFO [train.py:996] (1/4) Epoch 11, batch 7600, loss[loss=0.2195, simple_loss=0.2981, pruned_loss=0.07042, over 21872.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3056, pruned_loss=0.07587, over 4278133.65 frames. ], batch size: 371, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 01:17:32,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1875276.0, ans=0.0 2023-06-25 01:17:33,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875276.0, ans=0.1 2023-06-25 01:18:02,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.98 vs. limit=6.0 2023-06-25 01:18:15,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1875396.0, ans=0.125 2023-06-25 01:18:19,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.51 vs. 
limit=22.5 2023-06-25 01:18:31,525 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:18:34,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1875456.0, ans=0.125 2023-06-25 01:18:59,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1875516.0, ans=0.125 2023-06-25 01:19:00,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-25 01:19:11,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1875516.0, ans=0.015 2023-06-25 01:19:14,324 INFO [train.py:996] (1/4) Epoch 11, batch 7650, loss[loss=0.2159, simple_loss=0.2784, pruned_loss=0.07666, over 21567.00 frames. ], tot_loss[loss=0.229, simple_loss=0.304, pruned_loss=0.07702, over 4289700.11 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:19:36,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1875636.0, ans=0.0 2023-06-25 01:19:43,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1875636.0, ans=0.125 2023-06-25 01:19:44,584 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.567e+02 1.161e+03 1.543e+03 3.222e+03, threshold=2.322e+03, percent-clipped=0.0 2023-06-25 01:19:51,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1875636.0, ans=0.0 2023-06-25 01:20:42,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1875816.0, ans=0.0 2023-06-25 01:20:56,017 INFO [train.py:996] (1/4) Epoch 11, batch 7700, loss[loss=0.2967, simple_loss=0.368, pruned_loss=0.1127, over 21814.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3081, pruned_loss=0.08038, over 4289905.31 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:21:00,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1875876.0, ans=0.1 2023-06-25 01:21:42,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1875996.0, ans=0.0 2023-06-25 01:21:43,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-25 01:21:45,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1875996.0, ans=0.125 2023-06-25 01:22:10,327 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.36 vs. limit=10.0 2023-06-25 01:22:43,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1876116.0, ans=0.125 2023-06-25 01:22:45,827 INFO [train.py:996] (1/4) Epoch 11, batch 7750, loss[loss=0.2248, simple_loss=0.3103, pruned_loss=0.06965, over 21274.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3124, pruned_loss=0.07984, over 4286857.71 frames. 
], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:23:10,557 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.928e+02 1.247e+03 1.821e+03 3.792e+03, threshold=2.494e+03, percent-clipped=12.0 2023-06-25 01:23:55,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1876356.0, ans=0.125 2023-06-25 01:24:31,896 INFO [train.py:996] (1/4) Epoch 11, batch 7800, loss[loss=0.2615, simple_loss=0.3433, pruned_loss=0.08987, over 21557.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3153, pruned_loss=0.08054, over 4277862.31 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:24:38,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-25 01:24:39,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-25 01:24:52,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.72 vs. limit=22.5 2023-06-25 01:25:19,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876596.0, ans=0.1 2023-06-25 01:25:31,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1876656.0, ans=0.125 2023-06-25 01:26:15,691 INFO [train.py:996] (1/4) Epoch 11, batch 7850, loss[loss=0.2051, simple_loss=0.2634, pruned_loss=0.07338, over 21458.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3071, pruned_loss=0.0792, over 4264739.34 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:26:46,367 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.105e+02 1.212e+03 1.898e+03 4.667e+03, threshold=2.425e+03, percent-clipped=9.0 2023-06-25 01:26:47,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=22.5 2023-06-25 01:28:06,262 INFO [train.py:996] (1/4) Epoch 11, batch 7900, loss[loss=0.1903, simple_loss=0.2597, pruned_loss=0.06047, over 21244.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3029, pruned_loss=0.07906, over 4269074.92 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:28:13,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1877076.0, ans=0.1 2023-06-25 01:28:41,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1877136.0, ans=0.95 2023-06-25 01:28:47,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1877136.0, ans=6.0 2023-06-25 01:29:41,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1877316.0, ans=0.125 2023-06-25 01:29:50,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-25 01:29:57,779 INFO [train.py:996] (1/4) Epoch 11, batch 7950, loss[loss=0.2092, simple_loss=0.2913, pruned_loss=0.06357, over 21438.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3046, pruned_loss=0.07761, over 4263073.21 frames. ], batch size: 211, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:30:05,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1877376.0, ans=0.125 2023-06-25 01:30:24,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1877376.0, ans=0.125 2023-06-25 01:30:27,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=15.0 2023-06-25 01:30:33,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.24 vs. limit=12.0 2023-06-25 01:30:35,727 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.494e+02 9.486e+02 1.599e+03 2.410e+03 5.026e+03, threshold=3.197e+03, percent-clipped=23.0 2023-06-25 01:32:03,954 INFO [train.py:996] (1/4) Epoch 11, batch 8000, loss[loss=0.2662, simple_loss=0.3476, pruned_loss=0.09239, over 21641.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3092, pruned_loss=0.07903, over 4263593.25 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:32:23,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-25 01:32:51,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-25 01:32:52,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1877796.0, ans=0.1 2023-06-25 01:33:03,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1877796.0, ans=0.125 2023-06-25 01:33:11,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1877856.0, ans=0.125 2023-06-25 01:33:11,747 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:33:56,991 INFO [train.py:996] (1/4) Epoch 11, batch 8050, loss[loss=0.2121, simple_loss=0.2826, pruned_loss=0.07081, over 21397.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.311, pruned_loss=0.07879, over 4251355.20 frames. 
], batch size: 194, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:34:34,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 8.572e+02 1.267e+03 1.861e+03 4.173e+03, threshold=2.534e+03, percent-clipped=4.0 2023-06-25 01:34:46,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1878096.0, ans=0.1 2023-06-25 01:34:56,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1878096.0, ans=0.04949747468305833 2023-06-25 01:35:34,189 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:35:45,699 INFO [train.py:996] (1/4) Epoch 11, batch 8100, loss[loss=0.2588, simple_loss=0.3215, pruned_loss=0.09807, over 21624.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3113, pruned_loss=0.07987, over 4258144.90 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:35:54,233 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-25 01:36:11,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1878276.0, ans=0.125 2023-06-25 01:36:53,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1878396.0, ans=0.125 2023-06-25 01:36:53,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1878396.0, ans=0.2 2023-06-25 01:37:01,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1878456.0, ans=0.2 2023-06-25 01:37:25,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1878516.0, ans=0.0 2023-06-25 01:37:48,333 INFO [train.py:996] (1/4) Epoch 11, batch 8150, loss[loss=0.2743, simple_loss=0.375, pruned_loss=0.08678, over 21589.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3202, pruned_loss=0.08174, over 4266112.81 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:37:48,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1878576.0, ans=0.0 2023-06-25 01:37:49,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-25 01:38:03,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1878636.0, ans=0.1 2023-06-25 01:38:17,990 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.023e+02 7.751e+02 1.218e+03 2.122e+03 5.445e+03, threshold=2.437e+03, percent-clipped=16.0 2023-06-25 01:39:13,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1878816.0, ans=0.125 2023-06-25 01:39:30,388 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=12.0 2023-06-25 01:39:39,802 INFO [train.py:996] (1/4) Epoch 11, batch 8200, loss[loss=0.2612, simple_loss=0.3102, pruned_loss=0.1061, over 21508.00 frames. 
], tot_loss[loss=0.2369, simple_loss=0.3144, pruned_loss=0.07971, over 4257343.18 frames. ], batch size: 442, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:39:40,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1878876.0, ans=0.0 2023-06-25 01:40:38,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1878996.0, ans=0.125 2023-06-25 01:41:03,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1879056.0, ans=0.2 2023-06-25 01:41:28,948 INFO [train.py:996] (1/4) Epoch 11, batch 8250, loss[loss=0.2733, simple_loss=0.3689, pruned_loss=0.08889, over 21194.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3124, pruned_loss=0.0789, over 4252400.71 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:41:57,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1879236.0, ans=0.125 2023-06-25 01:42:00,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.306e+02 1.035e+03 1.633e+03 3.565e+03, threshold=2.069e+03, percent-clipped=11.0 2023-06-25 01:42:18,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1879296.0, ans=0.125 2023-06-25 01:42:20,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-25 01:42:53,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1879416.0, ans=0.125 2023-06-25 01:42:55,407 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:42:55,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1879416.0, ans=0.1 2023-06-25 01:42:56,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-25 01:43:17,175 INFO [train.py:996] (1/4) Epoch 11, batch 8300, loss[loss=0.2783, simple_loss=0.3542, pruned_loss=0.1013, over 21677.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.312, pruned_loss=0.07648, over 4259626.68 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:44:25,164 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 01:44:47,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1879716.0, ans=0.0 2023-06-25 01:45:04,808 INFO [train.py:996] (1/4) Epoch 11, batch 8350, loss[loss=0.2113, simple_loss=0.2836, pruned_loss=0.06947, over 21156.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3111, pruned_loss=0.07484, over 4258644.95 frames. 
], batch size: 548, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:45:13,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1879776.0, ans=0.2 2023-06-25 01:45:44,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 7.785e+02 1.165e+03 1.706e+03 3.630e+03, threshold=2.331e+03, percent-clipped=15.0 2023-06-25 01:46:16,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1879956.0, ans=0.2 2023-06-25 01:46:40,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1880016.0, ans=10.0 2023-06-25 01:46:53,167 INFO [train.py:996] (1/4) Epoch 11, batch 8400, loss[loss=0.2102, simple_loss=0.3074, pruned_loss=0.05648, over 21695.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3083, pruned_loss=0.07234, over 4263466.43 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:47:30,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1880136.0, ans=0.125 2023-06-25 01:47:57,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1880196.0, ans=0.125 2023-06-25 01:48:41,855 INFO [train.py:996] (1/4) Epoch 11, batch 8450, loss[loss=0.2343, simple_loss=0.2996, pruned_loss=0.08452, over 21268.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3066, pruned_loss=0.07241, over 4273423.69 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:49:19,649 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1880436.0, ans=0.0 2023-06-25 01:49:20,515 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.433e+02 1.170e+03 1.916e+03 4.574e+03, threshold=2.341e+03, percent-clipped=17.0 2023-06-25 01:49:59,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1880556.0, ans=0.0 2023-06-25 01:50:24,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-25 01:50:30,081 INFO [train.py:996] (1/4) Epoch 11, batch 8500, loss[loss=0.2564, simple_loss=0.308, pruned_loss=0.1024, over 21281.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3025, pruned_loss=0.07425, over 4272702.99 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:51:19,105 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-25 01:51:28,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1880796.0, ans=0.125 2023-06-25 01:51:33,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-25 01:51:54,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1880856.0, ans=0.2 2023-06-25 01:52:18,606 INFO [train.py:996] (1/4) Epoch 11, batch 8550, loss[loss=0.2532, simple_loss=0.3243, pruned_loss=0.09102, over 21301.00 frames. 
], tot_loss[loss=0.2268, simple_loss=0.3021, pruned_loss=0.07575, over 4267400.03 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:52:56,698 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 6.860e+02 9.469e+02 1.395e+03 3.551e+03, threshold=1.894e+03, percent-clipped=10.0 2023-06-25 01:53:20,854 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.00 vs. limit=22.5 2023-06-25 01:54:20,732 INFO [train.py:996] (1/4) Epoch 11, batch 8600, loss[loss=0.2836, simple_loss=0.4022, pruned_loss=0.08246, over 20763.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3125, pruned_loss=0.0793, over 4266671.78 frames. ], batch size: 607, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:54:47,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1881336.0, ans=0.125 2023-06-25 01:54:55,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1881336.0, ans=0.0 2023-06-25 01:56:09,651 INFO [train.py:996] (1/4) Epoch 11, batch 8650, loss[loss=0.1908, simple_loss=0.2947, pruned_loss=0.04348, over 21627.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3174, pruned_loss=0.07988, over 4272495.55 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:56:20,306 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:56:37,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1881636.0, ans=0.95 2023-06-25 01:56:43,776 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 8.530e+02 1.308e+03 2.199e+03 5.345e+03, threshold=2.615e+03, percent-clipped=30.0 2023-06-25 01:57:09,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1881696.0, ans=0.125 2023-06-25 01:57:09,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1881696.0, ans=0.125 2023-06-25 01:57:14,748 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881756.0, ans=0.1 2023-06-25 01:57:46,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-25 01:57:52,023 INFO [train.py:996] (1/4) Epoch 11, batch 8700, loss[loss=0.2331, simple_loss=0.2892, pruned_loss=0.08854, over 21485.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3092, pruned_loss=0.07641, over 4275578.39 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:58:47,326 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-25 01:59:38,945 INFO [train.py:996] (1/4) Epoch 11, batch 8750, loss[loss=0.2558, simple_loss=0.3203, pruned_loss=0.09562, over 21885.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3065, pruned_loss=0.0772, over 4286551.15 frames. 
], batch size: 333, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:59:50,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1882176.0, ans=0.125 2023-06-25 02:00:25,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.552e+02 1.572e+03 2.395e+03 4.841e+03, threshold=3.145e+03, percent-clipped=19.0 2023-06-25 02:00:53,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1882356.0, ans=0.0 2023-06-25 02:01:15,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1882416.0, ans=0.125 2023-06-25 02:01:26,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882416.0, ans=0.1 2023-06-25 02:01:32,848 INFO [train.py:996] (1/4) Epoch 11, batch 8800, loss[loss=0.2626, simple_loss=0.3349, pruned_loss=0.09513, over 21301.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3151, pruned_loss=0.07973, over 4286995.22 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:01:46,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1882476.0, ans=0.07 2023-06-25 02:01:46,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1882476.0, ans=0.125 2023-06-25 02:02:08,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1882536.0, ans=0.125 2023-06-25 02:02:23,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-25 02:02:33,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1882596.0, ans=0.035 2023-06-25 02:03:01,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1882716.0, ans=0.125 2023-06-25 02:03:02,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1882716.0, ans=15.0 2023-06-25 02:03:02,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-25 02:03:27,966 INFO [train.py:996] (1/4) Epoch 11, batch 8850, loss[loss=0.2366, simple_loss=0.3323, pruned_loss=0.07047, over 21881.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3208, pruned_loss=0.08181, over 4287256.62 frames. 
], batch size: 98, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:03:57,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1882836.0, ans=0.0 2023-06-25 02:04:04,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 8.533e+02 1.157e+03 2.147e+03 4.267e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-25 02:04:39,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1882956.0, ans=0.0 2023-06-25 02:04:46,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1883016.0, ans=0.125 2023-06-25 02:04:54,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1883016.0, ans=0.125 2023-06-25 02:05:14,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1883016.0, ans=0.125 2023-06-25 02:05:17,230 INFO [train.py:996] (1/4) Epoch 11, batch 8900, loss[loss=0.227, simple_loss=0.3118, pruned_loss=0.07113, over 21615.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3153, pruned_loss=0.08003, over 4285557.72 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:05:18,427 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-06-25 02:05:46,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1883136.0, ans=0.125 2023-06-25 02:05:58,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883196.0, ans=0.1 2023-06-25 02:06:13,453 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-25 02:07:13,592 INFO [train.py:996] (1/4) Epoch 11, batch 8950, loss[loss=0.2014, simple_loss=0.2632, pruned_loss=0.06984, over 21174.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3154, pruned_loss=0.0791, over 4280653.32 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:07:48,295 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.956e+02 1.198e+03 2.154e+03 4.592e+03, threshold=2.397e+03, percent-clipped=22.0 2023-06-25 02:08:01,770 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1883496.0, ans=0.09899494936611666 2023-06-25 02:08:55,094 INFO [train.py:996] (1/4) Epoch 11, batch 9000, loss[loss=0.2092, simple_loss=0.2643, pruned_loss=0.07706, over 21474.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3085, pruned_loss=0.07867, over 4270732.09 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:08:55,095 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 02:09:06,046 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.8518, 4.8328, 4.5486, 4.4347], device='cuda:1') 2023-06-25 02:09:12,565 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2589, simple_loss=0.3526, pruned_loss=0.08262, over 1796401.00 frames. 
2023-06-25 02:09:12,566 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 02:09:33,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1883736.0, ans=0.125 2023-06-25 02:09:40,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1883736.0, ans=0.125 2023-06-25 02:09:42,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1883736.0, ans=0.0 2023-06-25 02:09:45,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1883736.0, ans=0.125 2023-06-25 02:09:47,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1883736.0, ans=0.125 2023-06-25 02:10:48,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1883916.0, ans=0.0 2023-06-25 02:11:00,328 INFO [train.py:996] (1/4) Epoch 11, batch 9050, loss[loss=0.2865, simple_loss=0.4054, pruned_loss=0.08374, over 19828.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3063, pruned_loss=0.07612, over 4266552.36 frames. ], batch size: 702, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:11:31,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884036.0, ans=0.1 2023-06-25 02:11:44,403 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.917e+02 7.004e+02 1.025e+03 1.804e+03 4.936e+03, threshold=2.049e+03, percent-clipped=10.0 2023-06-25 02:11:53,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1884096.0, ans=0.0 2023-06-25 02:11:57,128 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-06-25 02:12:50,864 INFO [train.py:996] (1/4) Epoch 11, batch 9100, loss[loss=0.2353, simple_loss=0.318, pruned_loss=0.07629, over 21159.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.312, pruned_loss=0.07853, over 4269725.42 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:12:51,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1884276.0, ans=0.125 2023-06-25 02:14:40,259 INFO [train.py:996] (1/4) Epoch 11, batch 9150, loss[loss=0.2099, simple_loss=0.2996, pruned_loss=0.06017, over 21717.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3144, pruned_loss=0.07646, over 4270863.52 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:14:42,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1884576.0, ans=0.05 2023-06-25 02:15:21,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.034e+03 1.434e+03 2.123e+03 3.847e+03, threshold=2.868e+03, percent-clipped=26.0 2023-06-25 02:15:23,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1884696.0, ans=0.0 2023-06-25 02:15:46,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884696.0, ans=0.1 2023-06-25 02:15:54,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1884756.0, ans=0.125 2023-06-25 02:15:56,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1884756.0, ans=0.0 2023-06-25 02:16:23,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884816.0, ans=0.1 2023-06-25 02:16:33,384 INFO [train.py:996] (1/4) Epoch 11, batch 9200, loss[loss=0.244, simple_loss=0.3243, pruned_loss=0.08184, over 21833.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3155, pruned_loss=0.07514, over 4269117.38 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:16:37,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1884876.0, ans=0.2 2023-06-25 02:16:37,759 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-25 02:16:44,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1884876.0, ans=0.125 2023-06-25 02:17:23,576 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-25 02:18:20,300 INFO [train.py:996] (1/4) Epoch 11, batch 9250, loss[loss=0.2128, simple_loss=0.2759, pruned_loss=0.0748, over 21607.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3179, pruned_loss=0.07759, over 4271560.59 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:18:56,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.312e+02 1.043e+03 1.613e+03 4.110e+03, threshold=2.085e+03, percent-clipped=7.0 2023-06-25 02:20:14,375 INFO [train.py:996] (1/4) Epoch 11, batch 9300, loss[loss=0.2195, simple_loss=0.2788, pruned_loss=0.08006, over 21469.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3125, pruned_loss=0.0773, over 4260159.74 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:20:32,291 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-25 02:21:04,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.24 vs. 
limit=22.5 2023-06-25 02:21:53,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1885716.0, ans=0.125 2023-06-25 02:22:02,539 INFO [train.py:996] (1/4) Epoch 11, batch 9350, loss[loss=0.2666, simple_loss=0.3402, pruned_loss=0.09653, over 21744.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3187, pruned_loss=0.07921, over 4262120.36 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:22:41,003 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.377e+02 8.449e+02 1.377e+03 2.044e+03 3.190e+03, threshold=2.753e+03, percent-clipped=23.0 2023-06-25 02:22:55,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1885896.0, ans=0.125 2023-06-25 02:23:52,739 INFO [train.py:996] (1/4) Epoch 11, batch 9400, loss[loss=0.2486, simple_loss=0.3018, pruned_loss=0.09773, over 21299.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3201, pruned_loss=0.07972, over 4260329.16 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:24:07,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1886076.0, ans=0.2 2023-06-25 02:24:45,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1886196.0, ans=0.125 2023-06-25 02:24:59,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1886256.0, ans=0.0 2023-06-25 02:25:05,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1886256.0, ans=0.125 2023-06-25 02:25:29,857 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-25 02:25:44,634 INFO [train.py:996] (1/4) Epoch 11, batch 9450, loss[loss=0.2532, simple_loss=0.3182, pruned_loss=0.0941, over 21851.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3124, pruned_loss=0.0791, over 4257017.76 frames. ], batch size: 98, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:26:20,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 9.191e+02 1.408e+03 2.175e+03 4.648e+03, threshold=2.816e+03, percent-clipped=10.0 2023-06-25 02:27:33,364 INFO [train.py:996] (1/4) Epoch 11, batch 9500, loss[loss=0.1795, simple_loss=0.26, pruned_loss=0.04945, over 21436.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3048, pruned_loss=0.07758, over 4267092.56 frames. ], batch size: 211, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:29:19,340 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:29:22,398 INFO [train.py:996] (1/4) Epoch 11, batch 9550, loss[loss=0.2646, simple_loss=0.3602, pruned_loss=0.08445, over 21786.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3101, pruned_loss=0.07968, over 4261046.59 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:29:57,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 8.918e+02 1.397e+03 2.020e+03 4.656e+03, threshold=2.794e+03, percent-clipped=11.0 2023-06-25 02:30:06,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1887096.0, ans=0.0 2023-06-25 02:30:23,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1887156.0, ans=0.0 2023-06-25 02:31:04,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1887216.0, ans=0.0 2023-06-25 02:31:08,647 INFO [train.py:996] (1/4) Epoch 11, batch 9600, loss[loss=0.2165, simple_loss=0.2875, pruned_loss=0.07276, over 21792.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.314, pruned_loss=0.08173, over 4264347.66 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 02:31:37,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1887336.0, ans=0.0 2023-06-25 02:31:57,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1887396.0, ans=0.125 2023-06-25 02:32:02,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1887396.0, ans=0.125 2023-06-25 02:32:56,752 INFO [train.py:996] (1/4) Epoch 11, batch 9650, loss[loss=0.2534, simple_loss=0.3228, pruned_loss=0.09198, over 21560.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.314, pruned_loss=0.08075, over 4271795.87 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:33:23,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1887636.0, ans=0.0 2023-06-25 02:33:34,604 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.589e+02 1.260e+03 1.923e+03 2.986e+03, threshold=2.520e+03, percent-clipped=3.0 2023-06-25 02:33:35,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1887696.0, ans=0.07 2023-06-25 02:34:04,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1887756.0, ans=0.125 2023-06-25 02:34:05,934 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0 2023-06-25 02:34:09,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1887756.0, ans=0.0 2023-06-25 02:34:29,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1887816.0, ans=0.125 2023-06-25 02:34:32,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1887816.0, ans=0.09899494936611666 2023-06-25 02:34:45,504 INFO [train.py:996] (1/4) Epoch 11, batch 9700, loss[loss=0.2051, simple_loss=0.2843, pruned_loss=0.06296, over 21394.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.318, pruned_loss=0.0817, over 4275258.33 frames. 
], batch size: 211, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:35:43,390 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-25 02:35:52,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1888056.0, ans=0.125 2023-06-25 02:36:34,153 INFO [train.py:996] (1/4) Epoch 11, batch 9750, loss[loss=0.203, simple_loss=0.2671, pruned_loss=0.06949, over 21896.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3116, pruned_loss=0.08018, over 4270873.92 frames. ], batch size: 373, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:36:34,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1888176.0, ans=0.1 2023-06-25 02:37:02,491 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-25 02:37:09,416 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 8.182e+02 1.091e+03 1.675e+03 6.818e+03, threshold=2.183e+03, percent-clipped=8.0 2023-06-25 02:37:43,540 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.53 vs. limit=6.0 2023-06-25 02:37:49,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1888356.0, ans=0.2 2023-06-25 02:37:59,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1888416.0, ans=0.125 2023-06-25 02:38:19,351 INFO [train.py:996] (1/4) Epoch 11, batch 9800, loss[loss=0.2237, simple_loss=0.2995, pruned_loss=0.07394, over 21834.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3125, pruned_loss=0.08034, over 4264248.33 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:39:27,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1888656.0, ans=0.0 2023-06-25 02:39:34,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1888656.0, ans=0.0 2023-06-25 02:40:00,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1888716.0, ans=0.0 2023-06-25 02:40:05,035 INFO [train.py:996] (1/4) Epoch 11, batch 9850, loss[loss=0.1988, simple_loss=0.2639, pruned_loss=0.06681, over 21772.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.308, pruned_loss=0.07945, over 4260994.08 frames. ], batch size: 102, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:40:34,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.81 vs. 
limit=10.0 2023-06-25 02:40:41,995 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 6.727e+02 9.053e+02 1.353e+03 2.861e+03, threshold=1.811e+03, percent-clipped=2.0 2023-06-25 02:40:58,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1888896.0, ans=0.0 2023-06-25 02:41:01,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1888896.0, ans=0.125 2023-06-25 02:41:21,810 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-25 02:41:30,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1889016.0, ans=0.1 2023-06-25 02:41:53,319 INFO [train.py:996] (1/4) Epoch 11, batch 9900, loss[loss=0.2212, simple_loss=0.2997, pruned_loss=0.07135, over 21713.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3035, pruned_loss=0.07874, over 4258173.24 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:41:57,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1889076.0, ans=0.125 2023-06-25 02:42:02,401 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1889076.0, ans=0.0 2023-06-25 02:43:26,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1889316.0, ans=0.125 2023-06-25 02:43:40,113 INFO [train.py:996] (1/4) Epoch 11, batch 9950, loss[loss=0.2438, simple_loss=0.3101, pruned_loss=0.0888, over 21707.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3044, pruned_loss=0.08039, over 4259567.93 frames. ], batch size: 124, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:44:09,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889436.0, ans=0.1 2023-06-25 02:44:23,434 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.849e+02 1.088e+03 1.572e+03 3.841e+03, threshold=2.175e+03, percent-clipped=17.0 2023-06-25 02:44:42,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1889496.0, ans=0.0 2023-06-25 02:45:27,862 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-25 02:45:36,554 INFO [train.py:996] (1/4) Epoch 11, batch 10000, loss[loss=0.1931, simple_loss=0.2688, pruned_loss=0.05876, over 21526.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3007, pruned_loss=0.07947, over 4268788.16 frames. ], batch size: 195, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 02:45:44,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889676.0, ans=0.1 2023-06-25 02:45:51,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889676.0, ans=0.1 2023-06-25 02:46:36,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. 
limit=12.0 2023-06-25 02:46:53,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1889856.0, ans=0.125 2023-06-25 02:46:55,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1889856.0, ans=0.125 2023-06-25 02:47:01,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1889856.0, ans=0.2 2023-06-25 02:47:25,701 INFO [train.py:996] (1/4) Epoch 11, batch 10050, loss[loss=0.2292, simple_loss=0.3011, pruned_loss=0.07866, over 21503.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3029, pruned_loss=0.08019, over 4270671.45 frames. ], batch size: 211, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:48:02,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1890036.0, ans=10.0 2023-06-25 02:48:13,137 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.178e+02 7.347e+02 1.195e+03 1.566e+03 3.839e+03, threshold=2.391e+03, percent-clipped=10.0 2023-06-25 02:48:33,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1890096.0, ans=0.0 2023-06-25 02:48:46,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890156.0, ans=0.1 2023-06-25 02:49:16,358 INFO [train.py:996] (1/4) Epoch 11, batch 10100, loss[loss=0.2228, simple_loss=0.3273, pruned_loss=0.05914, over 21244.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.303, pruned_loss=0.07876, over 4266792.08 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:49:25,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1890276.0, ans=0.0 2023-06-25 02:49:49,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-25 02:50:22,677 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-25 02:51:12,573 INFO [train.py:996] (1/4) Epoch 11, batch 10150, loss[loss=0.2363, simple_loss=0.3165, pruned_loss=0.07809, over 21719.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3086, pruned_loss=0.08055, over 4273074.69 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:51:59,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.141e+02 7.484e+02 1.008e+03 1.435e+03 3.139e+03, threshold=2.017e+03, percent-clipped=8.0 2023-06-25 02:52:06,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1890696.0, ans=0.125 2023-06-25 02:52:53,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1890876.0, ans=0.07 2023-06-25 02:52:53,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890876.0, ans=0.1 2023-06-25 02:52:54,899 INFO [train.py:996] (1/4) Epoch 11, batch 10200, loss[loss=0.2337, simple_loss=0.313, pruned_loss=0.07723, over 21251.00 frames. 
], tot_loss[loss=0.2339, simple_loss=0.309, pruned_loss=0.07944, over 4271355.66 frames. ], batch size: 143, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:53:18,026 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-25 02:53:19,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1890876.0, ans=0.04949747468305833 2023-06-25 02:53:25,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1890936.0, ans=0.025 2023-06-25 02:53:28,857 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1890936.0, ans=10.0 2023-06-25 02:53:56,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1890996.0, ans=0.125 2023-06-25 02:54:27,217 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1891116.0, ans=0.125 2023-06-25 02:54:47,848 INFO [train.py:996] (1/4) Epoch 11, batch 10250, loss[loss=0.1599, simple_loss=0.255, pruned_loss=0.03242, over 21717.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3022, pruned_loss=0.07328, over 4277934.02 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:54:58,625 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-25 02:55:06,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1891176.0, ans=0.125 2023-06-25 02:55:12,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1891236.0, ans=0.05 2023-06-25 02:55:38,024 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 8.201e+02 1.215e+03 1.712e+03 3.588e+03, threshold=2.431e+03, percent-clipped=17.0 2023-06-25 02:55:47,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1891296.0, ans=0.0 2023-06-25 02:56:14,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1891356.0, ans=0.1 2023-06-25 02:56:46,827 INFO [train.py:996] (1/4) Epoch 11, batch 10300, loss[loss=0.2572, simple_loss=0.3543, pruned_loss=0.08009, over 21625.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3058, pruned_loss=0.07468, over 4282678.73 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:56:56,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1891476.0, ans=0.125 2023-06-25 02:57:27,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1891596.0, ans=0.125 2023-06-25 02:57:29,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1891596.0, ans=0.125 2023-06-25 02:57:54,888 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.06 vs. 
limit=15.0 2023-06-25 02:58:38,074 INFO [train.py:996] (1/4) Epoch 11, batch 10350, loss[loss=0.1633, simple_loss=0.2181, pruned_loss=0.05423, over 21887.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3052, pruned_loss=0.07435, over 4276987.29 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 02:58:38,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1891776.0, ans=0.125 2023-06-25 02:59:25,193 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.545e+02 9.168e+02 1.323e+03 1.995e+03 3.228e+03, threshold=2.646e+03, percent-clipped=12.0 2023-06-25 02:59:27,088 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-25 03:00:32,830 INFO [train.py:996] (1/4) Epoch 11, batch 10400, loss[loss=0.2048, simple_loss=0.2556, pruned_loss=0.07701, over 21188.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.298, pruned_loss=0.07304, over 4272764.49 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:01:42,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1892256.0, ans=0.125 2023-06-25 03:02:13,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1892316.0, ans=0.2 2023-06-25 03:02:23,359 INFO [train.py:996] (1/4) Epoch 11, batch 10450, loss[loss=0.238, simple_loss=0.329, pruned_loss=0.07353, over 21840.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3043, pruned_loss=0.07553, over 4269762.22 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:02:50,315 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-25 03:03:04,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 9.525e+02 1.455e+03 2.411e+03 5.571e+03, threshold=2.910e+03, percent-clipped=19.0 2023-06-25 03:03:18,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1892496.0, ans=0.125 2023-06-25 03:04:05,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1892616.0, ans=0.04949747468305833 2023-06-25 03:04:11,709 INFO [train.py:996] (1/4) Epoch 11, batch 10500, loss[loss=0.2408, simple_loss=0.2964, pruned_loss=0.09259, over 21601.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.304, pruned_loss=0.07447, over 4259767.64 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:04:22,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1892676.0, ans=0.1 2023-06-25 03:04:35,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1892736.0, ans=0.2 2023-06-25 03:04:50,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-25 03:05:23,951 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=12.0 2023-06-25 03:05:28,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1892856.0, ans=0.125 2023-06-25 03:05:30,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892856.0, ans=0.1 2023-06-25 03:05:33,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1892916.0, ans=0.0 2023-06-25 03:05:42,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1892916.0, ans=0.125 2023-06-25 03:05:49,882 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-25 03:05:57,630 INFO [train.py:996] (1/4) Epoch 11, batch 10550, loss[loss=0.1811, simple_loss=0.2501, pruned_loss=0.05605, over 21536.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2985, pruned_loss=0.07381, over 4241456.82 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:06:33,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1893036.0, ans=0.125 2023-06-25 03:06:39,266 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 7.309e+02 9.989e+02 1.510e+03 3.276e+03, threshold=1.998e+03, percent-clipped=4.0 2023-06-25 03:07:35,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893216.0, ans=0.1 2023-06-25 03:07:50,990 INFO [train.py:996] (1/4) Epoch 11, batch 10600, loss[loss=0.1799, simple_loss=0.2782, pruned_loss=0.04082, over 21805.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.295, pruned_loss=0.07316, over 4250875.22 frames. ], batch size: 282, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:07:51,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1893276.0, ans=0.0 2023-06-25 03:07:56,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1893276.0, ans=0.125 2023-06-25 03:08:00,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1893276.0, ans=0.125 2023-06-25 03:08:12,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1893336.0, ans=0.125 2023-06-25 03:09:09,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1893456.0, ans=0.0 2023-06-25 03:09:39,371 INFO [train.py:996] (1/4) Epoch 11, batch 10650, loss[loss=0.2505, simple_loss=0.3282, pruned_loss=0.08643, over 20816.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2976, pruned_loss=0.07143, over 4252335.87 frames. 
], batch size: 611, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:09:42,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893576.0, ans=0.1 2023-06-25 03:10:08,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1893636.0, ans=0.125 2023-06-25 03:10:23,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 7.785e+02 1.184e+03 1.890e+03 4.480e+03, threshold=2.368e+03, percent-clipped=23.0 2023-06-25 03:10:28,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1893696.0, ans=0.2 2023-06-25 03:11:20,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-25 03:11:22,677 INFO [train.py:996] (1/4) Epoch 11, batch 10700, loss[loss=0.2643, simple_loss=0.3302, pruned_loss=0.09925, over 21598.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2988, pruned_loss=0.07217, over 4255359.84 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:11:29,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1893876.0, ans=0.0 2023-06-25 03:12:14,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1893996.0, ans=0.125 2023-06-25 03:12:15,308 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.64 vs. limit=22.5 2023-06-25 03:13:09,985 INFO [train.py:996] (1/4) Epoch 11, batch 10750, loss[loss=0.2319, simple_loss=0.3263, pruned_loss=0.06877, over 21712.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3101, pruned_loss=0.0779, over 4259319.38 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:13:41,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1894236.0, ans=0.125 2023-06-25 03:14:00,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1894236.0, ans=0.125 2023-06-25 03:14:01,641 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:14:03,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1894296.0, ans=0.125 2023-06-25 03:14:06,440 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.572e+02 8.726e+02 1.242e+03 1.937e+03 5.296e+03, threshold=2.484e+03, percent-clipped=18.0 2023-06-25 03:14:26,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1894356.0, ans=0.0 2023-06-25 03:14:49,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 03:14:58,703 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1894416.0, ans=0.1 2023-06-25 03:15:11,182 INFO [train.py:996] (1/4) Epoch 11, batch 10800, loss[loss=0.2467, simple_loss=0.3165, pruned_loss=0.08847, over 21372.00 frames. 
], tot_loss[loss=0.2374, simple_loss=0.3165, pruned_loss=0.07917, over 4261210.68 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:15:17,893 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:16:04,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1894596.0, ans=0.1 2023-06-25 03:16:15,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1894656.0, ans=0.2 2023-06-25 03:16:18,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1894656.0, ans=0.015 2023-06-25 03:16:27,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=22.5 2023-06-25 03:16:35,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1894716.0, ans=0.2 2023-06-25 03:16:59,893 INFO [train.py:996] (1/4) Epoch 11, batch 10850, loss[loss=0.1831, simple_loss=0.277, pruned_loss=0.04466, over 20790.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.316, pruned_loss=0.0782, over 4256522.58 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:17:24,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1894836.0, ans=0.0 2023-06-25 03:17:43,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1894896.0, ans=0.125 2023-06-25 03:17:48,733 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.535e+02 7.569e+02 9.387e+02 1.863e+03 6.222e+03, threshold=1.877e+03, percent-clipped=9.0 2023-06-25 03:17:49,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-25 03:17:49,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1894896.0, ans=15.0 2023-06-25 03:17:54,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1894896.0, ans=0.2 2023-06-25 03:18:02,306 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-25 03:18:50,668 INFO [train.py:996] (1/4) Epoch 11, batch 10900, loss[loss=0.205, simple_loss=0.2837, pruned_loss=0.06314, over 21244.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3081, pruned_loss=0.07639, over 4253936.96 frames. 
], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:19:12,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1895076.0, ans=0.0 2023-06-25 03:19:21,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1895136.0, ans=0.2 2023-06-25 03:19:22,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1895136.0, ans=0.125 2023-06-25 03:19:36,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1895196.0, ans=0.125 2023-06-25 03:20:30,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1895316.0, ans=0.125 2023-06-25 03:20:37,868 INFO [train.py:996] (1/4) Epoch 11, batch 10950, loss[loss=0.2387, simple_loss=0.3009, pruned_loss=0.0883, over 21847.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3028, pruned_loss=0.07511, over 4250316.21 frames. ], batch size: 373, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:21:20,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1895436.0, ans=0.125 2023-06-25 03:21:23,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895496.0, ans=0.1 2023-06-25 03:21:26,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.087e+02 9.989e+02 1.560e+03 2.958e+03, threshold=1.998e+03, percent-clipped=15.0 2023-06-25 03:22:25,925 INFO [train.py:996] (1/4) Epoch 11, batch 11000, loss[loss=0.2237, simple_loss=0.2998, pruned_loss=0.07387, over 21706.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3015, pruned_loss=0.07491, over 4251676.28 frames. ], batch size: 282, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:22:28,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1895676.0, ans=0.125 2023-06-25 03:23:33,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1895856.0, ans=0.125 2023-06-25 03:24:12,469 INFO [train.py:996] (1/4) Epoch 11, batch 11050, loss[loss=0.2119, simple_loss=0.2689, pruned_loss=0.07743, over 21380.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2995, pruned_loss=0.07562, over 4253661.09 frames. ], batch size: 131, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:24:15,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1895976.0, ans=0.0 2023-06-25 03:24:22,230 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.42 vs. limit=15.0 2023-06-25 03:24:45,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1896036.0, ans=0.125 2023-06-25 03:24:57,016 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 7.118e+02 9.875e+02 1.339e+03 2.675e+03, threshold=1.975e+03, percent-clipped=6.0 2023-06-25 03:25:10,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.47 vs. 
limit=12.0 2023-06-25 03:25:13,107 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-25 03:25:14,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1896156.0, ans=0.125 2023-06-25 03:25:54,946 INFO [train.py:996] (1/4) Epoch 11, batch 11100, loss[loss=0.2447, simple_loss=0.328, pruned_loss=0.08071, over 21574.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2973, pruned_loss=0.07589, over 4252761.30 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:26:07,619 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-25 03:27:14,756 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=15.0 2023-06-25 03:27:33,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1896516.0, ans=0.2 2023-06-25 03:27:41,714 INFO [train.py:996] (1/4) Epoch 11, batch 11150, loss[loss=0.2905, simple_loss=0.3666, pruned_loss=0.1072, over 21390.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.296, pruned_loss=0.07585, over 4249852.22 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:27:47,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1896576.0, ans=0.125 2023-06-25 03:28:04,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1896636.0, ans=0.125 2023-06-25 03:28:12,039 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-25 03:28:30,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896696.0, ans=0.1 2023-06-25 03:28:31,540 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.186e+02 9.135e+02 1.372e+03 3.865e+03, threshold=1.827e+03, percent-clipped=12.0 2023-06-25 03:28:49,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1896756.0, ans=0.0 2023-06-25 03:28:59,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.48 vs. limit=10.0 2023-06-25 03:29:31,057 INFO [train.py:996] (1/4) Epoch 11, batch 11200, loss[loss=0.2699, simple_loss=0.3282, pruned_loss=0.1058, over 21575.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2955, pruned_loss=0.07604, over 4249721.59 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:30:33,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1897056.0, ans=0.125 2023-06-25 03:31:19,620 INFO [train.py:996] (1/4) Epoch 11, batch 11250, loss[loss=0.1904, simple_loss=0.2827, pruned_loss=0.0491, over 20828.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2949, pruned_loss=0.07587, over 4251540.02 frames. 
], batch size: 609, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:31:44,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1897236.0, ans=0.2 2023-06-25 03:31:44,863 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1897236.0, ans=0.125 2023-06-25 03:32:07,779 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 7.485e+02 1.049e+03 1.491e+03 3.670e+03, threshold=2.098e+03, percent-clipped=11.0 2023-06-25 03:32:34,611 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1897356.0, ans=0.125 2023-06-25 03:32:50,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1897416.0, ans=0.5 2023-06-25 03:33:00,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1897416.0, ans=0.125 2023-06-25 03:33:07,323 INFO [train.py:996] (1/4) Epoch 11, batch 11300, loss[loss=0.2169, simple_loss=0.3092, pruned_loss=0.06232, over 19925.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2956, pruned_loss=0.07621, over 4254864.80 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:33:10,027 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-25 03:34:24,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1897656.0, ans=0.0 2023-06-25 03:34:54,382 INFO [train.py:996] (1/4) Epoch 11, batch 11350, loss[loss=0.29, simple_loss=0.3602, pruned_loss=0.1099, over 21549.00 frames. ], tot_loss[loss=0.225, simple_loss=0.298, pruned_loss=0.07596, over 4257735.21 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:35:13,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.11 vs. limit=10.0 2023-06-25 03:35:34,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1897836.0, ans=0.125 2023-06-25 03:35:47,367 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.865e+02 1.156e+03 1.769e+03 3.739e+03, threshold=2.312e+03, percent-clipped=14.0 2023-06-25 03:36:32,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1898016.0, ans=0.09899494936611666 2023-06-25 03:36:51,928 INFO [train.py:996] (1/4) Epoch 11, batch 11400, loss[loss=0.2302, simple_loss=0.3092, pruned_loss=0.0756, over 19875.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3049, pruned_loss=0.07858, over 4262623.24 frames. 
], batch size: 702, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:37:22,999 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898136.0, ans=0.1 2023-06-25 03:38:16,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898316.0, ans=0.1 2023-06-25 03:38:17,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1898316.0, ans=0.125 2023-06-25 03:38:19,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1898316.0, ans=0.0 2023-06-25 03:38:39,469 INFO [train.py:996] (1/4) Epoch 11, batch 11450, loss[loss=0.285, simple_loss=0.3535, pruned_loss=0.1082, over 21393.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3055, pruned_loss=0.0775, over 4260626.49 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:38:40,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1898376.0, ans=0.95 2023-06-25 03:39:02,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1898436.0, ans=0.2 2023-06-25 03:39:05,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1898436.0, ans=0.125 2023-06-25 03:39:33,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.188e+02 7.976e+02 1.094e+03 1.671e+03 3.367e+03, threshold=2.188e+03, percent-clipped=9.0 2023-06-25 03:39:58,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1898556.0, ans=0.125 2023-06-25 03:40:24,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1898616.0, ans=0.05 2023-06-25 03:40:28,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1898676.0, ans=0.2 2023-06-25 03:40:29,729 INFO [train.py:996] (1/4) Epoch 11, batch 11500, loss[loss=0.2202, simple_loss=0.3151, pruned_loss=0.06263, over 21770.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3089, pruned_loss=0.07909, over 4267618.71 frames. ], batch size: 332, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:40:40,906 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1898676.0, ans=0.125 2023-06-25 03:40:45,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898676.0, ans=0.1 2023-06-25 03:41:24,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1898796.0, ans=0.0 2023-06-25 03:41:59,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898856.0, ans=0.1 2023-06-25 03:42:25,782 INFO [train.py:996] (1/4) Epoch 11, batch 11550, loss[loss=0.3792, simple_loss=0.465, pruned_loss=0.1467, over 21507.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3136, pruned_loss=0.07893, over 4270402.38 frames. 
], batch size: 508, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:42:31,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1898976.0, ans=0.0 2023-06-25 03:43:01,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1899036.0, ans=0.125 2023-06-25 03:43:07,421 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-25 03:43:08,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1899036.0, ans=0.125 2023-06-25 03:43:16,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1899096.0, ans=0.125 2023-06-25 03:43:21,056 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.334e+02 7.903e+02 1.066e+03 1.850e+03 4.952e+03, threshold=2.132e+03, percent-clipped=19.0 2023-06-25 03:43:29,842 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-25 03:44:00,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1899216.0, ans=0.2 2023-06-25 03:44:16,548 INFO [train.py:996] (1/4) Epoch 11, batch 11600, loss[loss=0.2679, simple_loss=0.3595, pruned_loss=0.08812, over 21447.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3273, pruned_loss=0.08041, over 4271292.88 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:44:25,842 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:44:25,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1899276.0, ans=0.125 2023-06-25 03:44:34,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1899276.0, ans=0.2 2023-06-25 03:44:34,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1899276.0, ans=0.125 2023-06-25 03:44:37,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1899336.0, ans=0.125 2023-06-25 03:44:47,933 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-25 03:45:21,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1899396.0, ans=0.125 2023-06-25 03:45:34,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1899456.0, ans=0.0 2023-06-25 03:45:48,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1899516.0, ans=0.125 2023-06-25 03:46:03,271 INFO [train.py:996] (1/4) Epoch 11, batch 11650, loss[loss=0.2432, simple_loss=0.3356, pruned_loss=0.07545, over 21629.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3327, pruned_loss=0.08109, over 4274804.95 frames. 
], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:46:13,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899576.0, ans=0.1 2023-06-25 03:46:29,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-25 03:46:30,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899636.0, ans=0.1 2023-06-25 03:46:35,877 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-25 03:47:01,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.300e+02 9.276e+02 1.301e+03 2.293e+03 3.963e+03, threshold=2.603e+03, percent-clipped=26.0 2023-06-25 03:47:18,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1899756.0, ans=0.0 2023-06-25 03:47:42,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1899816.0, ans=0.0 2023-06-25 03:47:44,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1899816.0, ans=0.04949747468305833 2023-06-25 03:47:55,811 INFO [train.py:996] (1/4) Epoch 11, batch 11700, loss[loss=0.1909, simple_loss=0.2736, pruned_loss=0.05406, over 15003.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3239, pruned_loss=0.07972, over 4274097.36 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:47:59,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1899876.0, ans=0.0 2023-06-25 03:48:10,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-25 03:48:22,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1899936.0, ans=0.05 2023-06-25 03:48:43,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1899996.0, ans=0.0 2023-06-25 03:49:04,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1900056.0, ans=0.0 2023-06-25 03:49:07,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-25 03:49:08,927 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-25 03:49:42,882 INFO [train.py:996] (1/4) Epoch 11, batch 11750, loss[loss=0.2012, simple_loss=0.2879, pruned_loss=0.05727, over 19876.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3146, pruned_loss=0.07888, over 4277394.16 frames. 
], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:50:36,024 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 8.041e+02 1.029e+03 1.302e+03 3.025e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-25 03:50:47,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1900296.0, ans=0.125 2023-06-25 03:51:31,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.83 vs. limit=15.0 2023-06-25 03:51:31,789 INFO [train.py:996] (1/4) Epoch 11, batch 11800, loss[loss=0.2384, simple_loss=0.3417, pruned_loss=0.06758, over 21659.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3167, pruned_loss=0.08137, over 4276492.17 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:51:54,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1900536.0, ans=0.2 2023-06-25 03:52:05,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1900536.0, ans=0.0 2023-06-25 03:52:23,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1900596.0, ans=0.125 2023-06-25 03:52:36,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1900596.0, ans=0.125 2023-06-25 03:52:44,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1900656.0, ans=0.125 2023-06-25 03:53:19,746 INFO [train.py:996] (1/4) Epoch 11, batch 11850, loss[loss=0.2475, simple_loss=0.3247, pruned_loss=0.08513, over 21629.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3172, pruned_loss=0.08054, over 4277343.15 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:53:20,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1900776.0, ans=0.0 2023-06-25 03:53:25,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900776.0, ans=0.1 2023-06-25 03:53:55,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1900836.0, ans=0.2 2023-06-25 03:54:11,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1900896.0, ans=0.125 2023-06-25 03:54:15,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=22.5 2023-06-25 03:54:17,578 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 7.061e+02 9.969e+02 1.583e+03 3.889e+03, threshold=1.994e+03, percent-clipped=10.0 2023-06-25 03:55:15,612 INFO [train.py:996] (1/4) Epoch 11, batch 11900, loss[loss=0.2348, simple_loss=0.3545, pruned_loss=0.05752, over 20844.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3182, pruned_loss=0.07827, over 4271640.13 frames. 
], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:55:52,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1901136.0, ans=0.0 2023-06-25 03:57:11,036 INFO [train.py:996] (1/4) Epoch 11, batch 11950, loss[loss=0.177, simple_loss=0.2707, pruned_loss=0.04169, over 21762.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3214, pruned_loss=0.0755, over 4266205.17 frames. ], batch size: 316, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:57:56,079 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.868e+02 8.366e+02 1.305e+03 1.891e+03 4.761e+03, threshold=2.610e+03, percent-clipped=24.0 2023-06-25 03:58:29,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1901616.0, ans=0.125 2023-06-25 03:58:50,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1901616.0, ans=0.0 2023-06-25 03:58:53,130 INFO [train.py:996] (1/4) Epoch 11, batch 12000, loss[loss=0.2153, simple_loss=0.2785, pruned_loss=0.07605, over 21450.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3141, pruned_loss=0.07373, over 4261077.28 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:58:53,130 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 03:59:11,382 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2587, simple_loss=0.3514, pruned_loss=0.08303, over 1796401.00 frames. 2023-06-25 03:59:11,383 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 03:59:31,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901736.0, ans=0.1 2023-06-25 04:00:07,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1901796.0, ans=0.125 2023-06-25 04:00:15,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1901856.0, ans=0.125 2023-06-25 04:00:18,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901856.0, ans=0.1 2023-06-25 04:00:25,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-25 04:00:50,849 INFO [train.py:996] (1/4) Epoch 11, batch 12050, loss[loss=0.2196, simple_loss=0.2822, pruned_loss=0.0785, over 21512.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.312, pruned_loss=0.07583, over 4267557.28 frames. 
], batch size: 212, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:01:32,331 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1902036.0, ans=0.0 2023-06-25 04:01:36,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1902096.0, ans=0.2 2023-06-25 04:01:41,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1902096.0, ans=0.125 2023-06-25 04:01:44,075 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 7.721e+02 1.099e+03 1.708e+03 2.830e+03, threshold=2.199e+03, percent-clipped=2.0 2023-06-25 04:01:51,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1902156.0, ans=0.125 2023-06-25 04:01:53,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1902156.0, ans=0.125 2023-06-25 04:02:25,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1902216.0, ans=0.125 2023-06-25 04:02:41,832 INFO [train.py:996] (1/4) Epoch 11, batch 12100, loss[loss=0.1842, simple_loss=0.2376, pruned_loss=0.06535, over 20913.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3152, pruned_loss=0.07992, over 4271941.41 frames. ], batch size: 613, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:02:49,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1902276.0, ans=0.125 2023-06-25 04:03:00,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1902336.0, ans=0.125 2023-06-25 04:03:27,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1902396.0, ans=0.0 2023-06-25 04:03:31,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-25 04:04:25,634 INFO [train.py:996] (1/4) Epoch 11, batch 12150, loss[loss=0.2457, simple_loss=0.3493, pruned_loss=0.07102, over 21666.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3165, pruned_loss=0.07834, over 4273657.14 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:04:28,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1902576.0, ans=0.0 2023-06-25 04:04:43,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1902576.0, ans=0.125 2023-06-25 04:04:50,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1902636.0, ans=0.0 2023-06-25 04:04:53,976 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1902636.0, ans=0.05 2023-06-25 04:04:57,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1902636.0, ans=0.125 2023-06-25 04:05:19,758 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=15.0 2023-06-25 04:05:25,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1902696.0, ans=0.0 2023-06-25 04:05:28,517 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.672e+02 1.025e+03 1.712e+03 2.364e+03 4.484e+03, threshold=3.424e+03, percent-clipped=30.0 2023-06-25 04:06:12,885 INFO [train.py:996] (1/4) Epoch 11, batch 12200, loss[loss=0.2523, simple_loss=0.3071, pruned_loss=0.09875, over 21245.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3126, pruned_loss=0.07779, over 4273269.58 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:06:37,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1902936.0, ans=0.125 2023-06-25 04:06:38,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1902936.0, ans=0.125 2023-06-25 04:07:55,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1903116.0, ans=0.07 2023-06-25 04:07:58,656 INFO [train.py:996] (1/4) Epoch 11, batch 12250, loss[loss=0.178, simple_loss=0.262, pruned_loss=0.04698, over 21691.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3049, pruned_loss=0.07465, over 4271611.06 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:08:41,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=12.0 2023-06-25 04:08:59,371 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.457e+02 7.371e+02 1.190e+03 1.577e+03 4.141e+03, threshold=2.380e+03, percent-clipped=2.0 2023-06-25 04:09:16,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1903356.0, ans=0.125 2023-06-25 04:09:44,663 INFO [train.py:996] (1/4) Epoch 11, batch 12300, loss[loss=0.2109, simple_loss=0.3297, pruned_loss=0.04604, over 20744.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2977, pruned_loss=0.06918, over 4269810.95 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:09:47,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1903476.0, ans=0.125 2023-06-25 04:10:58,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-25 04:11:30,630 INFO [train.py:996] (1/4) Epoch 11, batch 12350, loss[loss=0.224, simple_loss=0.2864, pruned_loss=0.08084, over 20994.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3023, pruned_loss=0.0699, over 4264915.99 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:11:50,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1903776.0, ans=0.125 2023-06-25 04:11:50,776 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. 
limit=15.0 2023-06-25 04:11:56,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1903836.0, ans=0.125 2023-06-25 04:12:09,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1903836.0, ans=0.125 2023-06-25 04:12:25,375 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1903896.0, ans=0.1 2023-06-25 04:12:26,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1903896.0, ans=0.1 2023-06-25 04:12:28,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1903896.0, ans=0.0 2023-06-25 04:12:28,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1903896.0, ans=0.07 2023-06-25 04:12:31,426 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.670e+02 1.217e+03 1.964e+03 4.834e+03, threshold=2.433e+03, percent-clipped=16.0 2023-06-25 04:12:31,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1903896.0, ans=0.125 2023-06-25 04:13:16,319 INFO [train.py:996] (1/4) Epoch 11, batch 12400, loss[loss=0.2429, simple_loss=0.3004, pruned_loss=0.09273, over 21295.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3043, pruned_loss=0.07343, over 4275965.84 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:13:44,474 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:13:49,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1904136.0, ans=0.125 2023-06-25 04:13:54,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1904136.0, ans=0.04949747468305833 2023-06-25 04:13:58,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1904136.0, ans=0.125 2023-06-25 04:14:02,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.93 vs. limit=10.0 2023-06-25 04:14:10,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1904196.0, ans=0.125 2023-06-25 04:14:17,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1904196.0, ans=0.2 2023-06-25 04:14:55,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1904316.0, ans=0.125 2023-06-25 04:15:03,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1904316.0, ans=0.0 2023-06-25 04:15:07,773 INFO [train.py:996] (1/4) Epoch 11, batch 12450, loss[loss=0.2803, simple_loss=0.3527, pruned_loss=0.1039, over 21804.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3083, pruned_loss=0.07653, over 4280051.07 frames. 
], batch size: 441, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:15:08,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1904376.0, ans=0.125 2023-06-25 04:16:10,965 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 6.913e+02 8.483e+02 1.165e+03 2.704e+03, threshold=1.697e+03, percent-clipped=3.0 2023-06-25 04:17:02,376 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1904676.0, ans=0.125 2023-06-25 04:17:03,388 INFO [train.py:996] (1/4) Epoch 11, batch 12500, loss[loss=0.2115, simple_loss=0.3372, pruned_loss=0.04292, over 20752.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.319, pruned_loss=0.08022, over 4279918.00 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:17:44,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1904736.0, ans=0.125 2023-06-25 04:17:50,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904796.0, ans=0.1 2023-06-25 04:18:06,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-25 04:18:19,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1904856.0, ans=0.125 2023-06-25 04:18:35,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1904856.0, ans=0.2 2023-06-25 04:18:42,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1904916.0, ans=0.1 2023-06-25 04:19:02,310 INFO [train.py:996] (1/4) Epoch 11, batch 12550, loss[loss=0.2539, simple_loss=0.3257, pruned_loss=0.09108, over 21278.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.323, pruned_loss=0.08208, over 4275893.56 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:19:03,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1904976.0, ans=0.125 2023-06-25 04:19:41,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1905096.0, ans=0.95 2023-06-25 04:20:07,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.343e+02 7.503e+02 1.080e+03 1.641e+03 3.839e+03, threshold=2.159e+03, percent-clipped=20.0 2023-06-25 04:20:52,627 INFO [train.py:996] (1/4) Epoch 11, batch 12600, loss[loss=0.1817, simple_loss=0.2491, pruned_loss=0.05717, over 21837.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3198, pruned_loss=0.07967, over 4266820.02 frames. ], batch size: 98, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:21:19,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1905336.0, ans=0.0 2023-06-25 04:21:26,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=10.0 2023-06-25 04:21:51,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1905456.0, ans=0.07 2023-06-25 04:22:01,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1905456.0, ans=0.125 2023-06-25 04:22:14,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905516.0, ans=0.1 2023-06-25 04:22:33,091 INFO [train.py:996] (1/4) Epoch 11, batch 12650, loss[loss=0.2264, simple_loss=0.2983, pruned_loss=0.07725, over 21401.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.312, pruned_loss=0.07635, over 4268039.67 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:22:47,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1905576.0, ans=0.125 2023-06-25 04:22:56,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1905636.0, ans=0.0 2023-06-25 04:23:00,491 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:23:37,022 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.454e+02 1.042e+03 1.689e+03 3.142e+03, threshold=2.085e+03, percent-clipped=12.0 2023-06-25 04:23:51,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1905756.0, ans=0.125 2023-06-25 04:23:52,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1905756.0, ans=0.125 2023-06-25 04:23:58,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1905816.0, ans=0.125 2023-06-25 04:24:28,097 INFO [train.py:996] (1/4) Epoch 11, batch 12700, loss[loss=0.2511, simple_loss=0.3169, pruned_loss=0.09271, over 21363.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3118, pruned_loss=0.07883, over 4271518.12 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:24:53,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1905936.0, ans=0.2 2023-06-25 04:25:16,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1905996.0, ans=0.0 2023-06-25 04:26:13,877 INFO [train.py:996] (1/4) Epoch 11, batch 12750, loss[loss=0.2814, simple_loss=0.357, pruned_loss=0.1029, over 21570.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3138, pruned_loss=0.07936, over 4267585.28 frames. 
], batch size: 509, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:26:38,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1906236.0, ans=0.2 2023-06-25 04:26:51,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1906296.0, ans=0.0 2023-06-25 04:27:09,559 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.849e+02 1.051e+03 1.343e+03 1.949e+03 4.528e+03, threshold=2.685e+03, percent-clipped=20.0 2023-06-25 04:27:11,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1906296.0, ans=0.125 2023-06-25 04:27:26,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1906356.0, ans=0.2 2023-06-25 04:28:00,583 INFO [train.py:996] (1/4) Epoch 11, batch 12800, loss[loss=0.2635, simple_loss=0.3369, pruned_loss=0.09499, over 21746.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3143, pruned_loss=0.07996, over 4278346.88 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:28:14,424 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:29:42,133 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:29:50,504 INFO [train.py:996] (1/4) Epoch 11, batch 12850, loss[loss=0.2524, simple_loss=0.3355, pruned_loss=0.08468, over 21341.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.317, pruned_loss=0.08123, over 4275141.63 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:30:06,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1906776.0, ans=0.0 2023-06-25 04:30:31,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1906836.0, ans=0.125 2023-06-25 04:30:53,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.548e+02 7.415e+02 1.066e+03 1.369e+03 3.330e+03, threshold=2.132e+03, percent-clipped=6.0 2023-06-25 04:30:57,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1906956.0, ans=0.125 2023-06-25 04:31:15,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-25 04:31:21,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1907016.0, ans=0.125 2023-06-25 04:31:26,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1907016.0, ans=6.0 2023-06-25 04:31:43,327 INFO [train.py:996] (1/4) Epoch 11, batch 12900, loss[loss=0.2502, simple_loss=0.3309, pruned_loss=0.08475, over 21874.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3127, pruned_loss=0.07662, over 4271933.27 frames. ], batch size: 373, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:33:26,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=6.0 2023-06-25 04:33:27,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1907316.0, ans=0.125 2023-06-25 04:33:29,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1907316.0, ans=0.0 2023-06-25 04:33:33,472 INFO [train.py:996] (1/4) Epoch 11, batch 12950, loss[loss=0.1674, simple_loss=0.2461, pruned_loss=0.04436, over 21217.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.312, pruned_loss=0.07615, over 4276877.36 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:33:39,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1907376.0, ans=0.04949747468305833 2023-06-25 04:34:29,119 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1907496.0, ans=0.2 2023-06-25 04:34:31,956 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 6.927e+02 1.132e+03 1.522e+03 3.743e+03, threshold=2.263e+03, percent-clipped=8.0 2023-06-25 04:35:21,594 INFO [train.py:996] (1/4) Epoch 11, batch 13000, loss[loss=0.1985, simple_loss=0.274, pruned_loss=0.06148, over 21126.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3105, pruned_loss=0.07573, over 4272962.93 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:35:25,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1907676.0, ans=0.0 2023-06-25 04:36:10,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-25 04:36:42,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1907916.0, ans=0.2 2023-06-25 04:36:51,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1907916.0, ans=0.125 2023-06-25 04:37:06,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1907976.0, ans=0.2 2023-06-25 04:37:07,532 INFO [train.py:996] (1/4) Epoch 11, batch 13050, loss[loss=0.2703, simple_loss=0.3321, pruned_loss=0.1042, over 21779.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3053, pruned_loss=0.07386, over 4274012.65 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:37:31,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-25 04:37:56,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.30 vs. 
limit=10.0 2023-06-25 04:38:05,027 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.302e+02 9.567e+02 1.329e+03 2.389e+03, threshold=1.913e+03, percent-clipped=1.0 2023-06-25 04:38:24,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1908156.0, ans=0.015 2023-06-25 04:38:42,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1908216.0, ans=0.0 2023-06-25 04:38:53,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908216.0, ans=0.1 2023-06-25 04:38:55,907 INFO [train.py:996] (1/4) Epoch 11, batch 13100, loss[loss=0.2353, simple_loss=0.3204, pruned_loss=0.07512, over 21483.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3083, pruned_loss=0.07428, over 4275278.21 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:39:52,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1908396.0, ans=0.125 2023-06-25 04:40:40,283 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-25 04:40:41,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1908516.0, ans=0.0 2023-06-25 04:40:45,504 INFO [train.py:996] (1/4) Epoch 11, batch 13150, loss[loss=0.2543, simple_loss=0.3173, pruned_loss=0.09561, over 21620.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.312, pruned_loss=0.07636, over 4273423.66 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:41:31,535 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1908696.0, ans=0.0 2023-06-25 04:41:55,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.569e+02 1.234e+03 1.722e+03 3.917e+03, threshold=2.467e+03, percent-clipped=21.0 2023-06-25 04:42:46,131 INFO [train.py:996] (1/4) Epoch 11, batch 13200, loss[loss=0.2585, simple_loss=0.3265, pruned_loss=0.09532, over 21674.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3121, pruned_loss=0.07716, over 4278052.34 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:43:11,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1908936.0, ans=0.2 2023-06-25 04:43:37,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-25 04:43:45,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1908996.0, ans=0.2 2023-06-25 04:43:45,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908996.0, ans=0.1 2023-06-25 04:43:50,138 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.64 vs. 
limit=22.5 2023-06-25 04:43:59,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1909056.0, ans=0.125 2023-06-25 04:44:01,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1909056.0, ans=0.0 2023-06-25 04:44:01,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=12.0 2023-06-25 04:44:29,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1909116.0, ans=0.0 2023-06-25 04:44:30,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-25 04:44:34,077 INFO [train.py:996] (1/4) Epoch 11, batch 13250, loss[loss=0.2267, simple_loss=0.3121, pruned_loss=0.07058, over 21844.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3119, pruned_loss=0.0785, over 4280493.08 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:44:41,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1909176.0, ans=0.125 2023-06-25 04:45:11,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-25 04:45:23,362 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1909296.0, ans=0.0 2023-06-25 04:45:32,789 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 1.027e+03 1.488e+03 2.200e+03 4.599e+03, threshold=2.975e+03, percent-clipped=16.0 2023-06-25 04:45:47,603 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-25 04:46:08,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1909416.0, ans=0.0 2023-06-25 04:46:11,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1909416.0, ans=0.125 2023-06-25 04:46:21,118 INFO [train.py:996] (1/4) Epoch 11, batch 13300, loss[loss=0.2548, simple_loss=0.3347, pruned_loss=0.08748, over 21764.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3137, pruned_loss=0.07831, over 4285229.92 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:46:57,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1909536.0, ans=0.0 2023-06-25 04:47:02,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1909536.0, ans=0.125 2023-06-25 04:47:25,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1909656.0, ans=0.125 2023-06-25 04:47:28,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1909656.0, ans=0.125 2023-06-25 04:48:09,248 INFO [train.py:996] (1/4) Epoch 11, batch 13350, loss[loss=0.2813, simple_loss=0.3575, pruned_loss=0.1026, over 21436.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3185, pruned_loss=0.08158, over 4282668.23 frames. 
], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:49:08,201 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.321e+02 8.273e+02 1.155e+03 1.760e+03 3.459e+03, threshold=2.310e+03, percent-clipped=3.0 2023-06-25 04:49:09,441 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.58 vs. limit=5.0 2023-06-25 04:49:52,112 INFO [train.py:996] (1/4) Epoch 11, batch 13400, loss[loss=0.2359, simple_loss=0.3104, pruned_loss=0.08069, over 21831.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3193, pruned_loss=0.08349, over 4270243.72 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:50:06,808 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-25 04:50:09,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1910076.0, ans=0.2 2023-06-25 04:50:35,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1910196.0, ans=0.0 2023-06-25 04:51:06,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1910256.0, ans=0.0 2023-06-25 04:51:39,486 INFO [train.py:996] (1/4) Epoch 11, batch 13450, loss[loss=0.215, simple_loss=0.2773, pruned_loss=0.0763, over 16899.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3223, pruned_loss=0.08568, over 4269137.17 frames. ], batch size: 60, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:51:55,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1910436.0, ans=0.2 2023-06-25 04:52:27,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1910496.0, ans=0.125 2023-06-25 04:52:32,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1910496.0, ans=0.0 2023-06-25 04:52:36,803 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 8.174e+02 1.187e+03 1.780e+03 3.541e+03, threshold=2.373e+03, percent-clipped=13.0 2023-06-25 04:52:46,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-25 04:53:11,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1910616.0, ans=0.5 2023-06-25 04:53:14,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1910616.0, ans=0.125 2023-06-25 04:53:26,294 INFO [train.py:996] (1/4) Epoch 11, batch 13500, loss[loss=0.1829, simple_loss=0.2408, pruned_loss=0.06254, over 21345.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.315, pruned_loss=0.08267, over 4269074.47 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:53:53,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910736.0, ans=0.1 2023-06-25 04:54:32,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1910856.0, ans=0.0 2023-06-25 04:54:34,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1910856.0, ans=0.125 2023-06-25 04:54:49,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1910856.0, ans=0.0 2023-06-25 04:55:07,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1910916.0, ans=0.125 2023-06-25 04:55:13,750 INFO [train.py:996] (1/4) Epoch 11, batch 13550, loss[loss=0.2837, simple_loss=0.3832, pruned_loss=0.0921, over 21688.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3182, pruned_loss=0.08198, over 4275359.77 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:55:26,095 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:55:58,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1911096.0, ans=0.0 2023-06-25 04:56:11,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.777e+02 1.227e+03 1.710e+03 3.921e+03, threshold=2.454e+03, percent-clipped=8.0 2023-06-25 04:56:47,067 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.12 vs. limit=15.0 2023-06-25 04:57:01,030 INFO [train.py:996] (1/4) Epoch 11, batch 13600, loss[loss=0.2268, simple_loss=0.2946, pruned_loss=0.07952, over 21768.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3163, pruned_loss=0.08198, over 4273302.03 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:57:22,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1911276.0, ans=0.2 2023-06-25 04:57:22,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1911276.0, ans=0.0 2023-06-25 04:58:14,178 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-25 04:58:15,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1911456.0, ans=0.07 2023-06-25 04:58:42,962 INFO [train.py:996] (1/4) Epoch 11, batch 13650, loss[loss=0.2018, simple_loss=0.2631, pruned_loss=0.0702, over 21344.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3106, pruned_loss=0.07888, over 4272441.92 frames. 
], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:59:10,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1911636.0, ans=0.125 2023-06-25 04:59:19,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1911636.0, ans=0.0 2023-06-25 04:59:26,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1911696.0, ans=0.125 2023-06-25 04:59:48,576 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.510e+02 1.024e+03 1.563e+03 2.533e+03, threshold=2.048e+03, percent-clipped=2.0 2023-06-25 04:59:58,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1911756.0, ans=0.125 2023-06-25 04:59:58,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1911756.0, ans=0.5 2023-06-25 05:00:11,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1911756.0, ans=0.2 2023-06-25 05:00:35,837 INFO [train.py:996] (1/4) Epoch 11, batch 13700, loss[loss=0.2104, simple_loss=0.2713, pruned_loss=0.07473, over 21232.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3073, pruned_loss=0.07803, over 4263473.49 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:01:11,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1911936.0, ans=0.1 2023-06-25 05:02:30,051 INFO [train.py:996] (1/4) Epoch 11, batch 13750, loss[loss=0.1921, simple_loss=0.2511, pruned_loss=0.06659, over 21228.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3046, pruned_loss=0.07658, over 4260480.80 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:02:38,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-25 05:03:33,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 9.228e+02 1.294e+03 2.214e+03 4.699e+03, threshold=2.588e+03, percent-clipped=28.0 2023-06-25 05:03:39,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1912356.0, ans=0.125 2023-06-25 05:04:20,983 INFO [train.py:996] (1/4) Epoch 11, batch 13800, loss[loss=0.2577, simple_loss=0.3668, pruned_loss=0.07429, over 21691.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3091, pruned_loss=0.07594, over 4266363.86 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:04:27,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. 
limit=15.0 2023-06-25 05:04:29,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1912476.0, ans=0.125 2023-06-25 05:05:08,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912596.0, ans=0.1 2023-06-25 05:05:44,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1912656.0, ans=0.0 2023-06-25 05:06:00,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912716.0, ans=0.1 2023-06-25 05:06:07,149 INFO [train.py:996] (1/4) Epoch 11, batch 13850, loss[loss=0.2413, simple_loss=0.3212, pruned_loss=0.0807, over 21771.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3166, pruned_loss=0.07656, over 4259735.88 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:06:30,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1912776.0, ans=0.2 2023-06-25 05:06:34,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1912836.0, ans=0.2 2023-06-25 05:07:04,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1912896.0, ans=0.125 2023-06-25 05:07:13,530 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.739e+02 1.067e+03 1.553e+03 4.213e+03, threshold=2.133e+03, percent-clipped=6.0 2023-06-25 05:07:22,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1912956.0, ans=0.125 2023-06-25 05:07:42,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1913016.0, ans=0.0 2023-06-25 05:07:51,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1913076.0, ans=0.07 2023-06-25 05:07:52,246 INFO [train.py:996] (1/4) Epoch 11, batch 13900, loss[loss=0.2543, simple_loss=0.3233, pruned_loss=0.09266, over 21906.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3197, pruned_loss=0.0798, over 4266955.57 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:08:20,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1913136.0, ans=0.0 2023-06-25 05:08:30,986 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-25 05:08:48,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1913196.0, ans=0.1 2023-06-25 05:09:45,238 INFO [train.py:996] (1/4) Epoch 11, batch 13950, loss[loss=0.2406, simple_loss=0.3045, pruned_loss=0.08841, over 19986.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3193, pruned_loss=0.08186, over 4272409.29 frames. 
], batch size: 703, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:09:57,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1913376.0, ans=0.2 2023-06-25 05:10:13,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1913436.0, ans=0.07 2023-06-25 05:10:14,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1913436.0, ans=0.0 2023-06-25 05:10:17,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1913436.0, ans=0.125 2023-06-25 05:10:31,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1913496.0, ans=0.0 2023-06-25 05:10:34,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-25 05:10:50,572 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.156e+02 8.731e+02 1.156e+03 1.746e+03 2.860e+03, threshold=2.312e+03, percent-clipped=13.0 2023-06-25 05:10:54,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1913556.0, ans=0.0 2023-06-25 05:11:29,098 INFO [train.py:996] (1/4) Epoch 11, batch 14000, loss[loss=0.1978, simple_loss=0.2805, pruned_loss=0.05751, over 21584.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3149, pruned_loss=0.07973, over 4272743.20 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:11:34,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1913676.0, ans=0.125 2023-06-25 05:11:47,843 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-25 05:13:16,594 INFO [train.py:996] (1/4) Epoch 11, batch 14050, loss[loss=0.2172, simple_loss=0.2863, pruned_loss=0.07407, over 21673.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3111, pruned_loss=0.07638, over 4268607.93 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:13:36,896 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-25 05:13:39,975 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-25 05:13:49,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1914036.0, ans=0.5 2023-06-25 05:14:24,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 7.745e+02 1.137e+03 1.921e+03 3.840e+03, threshold=2.273e+03, percent-clipped=15.0 2023-06-25 05:14:33,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1914156.0, ans=0.0 2023-06-25 05:15:04,522 INFO [train.py:996] (1/4) Epoch 11, batch 14100, loss[loss=0.2439, simple_loss=0.3105, pruned_loss=0.08863, over 19899.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3041, pruned_loss=0.07606, over 4262497.66 frames. 
], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:15:30,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1914336.0, ans=0.0 2023-06-25 05:16:36,476 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1914516.0, ans=0.125 2023-06-25 05:16:46,932 INFO [train.py:996] (1/4) Epoch 11, batch 14150, loss[loss=0.2274, simple_loss=0.3485, pruned_loss=0.05311, over 19837.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3082, pruned_loss=0.07735, over 4241814.36 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:17:41,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1914696.0, ans=0.125 2023-06-25 05:17:50,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1914756.0, ans=0.2 2023-06-25 05:17:51,156 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 7.327e+02 9.497e+02 1.308e+03 3.394e+03, threshold=1.899e+03, percent-clipped=3.0 2023-06-25 05:18:20,345 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:18:29,912 INFO [train.py:996] (1/4) Epoch 11, batch 14200, loss[loss=0.2315, simple_loss=0.3055, pruned_loss=0.07877, over 21820.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.308, pruned_loss=0.0764, over 4243704.07 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:19:12,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1914996.0, ans=0.125 2023-06-25 05:19:21,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-25 05:20:09,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1915116.0, ans=0.2 2023-06-25 05:20:14,315 INFO [train.py:996] (1/4) Epoch 11, batch 14250, loss[loss=0.1929, simple_loss=0.2757, pruned_loss=0.05506, over 21663.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3019, pruned_loss=0.07559, over 4246944.44 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:20:27,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1915176.0, ans=0.125 2023-06-25 05:20:56,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1915236.0, ans=0.2 2023-06-25 05:21:24,834 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.052e+02 9.633e+02 1.519e+03 2.693e+03, threshold=1.927e+03, percent-clipped=14.0 2023-06-25 05:22:03,011 INFO [train.py:996] (1/4) Epoch 11, batch 14300, loss[loss=0.335, simple_loss=0.4433, pruned_loss=0.1133, over 21256.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3043, pruned_loss=0.07472, over 4233710.37 frames. 
], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:23:25,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915656.0, ans=0.1 2023-06-25 05:23:49,084 INFO [train.py:996] (1/4) Epoch 11, batch 14350, loss[loss=0.2216, simple_loss=0.3093, pruned_loss=0.0669, over 21861.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3104, pruned_loss=0.07566, over 4236997.88 frames. ], batch size: 371, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:24:01,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1915776.0, ans=0.1 2023-06-25 05:24:31,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1915836.0, ans=0.025 2023-06-25 05:24:47,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1915896.0, ans=0.07 2023-06-25 05:24:56,261 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.843e+02 8.086e+02 1.263e+03 2.324e+03 6.942e+03, threshold=2.526e+03, percent-clipped=29.0 2023-06-25 05:25:27,525 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-06-25 05:25:27,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1916016.0, ans=15.0 2023-06-25 05:25:29,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-25 05:25:34,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.94 vs. limit=10.0 2023-06-25 05:25:34,883 INFO [train.py:996] (1/4) Epoch 11, batch 14400, loss[loss=0.2456, simple_loss=0.3072, pruned_loss=0.092, over 21691.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3076, pruned_loss=0.07565, over 4241831.82 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 05:25:59,557 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-25 05:26:03,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1916136.0, ans=0.1 2023-06-25 05:26:03,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1916136.0, ans=0.125 2023-06-25 05:26:15,548 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.04 vs. 
limit=15.0 2023-06-25 05:26:47,063 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:26:58,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1916256.0, ans=0.125 2023-06-25 05:27:00,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1916256.0, ans=0.0 2023-06-25 05:27:28,167 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1916376.0, ans=0.1 2023-06-25 05:27:29,134 INFO [train.py:996] (1/4) Epoch 11, batch 14450, loss[loss=0.2311, simple_loss=0.2897, pruned_loss=0.08623, over 21306.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3051, pruned_loss=0.07699, over 4254053.99 frames. ], batch size: 177, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:27:47,012 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1916376.0, ans=0.125 2023-06-25 05:27:56,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1916436.0, ans=0.0 2023-06-25 05:28:30,490 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 8.129e+02 1.231e+03 1.648e+03 3.274e+03, threshold=2.462e+03, percent-clipped=7.0 2023-06-25 05:28:32,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1916556.0, ans=0.125 2023-06-25 05:28:34,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1916556.0, ans=0.0 2023-06-25 05:28:57,690 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-25 05:29:07,870 INFO [train.py:996] (1/4) Epoch 11, batch 14500, loss[loss=0.2163, simple_loss=0.3179, pruned_loss=0.05731, over 19991.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3, pruned_loss=0.07602, over 4259377.57 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:29:22,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1916676.0, ans=0.2 2023-06-25 05:30:11,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1916796.0, ans=0.125 2023-06-25 05:30:12,807 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-25 05:30:39,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1916916.0, ans=0.1 2023-06-25 05:30:39,870 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-25 05:30:57,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1916916.0, ans=0.125 2023-06-25 05:31:01,913 INFO [train.py:996] (1/4) Epoch 11, batch 14550, loss[loss=0.2608, simple_loss=0.3363, pruned_loss=0.09261, over 21691.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3059, pruned_loss=0.07825, over 4260092.86 frames. 
], batch size: 351, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:31:13,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1916976.0, ans=0.125 2023-06-25 05:31:25,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1917036.0, ans=0.2 2023-06-25 05:32:03,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1917096.0, ans=0.0 2023-06-25 05:32:03,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1917096.0, ans=0.125 2023-06-25 05:32:13,607 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.818e+02 8.368e+02 1.257e+03 1.782e+03 3.337e+03, threshold=2.514e+03, percent-clipped=4.0 2023-06-25 05:32:44,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1917216.0, ans=0.125 2023-06-25 05:32:56,093 INFO [train.py:996] (1/4) Epoch 11, batch 14600, loss[loss=0.243, simple_loss=0.3353, pruned_loss=0.07538, over 21743.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.313, pruned_loss=0.08216, over 4259126.50 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:33:03,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1917276.0, ans=0.0 2023-06-25 05:33:08,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1917276.0, ans=0.2 2023-06-25 05:34:00,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1917456.0, ans=0.125 2023-06-25 05:34:44,125 INFO [train.py:996] (1/4) Epoch 11, batch 14650, loss[loss=0.2566, simple_loss=0.3308, pruned_loss=0.09122, over 21386.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3146, pruned_loss=0.0811, over 4264345.85 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:34:44,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1917576.0, ans=0.05 2023-06-25 05:35:00,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1917636.0, ans=0.0 2023-06-25 05:35:41,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1917696.0, ans=0.0 2023-06-25 05:35:50,888 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 8.792e+02 1.262e+03 1.854e+03 3.152e+03, threshold=2.525e+03, percent-clipped=6.0 2023-06-25 05:35:54,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1917756.0, ans=0.125 2023-06-25 05:36:33,372 INFO [train.py:996] (1/4) Epoch 11, batch 14700, loss[loss=0.1723, simple_loss=0.2666, pruned_loss=0.03896, over 21663.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3083, pruned_loss=0.07519, over 4257812.10 frames. 
], batch size: 247, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:37:00,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1917936.0, ans=0.125 2023-06-25 05:37:53,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-25 05:38:22,518 INFO [train.py:996] (1/4) Epoch 11, batch 14750, loss[loss=0.2149, simple_loss=0.2747, pruned_loss=0.07755, over 20080.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3122, pruned_loss=0.07675, over 4264787.11 frames. ], batch size: 703, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:39:02,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-25 05:39:16,199 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1918296.0, ans=0.125 2023-06-25 05:39:42,621 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.909e+02 1.142e+03 1.631e+03 3.263e+03, threshold=2.283e+03, percent-clipped=2.0 2023-06-25 05:39:50,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-25 05:40:17,896 INFO [train.py:996] (1/4) Epoch 11, batch 14800, loss[loss=0.2299, simple_loss=0.2972, pruned_loss=0.08135, over 21471.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3246, pruned_loss=0.08276, over 4262780.63 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:40:32,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1918476.0, ans=0.125 2023-06-25 05:40:34,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1918476.0, ans=0.125 2023-06-25 05:40:39,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1918536.0, ans=0.0 2023-06-25 05:40:40,279 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=15.0 2023-06-25 05:40:44,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1918536.0, ans=0.2 2023-06-25 05:40:52,228 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-25 05:40:56,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1918596.0, ans=0.125 2023-06-25 05:40:58,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1918596.0, ans=0.125 2023-06-25 05:41:30,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1918656.0, ans=0.0 2023-06-25 05:41:52,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1918716.0, ans=0.125 2023-06-25 05:42:13,946 INFO [train.py:996] (1/4) Epoch 11, batch 14850, loss[loss=0.2669, simple_loss=0.3372, pruned_loss=0.09828, over 21823.00 frames. 
], tot_loss[loss=0.2416, simple_loss=0.3187, pruned_loss=0.08226, over 4257880.85 frames. ], batch size: 372, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:42:28,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1918776.0, ans=0.07 2023-06-25 05:43:25,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 9.409e+02 1.250e+03 2.186e+03 4.588e+03, threshold=2.500e+03, percent-clipped=20.0 2023-06-25 05:43:25,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1918956.0, ans=0.2 2023-06-25 05:43:39,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1919016.0, ans=0.0 2023-06-25 05:44:03,313 INFO [train.py:996] (1/4) Epoch 11, batch 14900, loss[loss=0.29, simple_loss=0.3698, pruned_loss=0.1051, over 20682.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3226, pruned_loss=0.08367, over 4257810.48 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:44:41,916 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-25 05:45:34,588 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:45:38,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1919316.0, ans=0.035 2023-06-25 05:45:50,691 INFO [train.py:996] (1/4) Epoch 11, batch 14950, loss[loss=0.2421, simple_loss=0.3163, pruned_loss=0.0839, over 21353.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3239, pruned_loss=0.08295, over 4260482.04 frames. ], batch size: 176, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:47:02,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-25 05:47:03,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.346e+02 1.154e+03 1.605e+03 2.804e+03, threshold=2.309e+03, percent-clipped=2.0 2023-06-25 05:47:20,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1919556.0, ans=0.125 2023-06-25 05:47:39,739 INFO [train.py:996] (1/4) Epoch 11, batch 15000, loss[loss=0.2674, simple_loss=0.3808, pruned_loss=0.07703, over 20740.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3257, pruned_loss=0.08469, over 4271107.07 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:47:39,740 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 05:48:02,329 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2537, simple_loss=0.3474, pruned_loss=0.08002, over 1796401.00 frames. 2023-06-25 05:48:02,329 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 05:48:28,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1919736.0, ans=0.125 2023-06-25 05:49:50,634 INFO [train.py:996] (1/4) Epoch 11, batch 15050, loss[loss=0.2237, simple_loss=0.3172, pruned_loss=0.06506, over 21708.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3249, pruned_loss=0.08558, over 4269799.29 frames. 
], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:50:03,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1919976.0, ans=0.1 2023-06-25 05:50:54,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1920156.0, ans=10.0 2023-06-25 05:50:57,305 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.512e+02 8.828e+02 1.154e+03 1.761e+03 2.876e+03, threshold=2.308e+03, percent-clipped=7.0 2023-06-25 05:51:24,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1920216.0, ans=0.0 2023-06-25 05:51:39,410 INFO [train.py:996] (1/4) Epoch 11, batch 15100, loss[loss=0.2538, simple_loss=0.3386, pruned_loss=0.08453, over 21858.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3273, pruned_loss=0.08582, over 4269596.09 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:51:59,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1920336.0, ans=0.1 2023-06-25 05:51:59,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1920336.0, ans=0.125 2023-06-25 05:52:26,614 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1920396.0, ans=0.125 2023-06-25 05:53:29,040 INFO [train.py:996] (1/4) Epoch 11, batch 15150, loss[loss=0.2132, simple_loss=0.2794, pruned_loss=0.07355, over 21311.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3232, pruned_loss=0.08556, over 4267590.08 frames. ], batch size: 549, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:53:38,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1920576.0, ans=0.05 2023-06-25 05:54:03,029 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-25 05:54:44,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.134e+02 1.396e+03 2.248e+03 4.445e+03, threshold=2.791e+03, percent-clipped=24.0 2023-06-25 05:54:49,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1920756.0, ans=0.125 2023-06-25 05:54:52,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1920756.0, ans=0.0 2023-06-25 05:54:54,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1920756.0, ans=0.125 2023-06-25 05:55:18,880 INFO [train.py:996] (1/4) Epoch 11, batch 15200, loss[loss=0.1832, simple_loss=0.2763, pruned_loss=0.04507, over 21799.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3151, pruned_loss=0.0819, over 4269600.24 frames. 
], batch size: 317, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:55:19,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1920876.0, ans=0.2 2023-06-25 05:55:49,526 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:55:51,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1920936.0, ans=0.04949747468305833 2023-06-25 05:55:51,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1920936.0, ans=0.0 2023-06-25 05:56:30,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1921056.0, ans=0.125 2023-06-25 05:56:44,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1921116.0, ans=0.5 2023-06-25 05:57:06,654 INFO [train.py:996] (1/4) Epoch 11, batch 15250, loss[loss=0.245, simple_loss=0.2968, pruned_loss=0.09661, over 21318.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3109, pruned_loss=0.08084, over 4268465.36 frames. ], batch size: 473, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:57:31,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1921236.0, ans=0.2 2023-06-25 05:57:49,270 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-25 05:58:19,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.800e+02 7.712e+02 1.026e+03 1.486e+03 3.458e+03, threshold=2.053e+03, percent-clipped=2.0 2023-06-25 05:58:27,680 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-25 05:58:30,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1921416.0, ans=0.0 2023-06-25 05:58:53,097 INFO [train.py:996] (1/4) Epoch 11, batch 15300, loss[loss=0.2336, simple_loss=0.3083, pruned_loss=0.07942, over 21600.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3116, pruned_loss=0.0823, over 4270436.35 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:00:48,357 INFO [train.py:996] (1/4) Epoch 11, batch 15350, loss[loss=0.1974, simple_loss=0.2685, pruned_loss=0.06321, over 20780.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.316, pruned_loss=0.08432, over 4273242.36 frames. ], batch size: 609, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:00:57,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1921776.0, ans=0.0 2023-06-25 06:01:21,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. 
limit=6.0 2023-06-25 06:01:24,281 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:01:53,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.974e+02 1.016e+03 1.491e+03 3.012e+03, threshold=2.032e+03, percent-clipped=10.0 2023-06-25 06:02:19,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1922016.0, ans=0.0 2023-06-25 06:02:27,022 INFO [train.py:996] (1/4) Epoch 11, batch 15400, loss[loss=0.294, simple_loss=0.4039, pruned_loss=0.0921, over 19763.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3167, pruned_loss=0.08291, over 4276826.33 frames. ], batch size: 703, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:04:11,534 INFO [train.py:996] (1/4) Epoch 11, batch 15450, loss[loss=0.2107, simple_loss=0.2982, pruned_loss=0.06164, over 21763.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3157, pruned_loss=0.08257, over 4283225.70 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:05:00,288 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:05:25,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 7.354e+02 9.513e+02 1.338e+03 2.588e+03, threshold=1.903e+03, percent-clipped=5.0 2023-06-25 06:06:03,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1922676.0, ans=0.125 2023-06-25 06:06:04,760 INFO [train.py:996] (1/4) Epoch 11, batch 15500, loss[loss=0.2032, simple_loss=0.2868, pruned_loss=0.05987, over 16176.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3189, pruned_loss=0.08237, over 4275779.79 frames. ], batch size: 60, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:06:06,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1922676.0, ans=0.1 2023-06-25 06:06:52,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-25 06:07:54,196 INFO [train.py:996] (1/4) Epoch 11, batch 15550, loss[loss=0.2178, simple_loss=0.2948, pruned_loss=0.07037, over 21695.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3161, pruned_loss=0.07896, over 4267324.48 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:08:11,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-25 06:08:12,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-25 06:08:47,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.39 vs. 
limit=22.5 2023-06-25 06:09:07,398 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.113e+02 7.965e+02 1.145e+03 1.833e+03 5.244e+03, threshold=2.290e+03, percent-clipped=21.0 2023-06-25 06:09:21,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1923216.0, ans=0.0 2023-06-25 06:09:33,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1923216.0, ans=0.2 2023-06-25 06:09:41,647 INFO [train.py:996] (1/4) Epoch 11, batch 15600, loss[loss=0.2538, simple_loss=0.3197, pruned_loss=0.09398, over 21371.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3104, pruned_loss=0.07743, over 4264338.45 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:09:51,666 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-25 06:09:57,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1923276.0, ans=0.2 2023-06-25 06:09:58,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=15.0 2023-06-25 06:10:02,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1923336.0, ans=0.0 2023-06-25 06:10:15,900 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:10:28,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1923396.0, ans=0.2 2023-06-25 06:10:52,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1923456.0, ans=0.05 2023-06-25 06:11:33,818 INFO [train.py:996] (1/4) Epoch 11, batch 15650, loss[loss=0.2229, simple_loss=0.2836, pruned_loss=0.08106, over 21752.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3083, pruned_loss=0.07688, over 4263519.30 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:12:14,937 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:12:14,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1923696.0, ans=0.04949747468305833 2023-06-25 06:12:43,345 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.231e+02 1.048e+03 1.538e+03 3.677e+03, threshold=2.096e+03, percent-clipped=8.0 2023-06-25 06:13:19,761 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-25 06:13:23,092 INFO [train.py:996] (1/4) Epoch 11, batch 15700, loss[loss=0.208, simple_loss=0.2713, pruned_loss=0.07234, over 21279.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3039, pruned_loss=0.07631, over 4252210.33 frames. 
], batch size: 144, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:13:43,027 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:13:45,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1923936.0, ans=0.125 2023-06-25 06:14:05,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1923996.0, ans=0.0 2023-06-25 06:14:17,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1923996.0, ans=0.0 2023-06-25 06:14:26,363 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.40 vs. limit=15.0 2023-06-25 06:15:04,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1924116.0, ans=0.0 2023-06-25 06:15:08,066 INFO [train.py:996] (1/4) Epoch 11, batch 15750, loss[loss=0.276, simple_loss=0.3206, pruned_loss=0.1157, over 21401.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3003, pruned_loss=0.07634, over 4258652.01 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:15:29,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1924236.0, ans=0.05 2023-06-25 06:15:32,250 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1924236.0, ans=0.0 2023-06-25 06:15:49,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1924296.0, ans=0.125 2023-06-25 06:16:19,556 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.466e+02 7.452e+02 1.136e+03 1.633e+03 2.643e+03, threshold=2.272e+03, percent-clipped=11.0 2023-06-25 06:16:38,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1924416.0, ans=0.0 2023-06-25 06:16:39,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.49 vs. limit=22.5 2023-06-25 06:16:55,697 INFO [train.py:996] (1/4) Epoch 11, batch 15800, loss[loss=0.2258, simple_loss=0.2947, pruned_loss=0.07841, over 21646.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2959, pruned_loss=0.07568, over 4259547.60 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:18:28,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1924716.0, ans=0.125 2023-06-25 06:18:45,293 INFO [train.py:996] (1/4) Epoch 11, batch 15850, loss[loss=0.1722, simple_loss=0.2315, pruned_loss=0.05643, over 20013.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2975, pruned_loss=0.07783, over 4259335.27 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:18:51,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.82 vs. 
limit=22.5 2023-06-25 06:19:17,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1924836.0, ans=0.2 2023-06-25 06:19:36,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1924896.0, ans=0.0 2023-06-25 06:19:42,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1924896.0, ans=0.0 2023-06-25 06:19:57,112 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 6.760e+02 9.766e+02 1.376e+03 2.542e+03, threshold=1.953e+03, percent-clipped=1.0 2023-06-25 06:20:28,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1925016.0, ans=0.04949747468305833 2023-06-25 06:20:34,250 INFO [train.py:996] (1/4) Epoch 11, batch 15900, loss[loss=0.2188, simple_loss=0.2938, pruned_loss=0.07187, over 21715.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2952, pruned_loss=0.07798, over 4260018.48 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:20:34,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1925076.0, ans=0.125 2023-06-25 06:21:10,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1925136.0, ans=0.125 2023-06-25 06:21:16,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1925196.0, ans=0.09899494936611666 2023-06-25 06:21:31,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1925196.0, ans=0.125 2023-06-25 06:22:22,493 INFO [train.py:996] (1/4) Epoch 11, batch 15950, loss[loss=0.2105, simple_loss=0.2894, pruned_loss=0.06577, over 21825.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2953, pruned_loss=0.07519, over 4265098.14 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:22:30,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1925376.0, ans=15.0 2023-06-25 06:23:35,062 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.240e+02 8.208e+02 1.106e+03 1.560e+03 3.108e+03, threshold=2.211e+03, percent-clipped=12.0 2023-06-25 06:23:58,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1925616.0, ans=0.125 2023-06-25 06:24:12,237 INFO [train.py:996] (1/4) Epoch 11, batch 16000, loss[loss=0.1553, simple_loss=0.2446, pruned_loss=0.03301, over 21426.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2961, pruned_loss=0.07308, over 4265485.59 frames. 
], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:24:14,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1925676.0, ans=0.2 2023-06-25 06:24:19,852 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:24:31,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1925736.0, ans=0.0 2023-06-25 06:24:36,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1925736.0, ans=0.0 2023-06-25 06:24:51,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1925796.0, ans=0.125 2023-06-25 06:24:52,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1925796.0, ans=0.0 2023-06-25 06:25:58,392 INFO [train.py:996] (1/4) Epoch 11, batch 16050, loss[loss=0.2068, simple_loss=0.3022, pruned_loss=0.05568, over 21665.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3009, pruned_loss=0.07214, over 4266570.03 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:26:55,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1926096.0, ans=0.125 2023-06-25 06:27:06,124 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.209e+02 1.010e+03 1.605e+03 2.461e+03 5.413e+03, threshold=3.210e+03, percent-clipped=30.0 2023-06-25 06:27:15,063 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:27:36,727 INFO [train.py:996] (1/4) Epoch 11, batch 16100, loss[loss=0.2494, simple_loss=0.322, pruned_loss=0.08835, over 21904.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3056, pruned_loss=0.07401, over 4278291.33 frames. ], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:28:35,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1926396.0, ans=0.125 2023-06-25 06:28:52,627 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:28:53,237 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-25 06:29:17,223 INFO [train.py:996] (1/4) Epoch 11, batch 16150, loss[loss=0.2, simple_loss=0.2894, pruned_loss=0.05533, over 21796.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3072, pruned_loss=0.07579, over 4284170.85 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:29:18,407 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.86 vs. 
limit=22.5 2023-06-25 06:29:48,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1926636.0, ans=0.125 2023-06-25 06:30:11,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1926696.0, ans=0.05 2023-06-25 06:30:12,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1926696.0, ans=0.05 2023-06-25 06:30:40,141 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.824e+02 1.229e+03 1.712e+03 3.510e+03, threshold=2.459e+03, percent-clipped=5.0 2023-06-25 06:31:16,344 INFO [train.py:996] (1/4) Epoch 11, batch 16200, loss[loss=0.189, simple_loss=0.2407, pruned_loss=0.06868, over 20303.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3116, pruned_loss=0.07739, over 4283119.40 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:31:23,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1926876.0, ans=0.125 2023-06-25 06:31:30,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-25 06:32:07,285 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:32:16,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1927056.0, ans=0.5 2023-06-25 06:32:39,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1927116.0, ans=0.1 2023-06-25 06:32:47,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. limit=10.0 2023-06-25 06:33:02,365 INFO [train.py:996] (1/4) Epoch 11, batch 16250, loss[loss=0.1827, simple_loss=0.2648, pruned_loss=0.05031, over 21797.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3109, pruned_loss=0.07718, over 4282819.24 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:33:02,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1927176.0, ans=0.1 2023-06-25 06:34:00,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1927296.0, ans=0.04949747468305833 2023-06-25 06:34:08,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1927356.0, ans=0.0 2023-06-25 06:34:11,390 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.190e+02 1.044e+03 1.433e+03 2.783e+03, threshold=2.088e+03, percent-clipped=4.0 2023-06-25 06:34:37,709 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:34:44,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1927416.0, ans=0.2 2023-06-25 06:34:49,108 INFO [train.py:996] (1/4) Epoch 11, batch 16300, loss[loss=0.1668, simple_loss=0.2523, pruned_loss=0.04066, over 21394.00 frames. 
], tot_loss[loss=0.2252, simple_loss=0.3038, pruned_loss=0.07332, over 4281522.81 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:35:54,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-25 06:36:36,594 INFO [train.py:996] (1/4) Epoch 11, batch 16350, loss[loss=0.2402, simple_loss=0.3139, pruned_loss=0.08327, over 21705.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3036, pruned_loss=0.07399, over 4266252.40 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:37:28,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-25 06:37:52,925 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 7.101e+02 1.051e+03 1.461e+03 2.820e+03, threshold=2.102e+03, percent-clipped=5.0 2023-06-25 06:38:24,471 INFO [train.py:996] (1/4) Epoch 11, batch 16400, loss[loss=0.2236, simple_loss=0.2941, pruned_loss=0.07651, over 21492.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3073, pruned_loss=0.07517, over 4263724.08 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:39:25,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1928256.0, ans=0.125 2023-06-25 06:39:55,147 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1928316.0, ans=0.125 2023-06-25 06:40:08,742 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:40:09,745 INFO [train.py:996] (1/4) Epoch 11, batch 16450, loss[loss=0.2092, simple_loss=0.2965, pruned_loss=0.06093, over 21850.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3065, pruned_loss=0.07571, over 4268842.99 frames. ], batch size: 316, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:40:42,669 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1928436.0, ans=0.5 2023-06-25 06:41:02,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1928496.0, ans=0.125 2023-06-25 06:41:22,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.919e+02 9.825e+02 1.554e+03 3.786e+03, threshold=1.965e+03, percent-clipped=13.0 2023-06-25 06:41:53,331 INFO [train.py:996] (1/4) Epoch 11, batch 16500, loss[loss=0.185, simple_loss=0.2512, pruned_loss=0.05942, over 21588.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.307, pruned_loss=0.07714, over 4274929.60 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:42:09,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.18 vs. 
limit=15.0 2023-06-25 06:42:25,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1928736.0, ans=0.0 2023-06-25 06:43:18,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1928916.0, ans=0.0 2023-06-25 06:43:37,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1928916.0, ans=0.125 2023-06-25 06:43:39,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1928916.0, ans=0.125 2023-06-25 06:43:39,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1928916.0, ans=0.0 2023-06-25 06:43:44,085 INFO [train.py:996] (1/4) Epoch 11, batch 16550, loss[loss=0.3025, simple_loss=0.3736, pruned_loss=0.1157, over 21450.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3082, pruned_loss=0.07626, over 4261971.91 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:43:53,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1928976.0, ans=0.2 2023-06-25 06:45:07,339 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.454e+02 9.180e+02 1.462e+03 2.154e+03 5.250e+03, threshold=2.924e+03, percent-clipped=28.0 2023-06-25 06:45:31,245 INFO [train.py:996] (1/4) Epoch 11, batch 16600, loss[loss=0.2137, simple_loss=0.3261, pruned_loss=0.0506, over 20798.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3135, pruned_loss=0.07807, over 4270568.76 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:45:34,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-25 06:46:00,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1929336.0, ans=0.125 2023-06-25 06:46:06,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1929336.0, ans=0.09899494936611666 2023-06-25 06:46:47,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1929456.0, ans=0.0 2023-06-25 06:46:49,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1929456.0, ans=0.2 2023-06-25 06:47:06,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1929516.0, ans=0.1 2023-06-25 06:47:21,352 INFO [train.py:996] (1/4) Epoch 11, batch 16650, loss[loss=0.2979, simple_loss=0.3687, pruned_loss=0.1136, over 21294.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3234, pruned_loss=0.08105, over 4273352.45 frames. 
], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:47:46,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1929636.0, ans=0.125 2023-06-25 06:48:03,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1929636.0, ans=0.0 2023-06-25 06:48:48,118 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.385e+02 1.061e+03 1.516e+03 3.591e+03, threshold=2.122e+03, percent-clipped=0.0 2023-06-25 06:48:59,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1929816.0, ans=0.0 2023-06-25 06:49:00,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1929816.0, ans=0.2 2023-06-25 06:49:09,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1929816.0, ans=0.125 2023-06-25 06:49:18,289 INFO [train.py:996] (1/4) Epoch 11, batch 16700, loss[loss=0.2286, simple_loss=0.3031, pruned_loss=0.07705, over 21708.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3241, pruned_loss=0.08217, over 4269823.53 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:50:02,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1929936.0, ans=0.125 2023-06-25 06:50:11,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1929996.0, ans=0.125 2023-06-25 06:50:12,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1929996.0, ans=0.125 2023-06-25 06:50:46,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1930056.0, ans=0.2 2023-06-25 06:50:57,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1930116.0, ans=0.125 2023-06-25 06:51:19,150 INFO [train.py:996] (1/4) Epoch 11, batch 16750, loss[loss=0.3278, simple_loss=0.4106, pruned_loss=0.1224, over 21412.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3267, pruned_loss=0.08415, over 4269782.71 frames. ], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:51:19,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1930176.0, ans=0.125 2023-06-25 06:51:20,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 06:51:44,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1930236.0, ans=0.125 2023-06-25 06:52:39,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.774e+02 8.238e+02 1.096e+03 1.590e+03 4.377e+03, threshold=2.192e+03, percent-clipped=15.0 2023-06-25 06:52:59,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-25 06:53:14,816 INFO [train.py:996] (1/4) Epoch 11, batch 16800, loss[loss=0.2321, simple_loss=0.2977, pruned_loss=0.08326, over 21436.00 frames. 
], tot_loss[loss=0.2483, simple_loss=0.3287, pruned_loss=0.08394, over 4268329.20 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:53:20,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1930476.0, ans=0.125 2023-06-25 06:53:30,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1930536.0, ans=0.125 2023-06-25 06:53:38,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1930536.0, ans=0.0 2023-06-25 06:53:57,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1930596.0, ans=0.0 2023-06-25 06:54:24,711 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1930656.0, ans=0.1 2023-06-25 06:54:33,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-25 06:54:59,774 INFO [train.py:996] (1/4) Epoch 11, batch 16850, loss[loss=0.2353, simple_loss=0.3107, pruned_loss=0.07996, over 21870.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3253, pruned_loss=0.08391, over 4278441.97 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:55:19,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-25 06:55:21,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1930836.0, ans=0.05 2023-06-25 06:56:12,568 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.019e+02 7.924e+02 1.109e+03 1.823e+03 3.367e+03, threshold=2.218e+03, percent-clipped=14.0 2023-06-25 06:56:28,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1931016.0, ans=0.125 2023-06-25 06:56:40,195 INFO [train.py:996] (1/4) Epoch 11, batch 16900, loss[loss=0.2015, simple_loss=0.2654, pruned_loss=0.06882, over 21199.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3186, pruned_loss=0.08191, over 4280320.91 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:56:50,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-25 06:57:16,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1931196.0, ans=0.0 2023-06-25 06:57:24,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1931196.0, ans=0.0 2023-06-25 06:57:43,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1931256.0, ans=0.0 2023-06-25 06:57:50,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-25 06:58:04,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. 
limit=22.5 2023-06-25 06:58:07,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1931316.0, ans=0.0 2023-06-25 06:58:23,619 INFO [train.py:996] (1/4) Epoch 11, batch 16950, loss[loss=0.223, simple_loss=0.299, pruned_loss=0.07349, over 21143.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3128, pruned_loss=0.0806, over 4280328.73 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:58:49,367 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-25 06:58:57,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1931436.0, ans=0.0 2023-06-25 06:59:41,839 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.539e+02 6.475e+02 7.581e+02 1.089e+03 2.288e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-25 07:00:09,584 INFO [train.py:996] (1/4) Epoch 11, batch 17000, loss[loss=0.2641, simple_loss=0.3223, pruned_loss=0.1029, over 21614.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3099, pruned_loss=0.08167, over 4286528.10 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:00:23,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1931676.0, ans=0.0 2023-06-25 07:00:29,634 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.66 vs. limit=15.0 2023-06-25 07:00:41,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1931736.0, ans=0.1 2023-06-25 07:00:42,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1931736.0, ans=0.0 2023-06-25 07:01:48,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1931916.0, ans=0.125 2023-06-25 07:01:55,710 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-25 07:01:56,043 INFO [train.py:996] (1/4) Epoch 11, batch 17050, loss[loss=0.2262, simple_loss=0.3162, pruned_loss=0.06814, over 21838.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3158, pruned_loss=0.08354, over 4286999.94 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:02:07,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1931976.0, ans=0.07 2023-06-25 07:02:57,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1932156.0, ans=0.125 2023-06-25 07:03:22,791 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 8.949e+02 1.142e+03 1.744e+03 3.951e+03, threshold=2.284e+03, percent-clipped=33.0 2023-06-25 07:03:29,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932216.0, ans=0.1 2023-06-25 07:03:42,198 INFO [train.py:996] (1/4) Epoch 11, batch 17100, loss[loss=0.2163, simple_loss=0.2933, pruned_loss=0.06965, over 21887.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3141, pruned_loss=0.08335, over 4286279.20 frames. 
], batch size: 351, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:03:52,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-25 07:04:33,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1932396.0, ans=0.0 2023-06-25 07:05:05,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1932456.0, ans=0.2 2023-06-25 07:05:10,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932516.0, ans=0.1 2023-06-25 07:05:19,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1932516.0, ans=0.035 2023-06-25 07:05:29,218 INFO [train.py:996] (1/4) Epoch 11, batch 17150, loss[loss=0.2344, simple_loss=0.2973, pruned_loss=0.08574, over 21882.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3122, pruned_loss=0.08326, over 4289887.27 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:06:48,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1932756.0, ans=0.0 2023-06-25 07:06:55,812 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 7.399e+02 1.011e+03 1.479e+03 2.669e+03, threshold=2.021e+03, percent-clipped=4.0 2023-06-25 07:07:16,417 INFO [train.py:996] (1/4) Epoch 11, batch 17200, loss[loss=0.2606, simple_loss=0.3353, pruned_loss=0.09291, over 21539.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3124, pruned_loss=0.08319, over 4286506.07 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:07:42,826 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-25 07:08:33,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1933056.0, ans=0.05 2023-06-25 07:08:34,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933056.0, ans=0.1 2023-06-25 07:09:10,169 INFO [train.py:996] (1/4) Epoch 11, batch 17250, loss[loss=0.2916, simple_loss=0.3595, pruned_loss=0.1119, over 21412.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3158, pruned_loss=0.08404, over 4278790.72 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:09:57,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1933236.0, ans=0.125 2023-06-25 07:10:31,131 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 7.674e+02 1.037e+03 1.511e+03 3.569e+03, threshold=2.074e+03, percent-clipped=11.0 2023-06-25 07:10:55,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1933476.0, ans=0.07 2023-06-25 07:10:56,492 INFO [train.py:996] (1/4) Epoch 11, batch 17300, loss[loss=0.2597, simple_loss=0.3353, pruned_loss=0.09203, over 21801.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3232, pruned_loss=0.08694, over 4284492.52 frames. 
], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:11:20,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1933476.0, ans=0.125 2023-06-25 07:11:24,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1933536.0, ans=0.2 2023-06-25 07:11:31,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1933536.0, ans=0.125 2023-06-25 07:11:43,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1933536.0, ans=0.125 2023-06-25 07:11:49,165 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-06-25 07:12:28,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1933716.0, ans=0.1 2023-06-25 07:12:50,941 INFO [train.py:996] (1/4) Epoch 11, batch 17350, loss[loss=0.2188, simple_loss=0.3054, pruned_loss=0.06612, over 21761.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3251, pruned_loss=0.08737, over 4282455.62 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:14:08,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.102e+02 8.837e+02 1.250e+03 1.745e+03 4.253e+03, threshold=2.500e+03, percent-clipped=18.0 2023-06-25 07:14:09,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1933956.0, ans=0.125 2023-06-25 07:14:32,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1934016.0, ans=0.125 2023-06-25 07:14:41,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934016.0, ans=0.1 2023-06-25 07:14:45,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1934076.0, ans=0.0 2023-06-25 07:14:46,102 INFO [train.py:996] (1/4) Epoch 11, batch 17400, loss[loss=0.2072, simple_loss=0.2956, pruned_loss=0.05938, over 21557.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3199, pruned_loss=0.08302, over 4265491.24 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:16:12,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934256.0, ans=0.1 2023-06-25 07:16:33,115 INFO [train.py:996] (1/4) Epoch 11, batch 17450, loss[loss=0.1945, simple_loss=0.261, pruned_loss=0.06403, over 21174.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3151, pruned_loss=0.08003, over 4267449.09 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:16:41,765 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.51 vs. 
limit=15.0 2023-06-25 07:16:44,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1934376.0, ans=0.0 2023-06-25 07:16:44,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1934376.0, ans=0.2 2023-06-25 07:16:44,149 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1934376.0, ans=0.125 2023-06-25 07:17:21,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1934496.0, ans=0.0 2023-06-25 07:17:27,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1934496.0, ans=0.05 2023-06-25 07:18:01,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1934556.0, ans=0.125 2023-06-25 07:18:04,028 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.727e+02 1.188e+03 2.165e+03 4.981e+03, threshold=2.376e+03, percent-clipped=19.0 2023-06-25 07:18:22,119 INFO [train.py:996] (1/4) Epoch 11, batch 17500, loss[loss=0.2706, simple_loss=0.3272, pruned_loss=0.107, over 21722.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3113, pruned_loss=0.07778, over 4273551.80 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:18:37,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1934736.0, ans=0.125 2023-06-25 07:18:53,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1934736.0, ans=0.1 2023-06-25 07:19:00,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1934796.0, ans=0.125 2023-06-25 07:19:18,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1934856.0, ans=0.5 2023-06-25 07:19:28,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1934856.0, ans=0.125 2023-06-25 07:19:47,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-25 07:19:49,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1934916.0, ans=0.125 2023-06-25 07:20:05,469 INFO [train.py:996] (1/4) Epoch 11, batch 17550, loss[loss=0.2319, simple_loss=0.3266, pruned_loss=0.06864, over 21372.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3113, pruned_loss=0.0768, over 4268596.74 frames. 
], batch size: 548, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:20:19,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1934976.0, ans=0.025 2023-06-25 07:20:53,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1935096.0, ans=0.0 2023-06-25 07:20:57,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1935096.0, ans=0.0 2023-06-25 07:21:29,551 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.993e+02 7.228e+02 9.267e+02 1.344e+03 3.002e+03, threshold=1.853e+03, percent-clipped=5.0 2023-06-25 07:21:49,265 INFO [train.py:996] (1/4) Epoch 11, batch 17600, loss[loss=0.2678, simple_loss=0.337, pruned_loss=0.09932, over 21444.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3141, pruned_loss=0.07729, over 4262198.40 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:22:17,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1935336.0, ans=0.2 2023-06-25 07:22:29,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1935396.0, ans=0.125 2023-06-25 07:22:37,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1935396.0, ans=0.125 2023-06-25 07:23:07,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1935456.0, ans=0.125 2023-06-25 07:23:43,432 INFO [train.py:996] (1/4) Epoch 11, batch 17650, loss[loss=0.2266, simple_loss=0.3069, pruned_loss=0.07319, over 21677.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3139, pruned_loss=0.07819, over 4270171.24 frames. 
], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:23:43,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1935576.0, ans=0.0 2023-06-25 07:24:03,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1935576.0, ans=0.125 2023-06-25 07:24:16,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1935636.0, ans=0.2 2023-06-25 07:24:24,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1935696.0, ans=0.0 2023-06-25 07:24:30,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1935696.0, ans=0.2 2023-06-25 07:24:30,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1935696.0, ans=0.0 2023-06-25 07:25:12,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 8.324e+02 1.406e+03 1.795e+03 4.059e+03, threshold=2.812e+03, percent-clipped=23.0 2023-06-25 07:25:13,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1935816.0, ans=0.125 2023-06-25 07:25:28,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1935816.0, ans=0.0 2023-06-25 07:25:30,942 INFO [train.py:996] (1/4) Epoch 11, batch 17700, loss[loss=0.2612, simple_loss=0.3471, pruned_loss=0.08767, over 21910.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.307, pruned_loss=0.07523, over 4273774.87 frames. ], batch size: 372, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:26:19,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-25 07:26:46,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1936056.0, ans=0.125 2023-06-25 07:26:50,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1936056.0, ans=0.2 2023-06-25 07:26:55,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1936056.0, ans=0.125 2023-06-25 07:27:18,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1936116.0, ans=0.1 2023-06-25 07:27:21,739 INFO [train.py:996] (1/4) Epoch 11, batch 17750, loss[loss=0.2714, simple_loss=0.344, pruned_loss=0.09945, over 21983.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3146, pruned_loss=0.07898, over 4264557.32 frames. ], batch size: 317, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:27:59,873 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1936236.0, ans=6.0 2023-06-25 07:28:34,792 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-06-25 07:28:50,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 6.855e+02 8.351e+02 1.068e+03 2.757e+03, threshold=1.670e+03, percent-clipped=0.0 2023-06-25 07:29:09,865 INFO [train.py:996] (1/4) Epoch 11, batch 17800, loss[loss=0.2229, simple_loss=0.2986, pruned_loss=0.07357, over 21427.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3136, pruned_loss=0.07817, over 4265265.01 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:30:01,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1936596.0, ans=0.07 2023-06-25 07:30:06,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1936596.0, ans=0.125 2023-06-25 07:30:07,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-25 07:30:17,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1936656.0, ans=0.125 2023-06-25 07:30:17,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1936656.0, ans=0.05 2023-06-25 07:30:19,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1936656.0, ans=0.125 2023-06-25 07:30:41,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1936716.0, ans=0.2 2023-06-25 07:30:57,471 INFO [train.py:996] (1/4) Epoch 11, batch 17850, loss[loss=0.2335, simple_loss=0.299, pruned_loss=0.08397, over 20013.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3147, pruned_loss=0.07875, over 4265958.46 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:31:28,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1936836.0, ans=0.125 2023-06-25 07:31:32,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1936836.0, ans=0.1 2023-06-25 07:31:38,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1936836.0, ans=0.125 2023-06-25 07:31:42,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1936896.0, ans=0.125 2023-06-25 07:32:18,482 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1936956.0, ans=0.0 2023-06-25 07:32:22,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 9.470e+02 1.328e+03 1.940e+03 3.459e+03, threshold=2.655e+03, percent-clipped=37.0 2023-06-25 07:32:31,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1937016.0, ans=0.125 2023-06-25 07:32:32,848 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-25 07:32:39,641 INFO [train.py:996] (1/4) Epoch 11, batch 17900, loss[loss=0.2719, simple_loss=0.3463, pruned_loss=0.09869, over 21809.00 frames. 
], tot_loss[loss=0.2414, simple_loss=0.3201, pruned_loss=0.08138, over 4274097.97 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:33:00,611 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:33:49,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1937256.0, ans=0.125 2023-06-25 07:34:01,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1937256.0, ans=0.02 2023-06-25 07:34:03,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1937256.0, ans=0.125 2023-06-25 07:34:37,311 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-25 07:34:41,381 INFO [train.py:996] (1/4) Epoch 11, batch 17950, loss[loss=0.1988, simple_loss=0.2932, pruned_loss=0.05221, over 21777.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3192, pruned_loss=0.07828, over 4267220.61 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:35:26,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1937496.0, ans=0.125 2023-06-25 07:35:35,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1937556.0, ans=0.2 2023-06-25 07:35:46,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1937556.0, ans=0.125 2023-06-25 07:35:59,053 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.777e+02 1.188e+03 1.807e+03 3.395e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-25 07:36:27,519 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.28 vs. limit=15.0 2023-06-25 07:36:27,838 INFO [train.py:996] (1/4) Epoch 11, batch 18000, loss[loss=0.2267, simple_loss=0.2911, pruned_loss=0.08118, over 21665.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3121, pruned_loss=0.07729, over 4269589.80 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:36:27,839 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 07:36:45,000 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2562, simple_loss=0.3557, pruned_loss=0.07833, over 1796401.00 frames. 2023-06-25 07:36:45,001 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 07:37:20,455 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=15.0 2023-06-25 07:37:34,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1937796.0, ans=0.2 2023-06-25 07:38:09,733 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1937916.0, ans=0.2 2023-06-25 07:38:17,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1937916.0, ans=0.125 2023-06-25 07:38:33,171 INFO [train.py:996] (1/4) Epoch 11, batch 18050, loss[loss=0.2004, simple_loss=0.2651, pruned_loss=0.06785, over 20764.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3065, pruned_loss=0.07661, over 4261604.24 frames. ], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:38:37,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-25 07:38:56,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1938036.0, ans=0.125 2023-06-25 07:39:06,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1938096.0, ans=0.125 2023-06-25 07:39:59,302 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.782e+02 7.777e+02 1.078e+03 1.586e+03 2.998e+03, threshold=2.156e+03, percent-clipped=7.0 2023-06-25 07:40:21,749 INFO [train.py:996] (1/4) Epoch 11, batch 18100, loss[loss=0.264, simple_loss=0.3576, pruned_loss=0.08517, over 21591.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3112, pruned_loss=0.07859, over 4262598.76 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:41:19,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1938396.0, ans=0.0 2023-06-25 07:41:33,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1938456.0, ans=0.04949747468305833 2023-06-25 07:42:08,831 INFO [train.py:996] (1/4) Epoch 11, batch 18150, loss[loss=0.2415, simple_loss=0.3309, pruned_loss=0.07604, over 21919.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3149, pruned_loss=0.07863, over 4266803.10 frames. ], batch size: 373, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:42:17,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1938576.0, ans=0.025 2023-06-25 07:42:25,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1938636.0, ans=0.2 2023-06-25 07:43:06,546 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-25 07:43:31,677 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 7.455e+02 1.236e+03 1.816e+03 3.616e+03, threshold=2.471e+03, percent-clipped=14.0 2023-06-25 07:43:34,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1938816.0, ans=0.125 2023-06-25 07:43:39,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=15.0 2023-06-25 07:43:54,197 INFO [train.py:996] (1/4) Epoch 11, batch 18200, loss[loss=0.2073, simple_loss=0.2766, pruned_loss=0.06907, over 21547.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3075, pruned_loss=0.07839, over 4253081.36 frames. ], batch size: 195, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:44:16,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1938936.0, ans=0.07 2023-06-25 07:45:33,106 INFO [train.py:996] (1/4) Epoch 11, batch 18250, loss[loss=0.2501, simple_loss=0.3124, pruned_loss=0.09391, over 21831.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2998, pruned_loss=0.07535, over 4250188.94 frames. ], batch size: 416, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:45:33,707 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1939176.0, ans=0.125 2023-06-25 07:45:49,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1939176.0, ans=0.95 2023-06-25 07:46:21,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1939296.0, ans=0.125 2023-06-25 07:46:56,935 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.451e+02 6.639e+02 9.483e+02 1.514e+03 2.544e+03, threshold=1.897e+03, percent-clipped=1.0 2023-06-25 07:47:21,096 INFO [train.py:996] (1/4) Epoch 11, batch 18300, loss[loss=0.2412, simple_loss=0.3031, pruned_loss=0.08969, over 21910.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2981, pruned_loss=0.07486, over 4254615.98 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:47:54,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1939536.0, ans=0.2 2023-06-25 07:48:00,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1939596.0, ans=0.2 2023-06-25 07:48:22,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939656.0, ans=0.1 2023-06-25 07:48:31,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1939656.0, ans=0.0 2023-06-25 07:48:46,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939716.0, ans=0.1 2023-06-25 07:48:49,555 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:49:01,562 INFO [train.py:996] (1/4) Epoch 11, batch 18350, loss[loss=0.1629, simple_loss=0.2362, pruned_loss=0.04481, over 16309.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3034, pruned_loss=0.07415, over 4243431.18 frames. 
], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:49:13,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1939776.0, ans=0.125 2023-06-25 07:49:37,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1939836.0, ans=0.035 2023-06-25 07:49:45,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1939896.0, ans=0.0 2023-06-25 07:50:32,433 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.196e+02 8.074e+02 1.390e+03 1.835e+03 4.417e+03, threshold=2.780e+03, percent-clipped=23.0 2023-06-25 07:50:55,402 INFO [train.py:996] (1/4) Epoch 11, batch 18400, loss[loss=0.1757, simple_loss=0.2576, pruned_loss=0.04688, over 21140.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2984, pruned_loss=0.07246, over 4242773.47 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 07:51:06,425 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1940076.0, ans=0.1 2023-06-25 07:51:20,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-25 07:52:43,127 INFO [train.py:996] (1/4) Epoch 11, batch 18450, loss[loss=0.2034, simple_loss=0.2686, pruned_loss=0.06915, over 21994.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2953, pruned_loss=0.06889, over 4253511.65 frames. ], batch size: 103, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:52:57,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-25 07:53:02,225 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1940436.0, ans=0.035 2023-06-25 07:53:32,775 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1940496.0, ans=0.125 2023-06-25 07:53:41,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-25 07:54:04,998 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.992e+02 1.032e+03 1.619e+03 3.807e+03, threshold=2.064e+03, percent-clipped=5.0 2023-06-25 07:54:07,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1940616.0, ans=0.0 2023-06-25 07:54:25,118 INFO [train.py:996] (1/4) Epoch 11, batch 18500, loss[loss=0.1923, simple_loss=0.2768, pruned_loss=0.05388, over 21793.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2924, pruned_loss=0.0686, over 4251073.37 frames. 
], batch size: 316, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:54:51,427 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1940736.0, ans=0.125 2023-06-25 07:55:53,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1940916.0, ans=0.125 2023-06-25 07:56:02,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1940916.0, ans=0.125 2023-06-25 07:56:07,586 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:56:15,451 INFO [train.py:996] (1/4) Epoch 11, batch 18550, loss[loss=0.1785, simple_loss=0.2392, pruned_loss=0.05888, over 20774.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2902, pruned_loss=0.06744, over 4240114.70 frames. ], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:56:34,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-25 07:57:21,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941096.0, ans=0.1 2023-06-25 07:57:49,423 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 7.256e+02 1.032e+03 1.520e+03 3.767e+03, threshold=2.064e+03, percent-clipped=11.0 2023-06-25 07:58:03,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1941276.0, ans=0.2 2023-06-25 07:58:04,490 INFO [train.py:996] (1/4) Epoch 11, batch 18600, loss[loss=0.2248, simple_loss=0.3009, pruned_loss=0.07433, over 21633.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2899, pruned_loss=0.06922, over 4230554.57 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:58:13,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1941276.0, ans=0.05 2023-06-25 07:58:23,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1941336.0, ans=0.125 2023-06-25 07:59:13,677 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:59:51,220 INFO [train.py:996] (1/4) Epoch 11, batch 18650, loss[loss=0.2289, simple_loss=0.293, pruned_loss=0.08239, over 21363.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2879, pruned_loss=0.06936, over 4229666.28 frames. 
], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:00:07,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1941636.0, ans=0.125 2023-06-25 08:00:10,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1941636.0, ans=0.0 2023-06-25 08:00:13,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1941636.0, ans=0.125 2023-06-25 08:00:32,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1941696.0, ans=0.125 2023-06-25 08:00:32,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1941696.0, ans=0.125 2023-06-25 08:00:42,152 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-25 08:01:21,717 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.720e+02 7.139e+02 9.409e+02 1.577e+03 2.753e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-25 08:01:23,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1941816.0, ans=0.2 2023-06-25 08:01:35,908 INFO [train.py:996] (1/4) Epoch 11, batch 18700, loss[loss=0.2403, simple_loss=0.2912, pruned_loss=0.09467, over 21753.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2864, pruned_loss=0.07131, over 4247225.36 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:01:36,587 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1941876.0, ans=0.0 2023-06-25 08:01:49,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1941876.0, ans=0.0 2023-06-25 08:01:55,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1941936.0, ans=10.0 2023-06-25 08:03:24,909 INFO [train.py:996] (1/4) Epoch 11, batch 18750, loss[loss=0.2649, simple_loss=0.3482, pruned_loss=0.0908, over 21632.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2898, pruned_loss=0.07419, over 4244596.48 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:04:42,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1942356.0, ans=0.1 2023-06-25 08:04:43,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1942356.0, ans=0.2 2023-06-25 08:04:47,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-25 08:04:50,037 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.399e+02 1.249e+03 1.994e+03 4.167e+03, threshold=2.497e+03, percent-clipped=25.0 2023-06-25 08:05:01,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-25 08:05:11,247 INFO [train.py:996] (1/4) Epoch 11, batch 18800, loss[loss=0.1842, simple_loss=0.2819, pruned_loss=0.0433, over 21867.00 frames. 
], tot_loss[loss=0.2238, simple_loss=0.2967, pruned_loss=0.07551, over 4238797.97 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 08:05:24,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1942476.0, ans=0.125 2023-06-25 08:05:41,462 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-25 08:06:39,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1942716.0, ans=0.0 2023-06-25 08:06:56,528 INFO [train.py:996] (1/4) Epoch 11, batch 18850, loss[loss=0.2023, simple_loss=0.2714, pruned_loss=0.06659, over 21506.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2972, pruned_loss=0.07234, over 4238825.42 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:07:23,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1942836.0, ans=0.125 2023-06-25 08:08:01,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1942956.0, ans=0.125 2023-06-25 08:08:21,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.140e+02 8.289e+02 1.259e+03 4.459e+03, threshold=1.658e+03, percent-clipped=10.0 2023-06-25 08:08:35,153 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=8.0 2023-06-25 08:08:40,580 INFO [train.py:996] (1/4) Epoch 11, batch 18900, loss[loss=0.2031, simple_loss=0.273, pruned_loss=0.06659, over 21823.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2931, pruned_loss=0.07202, over 4250488.09 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:09:02,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1943136.0, ans=0.0 2023-06-25 08:09:03,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1943136.0, ans=0.0 2023-06-25 08:09:13,562 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1943196.0, ans=0.125 2023-06-25 08:09:33,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1943196.0, ans=0.0 2023-06-25 08:10:10,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1943316.0, ans=0.125 2023-06-25 08:10:27,770 INFO [train.py:996] (1/4) Epoch 11, batch 18950, loss[loss=0.2106, simple_loss=0.2962, pruned_loss=0.06254, over 21811.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2947, pruned_loss=0.07456, over 4264437.49 frames. ], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:10:29,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1943376.0, ans=0.2 2023-06-25 08:11:12,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1943496.0, ans=0.1 2023-06-25 08:11:23,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.64 vs. 
limit=15.0 2023-06-25 08:11:39,334 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-25 08:12:02,082 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 8.258e+02 1.054e+03 1.529e+03 3.478e+03, threshold=2.107e+03, percent-clipped=19.0 2023-06-25 08:12:15,316 INFO [train.py:996] (1/4) Epoch 11, batch 19000, loss[loss=0.2765, simple_loss=0.345, pruned_loss=0.104, over 21777.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3014, pruned_loss=0.07581, over 4269556.87 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:12:25,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1943676.0, ans=0.125 2023-06-25 08:12:48,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1943796.0, ans=0.2 2023-06-25 08:13:23,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1943856.0, ans=0.125 2023-06-25 08:13:37,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1943856.0, ans=0.125 2023-06-25 08:14:01,787 INFO [train.py:996] (1/4) Epoch 11, batch 19050, loss[loss=0.2559, simple_loss=0.3221, pruned_loss=0.09484, over 21857.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3065, pruned_loss=0.07934, over 4271498.26 frames. ], batch size: 118, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:14:34,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1944036.0, ans=0.1 2023-06-25 08:15:33,625 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.658e+02 1.037e+03 1.522e+03 3.485e+03, threshold=2.073e+03, percent-clipped=12.0 2023-06-25 08:15:34,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1944216.0, ans=0.0 2023-06-25 08:15:48,086 INFO [train.py:996] (1/4) Epoch 11, batch 19100, loss[loss=0.2554, simple_loss=0.3304, pruned_loss=0.09018, over 20012.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3051, pruned_loss=0.0805, over 4272061.11 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:15:50,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1944276.0, ans=0.07 2023-06-25 08:15:53,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1944276.0, ans=0.125 2023-06-25 08:16:08,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1944336.0, ans=22.5 2023-06-25 08:17:11,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1944456.0, ans=0.0 2023-06-25 08:17:35,911 INFO [train.py:996] (1/4) Epoch 11, batch 19150, loss[loss=0.2244, simple_loss=0.3229, pruned_loss=0.06301, over 21699.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3074, pruned_loss=0.08091, over 4263273.48 frames. 
], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:17:42,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=12.0 2023-06-25 08:18:28,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1944696.0, ans=0.2 2023-06-25 08:19:14,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.349e+02 9.714e+02 1.394e+03 2.160e+03 4.455e+03, threshold=2.788e+03, percent-clipped=28.0 2023-06-25 08:19:26,299 INFO [train.py:996] (1/4) Epoch 11, batch 19200, loss[loss=0.2118, simple_loss=0.288, pruned_loss=0.06781, over 21862.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3152, pruned_loss=0.08057, over 4266382.03 frames. ], batch size: 98, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:20:14,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1944936.0, ans=0.125 2023-06-25 08:20:16,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 08:20:19,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.34 vs. limit=10.0 2023-06-25 08:20:35,854 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1944996.0, ans=0.125 2023-06-25 08:21:11,494 INFO [train.py:996] (1/4) Epoch 11, batch 19250, loss[loss=0.1846, simple_loss=0.2863, pruned_loss=0.04145, over 21774.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3158, pruned_loss=0.07571, over 4262144.47 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:21:17,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1945176.0, ans=0.125 2023-06-25 08:21:30,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1945176.0, ans=0.0 2023-06-25 08:22:46,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.037e+02 6.757e+02 9.006e+02 1.219e+03 2.409e+03, threshold=1.801e+03, percent-clipped=0.0 2023-06-25 08:22:57,452 INFO [train.py:996] (1/4) Epoch 11, batch 19300, loss[loss=0.2454, simple_loss=0.3233, pruned_loss=0.08374, over 21574.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3127, pruned_loss=0.07451, over 4262429.44 frames. ], batch size: 471, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:24:13,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1945656.0, ans=0.125 2023-06-25 08:24:22,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945656.0, ans=0.1 2023-06-25 08:24:52,181 INFO [train.py:996] (1/4) Epoch 11, batch 19350, loss[loss=0.1918, simple_loss=0.2793, pruned_loss=0.05211, over 21793.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3084, pruned_loss=0.07114, over 4268421.47 frames. 
], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:25:33,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1945896.0, ans=0.125 2023-06-25 08:25:48,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1945896.0, ans=0.125 2023-06-25 08:25:53,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1945896.0, ans=0.125 2023-06-25 08:25:56,198 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1945956.0, ans=0.0 2023-06-25 08:26:02,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945956.0, ans=0.1 2023-06-25 08:26:18,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.662e+02 8.721e+02 1.407e+03 2.132e+03 4.703e+03, threshold=2.815e+03, percent-clipped=33.0 2023-06-25 08:26:36,768 INFO [train.py:996] (1/4) Epoch 11, batch 19400, loss[loss=0.183, simple_loss=0.2596, pruned_loss=0.05321, over 21260.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3071, pruned_loss=0.07159, over 4276842.33 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:26:49,046 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1946076.0, ans=0.125 2023-06-25 08:27:17,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1946136.0, ans=0.0 2023-06-25 08:28:22,719 INFO [train.py:996] (1/4) Epoch 11, batch 19450, loss[loss=0.2232, simple_loss=0.293, pruned_loss=0.07672, over 21676.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3042, pruned_loss=0.07397, over 4289708.10 frames. ], batch size: 391, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:29:03,437 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:29:27,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1946496.0, ans=0.2 2023-06-25 08:29:28,765 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:29:35,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1946556.0, ans=0.125 2023-06-25 08:29:53,114 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.635e+02 8.363e+02 1.164e+03 1.702e+03 3.020e+03, threshold=2.327e+03, percent-clipped=5.0 2023-06-25 08:30:08,975 INFO [train.py:996] (1/4) Epoch 11, batch 19500, loss[loss=0.2047, simple_loss=0.2699, pruned_loss=0.06973, over 20800.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2998, pruned_loss=0.07543, over 4263188.13 frames. 
], batch size: 607, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:30:11,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1946676.0, ans=0.0 2023-06-25 08:31:47,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1946916.0, ans=0.0 2023-06-25 08:31:52,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1946916.0, ans=0.125 2023-06-25 08:31:57,018 INFO [train.py:996] (1/4) Epoch 11, batch 19550, loss[loss=0.2748, simple_loss=0.3605, pruned_loss=0.09453, over 21555.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2974, pruned_loss=0.07391, over 4258153.09 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:33:31,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.971e+02 1.072e+03 1.636e+03 3.226e+03, threshold=2.144e+03, percent-clipped=9.0 2023-06-25 08:33:40,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1947276.0, ans=0.0 2023-06-25 08:33:41,365 INFO [train.py:996] (1/4) Epoch 11, batch 19600, loss[loss=0.275, simple_loss=0.3473, pruned_loss=0.1013, over 21838.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2989, pruned_loss=0.07415, over 4263450.00 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:33:41,822 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1947276.0, ans=0.125 2023-06-25 08:35:36,926 INFO [train.py:996] (1/4) Epoch 11, batch 19650, loss[loss=0.2567, simple_loss=0.3219, pruned_loss=0.09577, over 21968.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3039, pruned_loss=0.07788, over 4270579.13 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:35:44,403 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1947576.0, ans=0.0 2023-06-25 08:36:11,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1947636.0, ans=0.0 2023-06-25 08:37:15,098 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.249e+02 7.644e+02 9.843e+02 1.375e+03 3.676e+03, threshold=1.969e+03, percent-clipped=9.0 2023-06-25 08:37:30,369 INFO [train.py:996] (1/4) Epoch 11, batch 19700, loss[loss=0.2062, simple_loss=0.2883, pruned_loss=0.06205, over 21625.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3073, pruned_loss=0.07851, over 4271368.62 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:37:30,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1947876.0, ans=0.125 2023-06-25 08:37:39,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1947876.0, ans=0.0 2023-06-25 08:37:59,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1947936.0, ans=0.2 2023-06-25 08:38:47,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1948056.0, ans=0.125 2023-06-25 08:39:12,088 INFO [train.py:996] (1/4) Epoch 11, batch 19750, loss[loss=0.2059, simple_loss=0.2866, pruned_loss=0.06258, over 21436.00 frames. 
], tot_loss[loss=0.2376, simple_loss=0.3159, pruned_loss=0.07961, over 4277383.44 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:39:20,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1948176.0, ans=0.0 2023-06-25 08:39:41,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1948236.0, ans=0.2 2023-06-25 08:39:42,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1948236.0, ans=0.125 2023-06-25 08:40:26,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1948356.0, ans=0.125 2023-06-25 08:40:40,583 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 08:40:49,902 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.288e+02 1.013e+03 1.397e+03 2.237e+03 5.539e+03, threshold=2.794e+03, percent-clipped=30.0 2023-06-25 08:40:52,681 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-25 08:41:00,496 INFO [train.py:996] (1/4) Epoch 11, batch 19800, loss[loss=0.2143, simple_loss=0.2813, pruned_loss=0.07369, over 21686.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3155, pruned_loss=0.08022, over 4278527.16 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:41:12,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-25 08:41:13,938 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.59 vs. limit=22.5 2023-06-25 08:41:24,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1948536.0, ans=0.2 2023-06-25 08:41:52,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1948596.0, ans=0.1 2023-06-25 08:42:19,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.93 vs. limit=10.0 2023-06-25 08:42:29,715 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1948716.0, ans=0.1 2023-06-25 08:42:47,231 INFO [train.py:996] (1/4) Epoch 11, batch 19850, loss[loss=0.2594, simple_loss=0.364, pruned_loss=0.07739, over 21266.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3093, pruned_loss=0.07598, over 4275256.55 frames. ], batch size: 549, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:43:15,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1948836.0, ans=0.0 2023-06-25 08:43:16,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1948836.0, ans=0.125 2023-06-25 08:44:08,725 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=22.5 2023-06-25 08:44:23,852 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.609e+02 1.066e+03 1.634e+03 3.345e+03, threshold=2.132e+03, percent-clipped=4.0 2023-06-25 08:44:26,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1949016.0, ans=0.1 2023-06-25 08:44:27,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1949016.0, ans=0.125 2023-06-25 08:44:29,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1949016.0, ans=0.125 2023-06-25 08:44:33,362 INFO [train.py:996] (1/4) Epoch 11, batch 19900, loss[loss=0.2014, simple_loss=0.2972, pruned_loss=0.05276, over 21803.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.309, pruned_loss=0.07338, over 4276095.01 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:45:06,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1949136.0, ans=0.125 2023-06-25 08:45:20,368 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-25 08:45:34,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1949196.0, ans=0.125 2023-06-25 08:45:34,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1949196.0, ans=0.1 2023-06-25 08:45:58,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1949256.0, ans=0.125 2023-06-25 08:46:01,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1949316.0, ans=0.125 2023-06-25 08:46:19,625 INFO [train.py:996] (1/4) Epoch 11, batch 19950, loss[loss=0.214, simple_loss=0.2988, pruned_loss=0.06463, over 21570.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.303, pruned_loss=0.07359, over 4278585.92 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:47:46,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1949616.0, ans=0.0 2023-06-25 08:47:53,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 7.247e+02 1.068e+03 1.569e+03 2.873e+03, threshold=2.135e+03, percent-clipped=11.0 2023-06-25 08:48:03,754 INFO [train.py:996] (1/4) Epoch 11, batch 20000, loss[loss=0.2376, simple_loss=0.3031, pruned_loss=0.08608, over 21253.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3038, pruned_loss=0.07384, over 4269058.05 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:48:12,847 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-25 08:48:55,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. 
limit=10.0 2023-06-25 08:49:28,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1949916.0, ans=0.0 2023-06-25 08:49:30,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1949916.0, ans=0.125 2023-06-25 08:49:36,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1949916.0, ans=0.05 2023-06-25 08:49:40,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1949916.0, ans=0.5 2023-06-25 08:49:45,807 INFO [train.py:996] (1/4) Epoch 11, batch 20050, loss[loss=0.2341, simple_loss=0.2941, pruned_loss=0.0871, over 21408.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3056, pruned_loss=0.07643, over 4276578.73 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:49:54,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1949976.0, ans=0.035 2023-06-25 08:50:18,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1950036.0, ans=0.0 2023-06-25 08:51:06,892 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-25 08:51:18,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1950216.0, ans=0.125 2023-06-25 08:51:23,080 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 8.021e+02 1.064e+03 1.748e+03 3.117e+03, threshold=2.127e+03, percent-clipped=13.0 2023-06-25 08:51:27,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1950216.0, ans=0.2 2023-06-25 08:51:33,744 INFO [train.py:996] (1/4) Epoch 11, batch 20100, loss[loss=0.2369, simple_loss=0.3251, pruned_loss=0.07436, over 21416.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.308, pruned_loss=0.07832, over 4284017.83 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:52:48,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 08:52:58,253 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1950456.0, ans=0.125 2023-06-25 08:53:28,308 INFO [train.py:996] (1/4) Epoch 11, batch 20150, loss[loss=0.2461, simple_loss=0.3244, pruned_loss=0.08389, over 21275.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3162, pruned_loss=0.0807, over 4283729.58 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:53:57,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1950636.0, ans=0.125 2023-06-25 08:55:01,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.89 vs. 
limit=15.0 2023-06-25 08:55:17,099 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.413e+02 8.388e+02 1.067e+03 1.531e+03 4.094e+03, threshold=2.133e+03, percent-clipped=12.0 2023-06-25 08:55:25,370 INFO [train.py:996] (1/4) Epoch 11, batch 20200, loss[loss=0.2001, simple_loss=0.2753, pruned_loss=0.06249, over 21389.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3216, pruned_loss=0.08335, over 4283436.10 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:55:28,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1950876.0, ans=0.0 2023-06-25 08:55:33,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1950876.0, ans=0.125 2023-06-25 08:56:34,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1951056.0, ans=0.2 2023-06-25 08:57:12,755 INFO [train.py:996] (1/4) Epoch 11, batch 20250, loss[loss=0.2114, simple_loss=0.286, pruned_loss=0.06837, over 21177.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3223, pruned_loss=0.08237, over 4286506.47 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:57:13,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1951176.0, ans=0.125 2023-06-25 08:57:43,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1951236.0, ans=0.0 2023-06-25 08:57:44,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1951236.0, ans=0.2 2023-06-25 08:58:08,097 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1951296.0, ans=0.09899494936611666 2023-06-25 08:58:17,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951356.0, ans=0.1 2023-06-25 08:58:17,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1951356.0, ans=0.0 2023-06-25 08:58:38,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1951356.0, ans=0.125 2023-06-25 08:58:52,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 7.038e+02 1.016e+03 1.334e+03 4.106e+03, threshold=2.032e+03, percent-clipped=11.0 2023-06-25 08:59:05,789 INFO [train.py:996] (1/4) Epoch 11, batch 20300, loss[loss=0.2071, simple_loss=0.2931, pruned_loss=0.06059, over 21358.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3187, pruned_loss=0.07895, over 4276701.92 frames. 
], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:59:14,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1951476.0, ans=0.125 2023-06-25 08:59:34,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1951536.0, ans=0.125 2023-06-25 08:59:41,588 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-25 08:59:41,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1951596.0, ans=0.07 2023-06-25 09:00:00,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1951656.0, ans=0.125 2023-06-25 09:00:23,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1951716.0, ans=0.1 2023-06-25 09:00:46,166 INFO [train.py:996] (1/4) Epoch 11, batch 20350, loss[loss=0.2704, simple_loss=0.3342, pruned_loss=0.1033, over 21290.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3188, pruned_loss=0.07967, over 4260491.94 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:00:54,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1951776.0, ans=0.125 2023-06-25 09:02:24,595 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.498e+02 1.071e+03 1.543e+03 3.638e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-25 09:02:31,886 INFO [train.py:996] (1/4) Epoch 11, batch 20400, loss[loss=0.2697, simple_loss=0.3517, pruned_loss=0.09383, over 21667.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.322, pruned_loss=0.0825, over 4251048.07 frames. ], batch size: 414, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:02:37,737 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.40 vs. limit=15.0 2023-06-25 09:03:34,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1952256.0, ans=0.125 2023-06-25 09:04:16,705 INFO [train.py:996] (1/4) Epoch 11, batch 20450, loss[loss=0.2405, simple_loss=0.301, pruned_loss=0.08999, over 21022.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3244, pruned_loss=0.08557, over 4258643.85 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:04:17,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1952376.0, ans=0.0 2023-06-25 09:04:27,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 09:04:38,281 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.67 vs. 
limit=15.0 2023-06-25 09:05:18,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1952556.0, ans=0.125 2023-06-25 09:05:50,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1952616.0, ans=0.0 2023-06-25 09:05:51,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1952616.0, ans=0.125 2023-06-25 09:05:55,802 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.393e+02 8.343e+02 1.181e+03 1.747e+03 3.039e+03, threshold=2.362e+03, percent-clipped=12.0 2023-06-25 09:06:02,397 INFO [train.py:996] (1/4) Epoch 11, batch 20500, loss[loss=0.2089, simple_loss=0.2738, pruned_loss=0.07199, over 21152.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3195, pruned_loss=0.08582, over 4258250.00 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:06:06,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1952676.0, ans=0.0 2023-06-25 09:06:22,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1952736.0, ans=0.125 2023-06-25 09:06:32,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1952736.0, ans=0.0 2023-06-25 09:07:48,665 INFO [train.py:996] (1/4) Epoch 11, batch 20550, loss[loss=0.2248, simple_loss=0.3092, pruned_loss=0.07022, over 21178.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3133, pruned_loss=0.08425, over 4261535.91 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:08:13,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1953036.0, ans=0.125 2023-06-25 09:08:20,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1953036.0, ans=0.125 2023-06-25 09:08:27,760 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-25 09:09:28,348 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 8.204e+02 1.449e+03 2.191e+03 5.725e+03, threshold=2.898e+03, percent-clipped=18.0 2023-06-25 09:09:40,421 INFO [train.py:996] (1/4) Epoch 11, batch 20600, loss[loss=0.2361, simple_loss=0.3033, pruned_loss=0.08441, over 21655.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3142, pruned_loss=0.08207, over 4258289.67 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:11:26,273 INFO [train.py:996] (1/4) Epoch 11, batch 20650, loss[loss=0.2322, simple_loss=0.2857, pruned_loss=0.0893, over 21165.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3094, pruned_loss=0.08184, over 4269548.14 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:11:41,642 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. 
limit=6.0 2023-06-25 09:12:15,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1953696.0, ans=0.1 2023-06-25 09:12:20,834 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:12:22,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1953696.0, ans=0.125 2023-06-25 09:13:03,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1953816.0, ans=0.125 2023-06-25 09:13:04,254 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.039e+02 6.601e+02 8.640e+02 1.224e+03 2.485e+03, threshold=1.728e+03, percent-clipped=0.0 2023-06-25 09:13:16,544 INFO [train.py:996] (1/4) Epoch 11, batch 20700, loss[loss=0.2041, simple_loss=0.2888, pruned_loss=0.05969, over 21663.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3029, pruned_loss=0.07863, over 4263420.20 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:14:05,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1953996.0, ans=0.125 2023-06-25 09:14:34,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1954056.0, ans=0.1 2023-06-25 09:15:08,007 INFO [train.py:996] (1/4) Epoch 11, batch 20750, loss[loss=0.2265, simple_loss=0.3148, pruned_loss=0.06909, over 21401.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3055, pruned_loss=0.0782, over 4260933.46 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:15:17,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1954176.0, ans=0.125 2023-06-25 09:15:45,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1954236.0, ans=0.0 2023-06-25 09:16:08,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1954296.0, ans=0.0 2023-06-25 09:16:48,250 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.092e+02 1.287e+03 1.980e+03 4.706e+03, threshold=2.574e+03, percent-clipped=34.0 2023-06-25 09:16:48,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1954416.0, ans=0.2 2023-06-25 09:16:55,079 INFO [train.py:996] (1/4) Epoch 11, batch 20800, loss[loss=0.2252, simple_loss=0.2832, pruned_loss=0.08363, over 21437.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3087, pruned_loss=0.07967, over 4258984.44 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 09:17:45,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1954596.0, ans=0.125 2023-06-25 09:17:46,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1954596.0, ans=0.125 2023-06-25 09:18:00,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1954656.0, ans=0.2 2023-06-25 09:18:38,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. 
limit=15.0 2023-06-25 09:18:40,349 INFO [train.py:996] (1/4) Epoch 11, batch 20850, loss[loss=0.2005, simple_loss=0.2647, pruned_loss=0.06808, over 21237.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3017, pruned_loss=0.0777, over 4261257.56 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:19:09,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1954836.0, ans=0.125 2023-06-25 09:19:33,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1954896.0, ans=0.0 2023-06-25 09:20:11,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1955016.0, ans=0.0 2023-06-25 09:20:20,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.513e+02 8.708e+02 1.139e+03 1.646e+03 3.626e+03, threshold=2.277e+03, percent-clipped=8.0 2023-06-25 09:20:25,826 INFO [train.py:996] (1/4) Epoch 11, batch 20900, loss[loss=0.2357, simple_loss=0.3101, pruned_loss=0.08065, over 21560.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3039, pruned_loss=0.07845, over 4258848.20 frames. ], batch size: 195, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:21:03,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1955196.0, ans=0.125 2023-06-25 09:21:28,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1955196.0, ans=0.1 2023-06-25 09:21:51,926 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1955316.0, ans=0.2 2023-06-25 09:22:08,635 INFO [train.py:996] (1/4) Epoch 11, batch 20950, loss[loss=0.1876, simple_loss=0.2685, pruned_loss=0.05334, over 21585.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2989, pruned_loss=0.07461, over 4256160.77 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:22:28,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1955436.0, ans=0.0 2023-06-25 09:23:40,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.503e+02 1.270e+03 1.885e+03 4.065e+03, threshold=2.540e+03, percent-clipped=13.0 2023-06-25 09:23:45,538 INFO [train.py:996] (1/4) Epoch 11, batch 21000, loss[loss=0.2097, simple_loss=0.2788, pruned_loss=0.0703, over 21812.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2963, pruned_loss=0.07447, over 4248701.03 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:23:45,539 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 09:24:03,606 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2627, simple_loss=0.3591, pruned_loss=0.08313, over 1796401.00 frames. 2023-06-25 09:24:03,607 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 09:24:07,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1955676.0, ans=0.0 2023-06-25 09:24:11,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-25 09:24:31,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. 
limit=15.0 2023-06-25 09:25:46,561 INFO [train.py:996] (1/4) Epoch 11, batch 21050, loss[loss=0.1822, simple_loss=0.2655, pruned_loss=0.04943, over 20161.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2941, pruned_loss=0.07461, over 4234129.53 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:26:22,142 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:26:41,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1956096.0, ans=0.125 2023-06-25 09:26:52,573 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1956156.0, ans=0.0 2023-06-25 09:27:27,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 6.581e+02 8.702e+02 1.278e+03 3.016e+03, threshold=1.740e+03, percent-clipped=3.0 2023-06-25 09:27:30,708 INFO [train.py:996] (1/4) Epoch 11, batch 21100, loss[loss=0.2117, simple_loss=0.2747, pruned_loss=0.0744, over 20180.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2898, pruned_loss=0.07398, over 4244218.66 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:27:34,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1956276.0, ans=0.125 2023-06-25 09:28:05,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1956336.0, ans=0.09899494936611666 2023-06-25 09:28:37,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1956456.0, ans=0.125 2023-06-25 09:28:43,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956456.0, ans=0.1 2023-06-25 09:29:15,546 INFO [train.py:996] (1/4) Epoch 11, batch 21150, loss[loss=0.1743, simple_loss=0.2381, pruned_loss=0.05527, over 21547.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.287, pruned_loss=0.07424, over 4239462.03 frames. ], batch size: 213, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:29:16,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1956576.0, ans=0.0 2023-06-25 09:29:27,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. 
limit=10.0 2023-06-25 09:29:39,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956636.0, ans=0.1 2023-06-25 09:29:52,503 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1956696.0, ans=0.125 2023-06-25 09:30:15,218 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1956696.0, ans=0.0 2023-06-25 09:30:16,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1956696.0, ans=0.0 2023-06-25 09:30:27,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1956756.0, ans=0.125 2023-06-25 09:30:55,565 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.371e+02 1.068e+03 1.667e+03 5.764e+03, threshold=2.137e+03, percent-clipped=24.0 2023-06-25 09:30:59,120 INFO [train.py:996] (1/4) Epoch 11, batch 21200, loss[loss=0.2012, simple_loss=0.2735, pruned_loss=0.06446, over 21993.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2827, pruned_loss=0.07289, over 4255498.55 frames. ], batch size: 103, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:32:37,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1957176.0, ans=0.125 2023-06-25 09:32:38,412 INFO [train.py:996] (1/4) Epoch 11, batch 21250, loss[loss=0.2655, simple_loss=0.3467, pruned_loss=0.09217, over 21322.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2819, pruned_loss=0.07303, over 4265199.64 frames. ], batch size: 551, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:33:23,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1957296.0, ans=0.125 2023-06-25 09:33:45,199 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=15.0 2023-06-25 09:33:49,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1957356.0, ans=0.125 2023-06-25 09:34:16,843 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 8.531e+02 1.344e+03 2.187e+03 4.666e+03, threshold=2.689e+03, percent-clipped=25.0 2023-06-25 09:34:18,284 INFO [train.py:996] (1/4) Epoch 11, batch 21300, loss[loss=0.2146, simple_loss=0.2931, pruned_loss=0.06812, over 21677.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2895, pruned_loss=0.07576, over 4273323.42 frames. 
], batch size: 230, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:34:44,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1957536.0, ans=0.2 2023-06-25 09:35:21,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1957596.0, ans=0.125 2023-06-25 09:35:47,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1957716.0, ans=0.125 2023-06-25 09:35:56,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1957716.0, ans=0.5 2023-06-25 09:36:04,276 INFO [train.py:996] (1/4) Epoch 11, batch 21350, loss[loss=0.216, simple_loss=0.3043, pruned_loss=0.06388, over 21789.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2946, pruned_loss=0.07679, over 4278943.31 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:36:04,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1957776.0, ans=0.0 2023-06-25 09:36:31,697 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-25 09:37:00,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1957896.0, ans=0.0 2023-06-25 09:37:10,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1957956.0, ans=0.0 2023-06-25 09:37:26,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1957956.0, ans=0.0 2023-06-25 09:37:37,360 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-25 09:37:55,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.142e+02 1.027e+03 1.660e+03 3.891e+03, threshold=2.053e+03, percent-clipped=5.0 2023-06-25 09:37:57,554 INFO [train.py:996] (1/4) Epoch 11, batch 21400, loss[loss=0.2959, simple_loss=0.3663, pruned_loss=0.1127, over 21536.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3005, pruned_loss=0.07793, over 4278867.67 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:38:47,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1958196.0, ans=0.0 2023-06-25 09:38:50,082 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-25 09:39:03,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1958256.0, ans=0.125 2023-06-25 09:39:16,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1958256.0, ans=0.2 2023-06-25 09:39:41,657 INFO [train.py:996] (1/4) Epoch 11, batch 21450, loss[loss=0.248, simple_loss=0.309, pruned_loss=0.09352, over 21302.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3023, pruned_loss=0.07837, over 4274289.90 frames. 
], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:39:42,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1958376.0, ans=0.0 2023-06-25 09:39:53,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1958376.0, ans=0.0 2023-06-25 09:39:55,095 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-25 09:40:20,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1958436.0, ans=0.125 2023-06-25 09:41:04,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1958616.0, ans=0.125 2023-06-25 09:41:25,335 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 7.277e+02 9.911e+02 1.372e+03 2.622e+03, threshold=1.982e+03, percent-clipped=4.0 2023-06-25 09:41:27,018 INFO [train.py:996] (1/4) Epoch 11, batch 21500, loss[loss=0.229, simple_loss=0.2876, pruned_loss=0.08519, over 21746.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2994, pruned_loss=0.07891, over 4270746.26 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:41:27,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1958676.0, ans=0.125 2023-06-25 09:42:27,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1958796.0, ans=0.04949747468305833 2023-06-25 09:43:11,048 INFO [train.py:996] (1/4) Epoch 11, batch 21550, loss[loss=0.2023, simple_loss=0.2602, pruned_loss=0.07222, over 21303.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.291, pruned_loss=0.07524, over 4264488.91 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:43:48,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1959096.0, ans=0.0 2023-06-25 09:43:48,692 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:44:31,875 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 09:44:51,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.990e+02 1.429e+03 2.000e+03 5.379e+03, threshold=2.857e+03, percent-clipped=25.0 2023-06-25 09:44:53,185 INFO [train.py:996] (1/4) Epoch 11, batch 21600, loss[loss=0.2162, simple_loss=0.306, pruned_loss=0.0632, over 21596.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2866, pruned_loss=0.07329, over 4267083.80 frames. 
], batch size: 263, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:45:18,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1959276.0, ans=0.0 2023-06-25 09:45:28,267 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1959336.0, ans=0.1 2023-06-25 09:46:04,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1959456.0, ans=0.125 2023-06-25 09:46:28,653 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-25 09:46:40,937 INFO [train.py:996] (1/4) Epoch 11, batch 21650, loss[loss=0.2867, simple_loss=0.3777, pruned_loss=0.09784, over 21505.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2931, pruned_loss=0.07225, over 4266056.63 frames. ], batch size: 507, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:46:49,939 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1959576.0, ans=10.0 2023-06-25 09:47:40,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-25 09:48:11,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1959816.0, ans=0.125 2023-06-25 09:48:23,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1959816.0, ans=0.2 2023-06-25 09:48:25,905 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 8.539e+02 1.351e+03 1.899e+03 3.491e+03, threshold=2.702e+03, percent-clipped=7.0 2023-06-25 09:48:27,776 INFO [train.py:996] (1/4) Epoch 11, batch 21700, loss[loss=0.2248, simple_loss=0.3149, pruned_loss=0.06735, over 21654.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2948, pruned_loss=0.07063, over 4269639.84 frames. ], batch size: 414, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:49:19,467 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-06-25 09:49:27,260 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1960056.0, ans=0.2 2023-06-25 09:50:12,979 INFO [train.py:996] (1/4) Epoch 11, batch 21750, loss[loss=0.2201, simple_loss=0.2865, pruned_loss=0.07685, over 21880.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2908, pruned_loss=0.07053, over 4272142.17 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:50:41,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1960236.0, ans=0.125 2023-06-25 09:51:11,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1960296.0, ans=0.0 2023-06-25 09:51:58,454 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 8.216e+02 1.100e+03 1.452e+03 3.027e+03, threshold=2.200e+03, percent-clipped=1.0 2023-06-25 09:51:59,874 INFO [train.py:996] (1/4) Epoch 11, batch 21800, loss[loss=0.191, simple_loss=0.2534, pruned_loss=0.06429, over 21613.00 frames. 
], tot_loss[loss=0.2158, simple_loss=0.2884, pruned_loss=0.07162, over 4264928.64 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:52:13,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1960476.0, ans=0.2 2023-06-25 09:52:56,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1960596.0, ans=15.0 2023-06-25 09:53:01,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-06-25 09:53:45,110 INFO [train.py:996] (1/4) Epoch 11, batch 21850, loss[loss=0.2463, simple_loss=0.3563, pruned_loss=0.0682, over 21260.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2936, pruned_loss=0.07251, over 4253194.21 frames. ], batch size: 549, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:54:08,110 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-25 09:54:25,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1960836.0, ans=0.1 2023-06-25 09:54:38,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1960896.0, ans=0.1 2023-06-25 09:54:42,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1960896.0, ans=0.1 2023-06-25 09:54:51,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1960956.0, ans=0.04949747468305833 2023-06-25 09:54:52,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-25 09:55:02,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1960956.0, ans=0.125 2023-06-25 09:55:09,221 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-25 09:55:11,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961016.0, ans=0.125 2023-06-25 09:55:27,841 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.360e+02 1.053e+03 1.458e+03 3.571e+03, threshold=2.107e+03, percent-clipped=7.0 2023-06-25 09:55:35,160 INFO [train.py:996] (1/4) Epoch 11, batch 21900, loss[loss=0.2141, simple_loss=0.2806, pruned_loss=0.07381, over 14335.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2965, pruned_loss=0.07408, over 4259446.39 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:55:58,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.17 vs. 
limit=15.0 2023-06-25 09:56:42,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1961256.0, ans=0.0 2023-06-25 09:56:44,035 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1961256.0, ans=0.125 2023-06-25 09:56:56,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2023-06-25 09:57:20,436 INFO [train.py:996] (1/4) Epoch 11, batch 21950, loss[loss=0.1919, simple_loss=0.2584, pruned_loss=0.06269, over 21301.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2907, pruned_loss=0.07354, over 4270324.34 frames. ], batch size: 144, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 09:57:25,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-06-25 09:57:27,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1961376.0, ans=0.2 2023-06-25 09:58:57,730 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.305e+02 6.629e+02 8.784e+02 1.230e+03 3.737e+03, threshold=1.757e+03, percent-clipped=5.0 2023-06-25 09:58:59,387 INFO [train.py:996] (1/4) Epoch 11, batch 22000, loss[loss=0.1588, simple_loss=0.2369, pruned_loss=0.04039, over 21510.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2854, pruned_loss=0.07112, over 4268422.52 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 09:59:09,968 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1961676.0, ans=15.0 2023-06-25 09:59:29,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1961736.0, ans=0.125 2023-06-25 09:59:42,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.30 vs. limit=15.0 2023-06-25 10:00:34,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-25 10:00:36,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1961916.0, ans=0.0 2023-06-25 10:00:50,113 INFO [train.py:996] (1/4) Epoch 11, batch 22050, loss[loss=0.2399, simple_loss=0.3197, pruned_loss=0.08008, over 21392.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2878, pruned_loss=0.07168, over 4254717.89 frames. 
], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:00:50,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961976.0, ans=0.1 2023-06-25 10:01:12,668 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:01:34,647 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962096.0, ans=0.125 2023-06-25 10:02:37,406 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.423e+02 9.011e+02 1.269e+03 1.922e+03 5.194e+03, threshold=2.539e+03, percent-clipped=30.0 2023-06-25 10:02:37,432 INFO [train.py:996] (1/4) Epoch 11, batch 22100, loss[loss=0.2916, simple_loss=0.3989, pruned_loss=0.09218, over 19712.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2994, pruned_loss=0.07683, over 4247613.26 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:03:05,704 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1962336.0, ans=0.1 2023-06-25 10:03:21,710 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1962396.0, ans=0.2 2023-06-25 10:03:37,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962456.0, ans=0.1 2023-06-25 10:03:39,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1962456.0, ans=0.125 2023-06-25 10:03:47,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-25 10:04:23,276 INFO [train.py:996] (1/4) Epoch 11, batch 22150, loss[loss=0.2052, simple_loss=0.277, pruned_loss=0.06671, over 21674.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3028, pruned_loss=0.07848, over 4260696.66 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:04:52,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1962636.0, ans=0.125 2023-06-25 10:04:53,369 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-25 10:05:48,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1962816.0, ans=0.0 2023-06-25 10:06:07,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962816.0, ans=0.1 2023-06-25 10:06:07,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1962816.0, ans=0.125 2023-06-25 10:06:10,659 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.244e+02 8.641e+02 1.312e+03 2.175e+03 4.145e+03, threshold=2.624e+03, percent-clipped=16.0 2023-06-25 10:06:10,681 INFO [train.py:996] (1/4) Epoch 11, batch 22200, loss[loss=0.2543, simple_loss=0.3182, pruned_loss=0.09526, over 21774.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3061, pruned_loss=0.07901, over 4265954.55 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:06:14,611 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:07:10,353 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1963056.0, ans=0.0 2023-06-25 10:07:25,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1963056.0, ans=0.0 2023-06-25 10:07:56,912 INFO [train.py:996] (1/4) Epoch 11, batch 22250, loss[loss=0.3143, simple_loss=0.3733, pruned_loss=0.1276, over 21336.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3117, pruned_loss=0.08029, over 4269417.22 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:08:37,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1963296.0, ans=0.125 2023-06-25 10:08:51,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1963296.0, ans=0.125 2023-06-25 10:09:14,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1963356.0, ans=0.07 2023-06-25 10:09:44,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.191e+02 1.032e+03 1.470e+03 3.757e+03, threshold=2.063e+03, percent-clipped=7.0 2023-06-25 10:09:44,591 INFO [train.py:996] (1/4) Epoch 11, batch 22300, loss[loss=0.2828, simple_loss=0.3443, pruned_loss=0.1107, over 21353.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3137, pruned_loss=0.0825, over 4274091.27 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:09:52,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1963476.0, ans=0.0 2023-06-25 10:09:53,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1963476.0, ans=0.1 2023-06-25 10:10:35,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1963596.0, ans=0.125 2023-06-25 10:10:42,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1963596.0, ans=0.1 2023-06-25 10:10:47,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1963656.0, ans=0.0 2023-06-25 10:11:22,799 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1963716.0, ans=0.125 2023-06-25 10:11:26,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1963716.0, ans=10.0 2023-06-25 10:11:34,596 INFO [train.py:996] (1/4) Epoch 11, batch 22350, loss[loss=0.241, simple_loss=0.3065, pruned_loss=0.08777, over 21774.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3119, pruned_loss=0.08374, over 4280344.03 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:12:01,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1963836.0, ans=0.5 2023-06-25 10:12:56,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1963956.0, ans=0.125 2023-06-25 10:13:21,916 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 7.078e+02 9.389e+02 1.336e+03 2.790e+03, threshold=1.878e+03, percent-clipped=4.0 2023-06-25 10:13:21,941 INFO [train.py:996] (1/4) Epoch 11, batch 22400, loss[loss=0.1943, simple_loss=0.2632, pruned_loss=0.06271, over 21263.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3086, pruned_loss=0.08077, over 4283761.23 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 10:13:49,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1964136.0, ans=0.0 2023-06-25 10:14:32,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1964256.0, ans=0.125 2023-06-25 10:14:54,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-25 10:15:05,398 INFO [train.py:996] (1/4) Epoch 11, batch 22450, loss[loss=0.1874, simple_loss=0.2508, pruned_loss=0.062, over 21095.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3031, pruned_loss=0.07958, over 4259329.87 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:15:24,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1964436.0, ans=0.2 2023-06-25 10:15:40,153 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1964436.0, ans=0.025 2023-06-25 10:15:41,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1964436.0, ans=0.07 2023-06-25 10:16:03,302 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-25 10:16:03,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-25 10:16:46,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1964616.0, ans=0.0 2023-06-25 10:16:53,984 INFO [train.py:996] (1/4) Epoch 11, batch 22500, loss[loss=0.2429, simple_loss=0.3259, pruned_loss=0.07995, over 21540.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2983, pruned_loss=0.07865, over 4268095.60 frames. 
], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:16:55,665 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 7.459e+02 1.048e+03 1.318e+03 4.554e+03, threshold=2.097e+03, percent-clipped=12.0 2023-06-25 10:17:23,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1964736.0, ans=0.125 2023-06-25 10:17:49,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1964796.0, ans=0.2 2023-06-25 10:18:40,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1964976.0, ans=0.2 2023-06-25 10:18:41,468 INFO [train.py:996] (1/4) Epoch 11, batch 22550, loss[loss=0.1731, simple_loss=0.227, pruned_loss=0.05957, over 20755.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3003, pruned_loss=0.07874, over 4265944.81 frames. ], batch size: 609, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:20:06,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1965156.0, ans=0.125 2023-06-25 10:20:10,802 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 10:20:36,206 INFO [train.py:996] (1/4) Epoch 11, batch 22600, loss[loss=0.3421, simple_loss=0.4128, pruned_loss=0.1356, over 21453.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3045, pruned_loss=0.07903, over 4269547.57 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:20:37,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1965276.0, ans=0.2 2023-06-25 10:20:39,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.096e+02 1.052e+03 1.426e+03 2.192e+03 4.902e+03, threshold=2.852e+03, percent-clipped=27.0 2023-06-25 10:20:45,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=17.26 vs. limit=15.0 2023-06-25 10:21:02,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1965336.0, ans=0.125 2023-06-25 10:21:19,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1965396.0, ans=0.07 2023-06-25 10:21:43,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1965456.0, ans=0.2 2023-06-25 10:22:02,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1965516.0, ans=0.125 2023-06-25 10:22:03,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-25 10:22:15,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1965516.0, ans=0.0 2023-06-25 10:22:20,079 INFO [train.py:996] (1/4) Epoch 11, batch 22650, loss[loss=0.2016, simple_loss=0.2701, pruned_loss=0.06658, over 21400.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3009, pruned_loss=0.07809, over 4264801.48 frames. 
], batch size: 131, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:22:27,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1965576.0, ans=0.1 2023-06-25 10:22:28,706 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1965576.0, ans=0.125 2023-06-25 10:22:54,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1965636.0, ans=0.125 2023-06-25 10:23:09,919 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1965696.0, ans=0.0 2023-06-25 10:23:15,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1965696.0, ans=0.0 2023-06-25 10:23:17,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1965756.0, ans=0.125 2023-06-25 10:23:22,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1965756.0, ans=0.09899494936611666 2023-06-25 10:24:02,903 INFO [train.py:996] (1/4) Epoch 11, batch 22700, loss[loss=0.2151, simple_loss=0.2774, pruned_loss=0.07638, over 21822.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2943, pruned_loss=0.07714, over 4251813.45 frames. ], batch size: 352, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:24:06,050 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.606e+02 1.053e+03 1.643e+03 3.332e+03, threshold=2.107e+03, percent-clipped=4.0 2023-06-25 10:24:10,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1965876.0, ans=0.1 2023-06-25 10:24:27,558 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-25 10:25:04,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1966056.0, ans=0.125 2023-06-25 10:25:19,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1966056.0, ans=0.05 2023-06-25 10:25:50,183 INFO [train.py:996] (1/4) Epoch 11, batch 22750, loss[loss=0.2101, simple_loss=0.2716, pruned_loss=0.07426, over 20738.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2968, pruned_loss=0.07866, over 4259490.61 frames. 
], batch size: 607, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:26:02,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1966176.0, ans=0.125 2023-06-25 10:26:04,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1966176.0, ans=0.125 2023-06-25 10:26:38,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1966296.0, ans=0.125 2023-06-25 10:27:00,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1966356.0, ans=0.125 2023-06-25 10:27:02,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966356.0, ans=0.1 2023-06-25 10:27:40,889 INFO [train.py:996] (1/4) Epoch 11, batch 22800, loss[loss=0.2617, simple_loss=0.3232, pruned_loss=0.1001, over 21842.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3019, pruned_loss=0.08066, over 4266548.43 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:27:43,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1966476.0, ans=0.1 2023-06-25 10:27:44,255 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 8.724e+02 1.380e+03 2.378e+03 6.132e+03, threshold=2.761e+03, percent-clipped=34.0 2023-06-25 10:27:46,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1966476.0, ans=0.0 2023-06-25 10:28:28,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1966596.0, ans=0.125 2023-06-25 10:29:13,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1966716.0, ans=0.1 2023-06-25 10:29:25,769 INFO [train.py:996] (1/4) Epoch 11, batch 22850, loss[loss=0.251, simple_loss=0.3344, pruned_loss=0.08379, over 19935.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.298, pruned_loss=0.08033, over 4274953.84 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:29:55,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-25 10:31:12,052 INFO [train.py:996] (1/4) Epoch 11, batch 22900, loss[loss=0.323, simple_loss=0.4195, pruned_loss=0.1132, over 21457.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2991, pruned_loss=0.08018, over 4259403.36 frames. 
], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:31:14,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1967076.0, ans=0.125 2023-06-25 10:31:15,877 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.842e+02 1.024e+03 1.500e+03 4.089e+03, threshold=2.047e+03, percent-clipped=2.0 2023-06-25 10:31:26,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1967076.0, ans=0.125 2023-06-25 10:31:30,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1967136.0, ans=0.07 2023-06-25 10:31:39,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1967136.0, ans=0.125 2023-06-25 10:32:27,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1967256.0, ans=0.125 2023-06-25 10:32:41,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1967256.0, ans=0.125 2023-06-25 10:32:46,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1967316.0, ans=0.125 2023-06-25 10:32:59,829 INFO [train.py:996] (1/4) Epoch 11, batch 22950, loss[loss=0.2363, simple_loss=0.3505, pruned_loss=0.061, over 21676.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3106, pruned_loss=0.07803, over 4261168.27 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:33:10,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1967376.0, ans=0.1 2023-06-25 10:33:35,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-25 10:33:42,782 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1967496.0, ans=0.2 2023-06-25 10:33:53,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1967496.0, ans=0.0 2023-06-25 10:34:21,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1967556.0, ans=0.1 2023-06-25 10:34:24,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1967556.0, ans=0.125 2023-06-25 10:34:44,066 INFO [train.py:996] (1/4) Epoch 11, batch 23000, loss[loss=0.2331, simple_loss=0.3658, pruned_loss=0.05015, over 20756.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3117, pruned_loss=0.07677, over 4266515.12 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:34:47,333 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.236e+02 8.204e+02 1.340e+03 2.035e+03 4.542e+03, threshold=2.680e+03, percent-clipped=23.0 2023-06-25 10:35:16,983 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.62 vs. 
limit=22.5 2023-06-25 10:35:19,054 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-25 10:35:20,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.39 vs. limit=22.5 2023-06-25 10:36:09,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1967856.0, ans=0.125 2023-06-25 10:36:31,927 INFO [train.py:996] (1/4) Epoch 11, batch 23050, loss[loss=0.2748, simple_loss=0.3449, pruned_loss=0.1023, over 21587.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3125, pruned_loss=0.07822, over 4272898.20 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:37:05,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1968036.0, ans=0.0 2023-06-25 10:38:15,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1968216.0, ans=0.125 2023-06-25 10:38:18,105 INFO [train.py:996] (1/4) Epoch 11, batch 23100, loss[loss=0.2255, simple_loss=0.2845, pruned_loss=0.08321, over 21592.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3089, pruned_loss=0.07922, over 4271867.49 frames. ], batch size: 415, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:38:28,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1968276.0, ans=0.125 2023-06-25 10:38:28,846 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 10:38:29,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.702e+02 7.516e+02 1.022e+03 1.433e+03 4.307e+03, threshold=2.044e+03, percent-clipped=3.0 2023-06-25 10:39:03,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1968336.0, ans=0.0 2023-06-25 10:39:29,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1968456.0, ans=0.125 2023-06-25 10:39:46,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-25 10:39:51,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1968516.0, ans=0.125 2023-06-25 10:39:52,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1968516.0, ans=0.125 2023-06-25 10:40:00,067 INFO [train.py:996] (1/4) Epoch 11, batch 23150, loss[loss=0.2396, simple_loss=0.3016, pruned_loss=0.08876, over 21593.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.304, pruned_loss=0.07874, over 4279001.69 frames. 
], batch size: 471, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:40:19,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1968576.0, ans=0.125 2023-06-25 10:40:43,949 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1968696.0, ans=0.2 2023-06-25 10:41:28,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-25 10:41:38,193 INFO [train.py:996] (1/4) Epoch 11, batch 23200, loss[loss=0.2497, simple_loss=0.3188, pruned_loss=0.09027, over 21281.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3044, pruned_loss=0.07995, over 4285122.09 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:41:43,107 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.976e+02 7.986e+02 1.089e+03 1.684e+03 3.717e+03, threshold=2.178e+03, percent-clipped=18.0 2023-06-25 10:42:08,454 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:42:34,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1968996.0, ans=0.125 2023-06-25 10:43:19,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1969116.0, ans=0.0 2023-06-25 10:43:30,232 INFO [train.py:996] (1/4) Epoch 11, batch 23250, loss[loss=0.2934, simple_loss=0.3373, pruned_loss=0.1247, over 21808.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3046, pruned_loss=0.08089, over 4294079.37 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:43:50,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1969236.0, ans=0.2 2023-06-25 10:44:16,489 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1969296.0, ans=0.0 2023-06-25 10:45:16,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1969476.0, ans=0.125 2023-06-25 10:45:17,590 INFO [train.py:996] (1/4) Epoch 11, batch 23300, loss[loss=0.2369, simple_loss=0.3108, pruned_loss=0.08153, over 20069.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3101, pruned_loss=0.08241, over 4286300.48 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:45:22,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.331e+02 7.944e+02 1.056e+03 1.535e+03 4.546e+03, threshold=2.112e+03, percent-clipped=10.0 2023-06-25 10:46:19,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. limit=6.0 2023-06-25 10:47:03,385 INFO [train.py:996] (1/4) Epoch 11, batch 23350, loss[loss=0.1828, simple_loss=0.2769, pruned_loss=0.0444, over 21805.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3136, pruned_loss=0.08157, over 4273135.38 frames. 
], batch size: 372, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:47:29,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1969836.0, ans=0.0 2023-06-25 10:47:41,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1969836.0, ans=0.125 2023-06-25 10:48:04,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1969896.0, ans=0.025 2023-06-25 10:48:04,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1969896.0, ans=0.04949747468305833 2023-06-25 10:48:46,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-25 10:48:54,042 INFO [train.py:996] (1/4) Epoch 11, batch 23400, loss[loss=0.2487, simple_loss=0.3168, pruned_loss=0.09031, over 21928.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.308, pruned_loss=0.0781, over 4278940.81 frames. ], batch size: 333, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:49:07,029 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 8.471e+02 1.302e+03 1.874e+03 3.604e+03, threshold=2.604e+03, percent-clipped=20.0 2023-06-25 10:50:02,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1970256.0, ans=0.125 2023-06-25 10:50:23,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-25 10:50:34,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1970316.0, ans=0.125 2023-06-25 10:50:34,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1970316.0, ans=0.125 2023-06-25 10:50:48,366 INFO [train.py:996] (1/4) Epoch 11, batch 23450, loss[loss=0.2839, simple_loss=0.3456, pruned_loss=0.1111, over 21744.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3106, pruned_loss=0.08096, over 4281545.97 frames. ], batch size: 298, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:51:30,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1970496.0, ans=0.0 2023-06-25 10:52:27,822 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-25 10:52:32,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 10:52:34,820 INFO [train.py:996] (1/4) Epoch 11, batch 23500, loss[loss=0.2584, simple_loss=0.3198, pruned_loss=0.0985, over 21865.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3113, pruned_loss=0.08233, over 4280226.97 frames. 
], batch size: 351, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:52:41,437 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 8.381e+02 1.197e+03 1.768e+03 4.081e+03, threshold=2.394e+03, percent-clipped=6.0 2023-06-25 10:52:59,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-25 10:53:03,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1970736.0, ans=0.125 2023-06-25 10:53:05,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970736.0, ans=0.1 2023-06-25 10:53:34,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1970856.0, ans=0.1 2023-06-25 10:54:19,599 INFO [train.py:996] (1/4) Epoch 11, batch 23550, loss[loss=0.2028, simple_loss=0.2725, pruned_loss=0.06656, over 21414.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3058, pruned_loss=0.08262, over 4276092.36 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:54:26,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1970976.0, ans=0.125 2023-06-25 10:54:28,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-25 10:55:31,167 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-25 10:56:01,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=1971216.0, ans=22.5 2023-06-25 10:56:04,921 INFO [train.py:996] (1/4) Epoch 11, batch 23600, loss[loss=0.2324, simple_loss=0.3038, pruned_loss=0.08045, over 21786.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3068, pruned_loss=0.0826, over 4251326.29 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:56:11,644 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-25 10:56:17,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.780e+02 1.013e+03 1.475e+03 2.570e+03, threshold=2.026e+03, percent-clipped=2.0 2023-06-25 10:56:17,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971276.0, ans=0.1 2023-06-25 10:56:29,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1971336.0, ans=0.125 2023-06-25 10:56:39,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1971336.0, ans=0.2 2023-06-25 10:57:03,874 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.75 vs. 
limit=22.5 2023-06-25 10:57:31,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1971456.0, ans=0.0 2023-06-25 10:57:54,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-25 10:57:56,530 INFO [train.py:996] (1/4) Epoch 11, batch 23650, loss[loss=0.1796, simple_loss=0.2666, pruned_loss=0.04629, over 21296.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3061, pruned_loss=0.08063, over 4256385.70 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:58:04,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1971576.0, ans=0.125 2023-06-25 10:59:07,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1971756.0, ans=0.0 2023-06-25 10:59:18,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1971756.0, ans=0.125 2023-06-25 10:59:33,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1971816.0, ans=0.125 2023-06-25 10:59:36,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971816.0, ans=0.1 2023-06-25 10:59:41,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1971816.0, ans=0.125 2023-06-25 10:59:44,393 INFO [train.py:996] (1/4) Epoch 11, batch 23700, loss[loss=0.1573, simple_loss=0.2373, pruned_loss=0.03867, over 19895.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3071, pruned_loss=0.07969, over 4257858.76 frames. ], batch size: 704, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:59:56,594 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.722e+02 1.155e+03 1.933e+03 4.444e+03, threshold=2.311e+03, percent-clipped=20.0 2023-06-25 11:00:13,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1971936.0, ans=0.125 2023-06-25 11:01:13,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 11:01:39,855 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.24 vs. limit=10.0 2023-06-25 11:01:40,524 INFO [train.py:996] (1/4) Epoch 11, batch 23750, loss[loss=0.1974, simple_loss=0.2984, pruned_loss=0.04818, over 20738.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3101, pruned_loss=0.07959, over 4260578.20 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:01:41,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1972176.0, ans=0.2 2023-06-25 11:02:48,208 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-25 11:03:28,684 INFO [train.py:996] (1/4) Epoch 11, batch 23800, loss[loss=0.2108, simple_loss=0.3041, pruned_loss=0.05881, over 21627.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3083, pruned_loss=0.07764, over 4265649.79 frames. 
], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:03:35,190 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.508e+02 9.725e+02 1.368e+03 2.347e+03 4.369e+03, threshold=2.737e+03, percent-clipped=25.0 2023-06-25 11:04:01,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972536.0, ans=0.1 2023-06-25 11:04:35,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1972656.0, ans=0.0 2023-06-25 11:05:05,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972716.0, ans=0.125 2023-06-25 11:05:16,768 INFO [train.py:996] (1/4) Epoch 11, batch 23850, loss[loss=0.306, simple_loss=0.3757, pruned_loss=0.1182, over 21762.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3148, pruned_loss=0.07946, over 4267518.77 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:06:13,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1972896.0, ans=0.125 2023-06-25 11:06:17,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1972896.0, ans=0.0 2023-06-25 11:06:30,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1972956.0, ans=0.07 2023-06-25 11:07:14,247 INFO [train.py:996] (1/4) Epoch 11, batch 23900, loss[loss=0.2548, simple_loss=0.3397, pruned_loss=0.0849, over 21557.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3201, pruned_loss=0.08083, over 4269417.47 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:07:20,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.632e+02 1.020e+03 1.662e+03 2.575e+03 5.101e+03, threshold=3.324e+03, percent-clipped=22.0 2023-06-25 11:07:52,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-25 11:08:06,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1973196.0, ans=0.02 2023-06-25 11:08:15,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1973256.0, ans=0.125 2023-06-25 11:08:20,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1973256.0, ans=0.0 2023-06-25 11:08:48,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1973316.0, ans=0.95 2023-06-25 11:08:51,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1973316.0, ans=0.0 2023-06-25 11:08:54,234 INFO [train.py:996] (1/4) Epoch 11, batch 23950, loss[loss=0.2773, simple_loss=0.3271, pruned_loss=0.1138, over 21271.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.316, pruned_loss=0.08115, over 4268435.56 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:09:08,518 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. 
limit=15.0 2023-06-25 11:09:34,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1973436.0, ans=0.0 2023-06-25 11:10:10,371 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1973556.0, ans=0.015 2023-06-25 11:10:14,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-06-25 11:10:47,929 INFO [train.py:996] (1/4) Epoch 11, batch 24000, loss[loss=0.3263, simple_loss=0.3733, pruned_loss=0.1396, over 21471.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3173, pruned_loss=0.08402, over 4264979.34 frames. ], batch size: 510, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:10:47,929 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 11:11:07,124 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.263, simple_loss=0.3578, pruned_loss=0.08405, over 1796401.00 frames. 2023-06-25 11:11:07,125 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 11:11:09,794 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1973676.0, ans=0.125 2023-06-25 11:11:14,109 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 7.509e+02 1.143e+03 1.580e+03 3.381e+03, threshold=2.286e+03, percent-clipped=1.0 2023-06-25 11:11:29,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1973736.0, ans=0.125 2023-06-25 11:11:46,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1973796.0, ans=0.125 2023-06-25 11:12:55,332 INFO [train.py:996] (1/4) Epoch 11, batch 24050, loss[loss=0.197, simple_loss=0.2747, pruned_loss=0.05962, over 21237.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3187, pruned_loss=0.08477, over 4261300.31 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:13:46,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1974096.0, ans=0.125 2023-06-25 11:14:44,605 INFO [train.py:996] (1/4) Epoch 11, batch 24100, loss[loss=0.294, simple_loss=0.3688, pruned_loss=0.1096, over 21736.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3191, pruned_loss=0.08334, over 4268232.93 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:14:50,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.104e+02 8.872e+02 1.198e+03 1.771e+03 4.014e+03, threshold=2.396e+03, percent-clipped=16.0 2023-06-25 11:15:07,725 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:15:30,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1974396.0, ans=0.1 2023-06-25 11:15:32,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1974396.0, ans=0.125 2023-06-25 11:15:36,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1974396.0, ans=0.125 2023-06-25 11:15:55,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1974456.0, ans=0.2 2023-06-25 11:16:29,024 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 11:16:29,672 INFO [train.py:996] (1/4) Epoch 11, batch 24150, loss[loss=0.2437, simple_loss=0.3223, pruned_loss=0.08254, over 21790.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3177, pruned_loss=0.08381, over 4270861.12 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:16:35,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1974576.0, ans=0.02 2023-06-25 11:17:10,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-25 11:17:27,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974696.0, ans=0.1 2023-06-25 11:18:20,283 INFO [train.py:996] (1/4) Epoch 11, batch 24200, loss[loss=0.245, simple_loss=0.3169, pruned_loss=0.08652, over 21235.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3212, pruned_loss=0.08465, over 4275653.89 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:18:34,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.944e+02 9.608e+02 1.226e+03 1.956e+03 3.417e+03, threshold=2.452e+03, percent-clipped=15.0 2023-06-25 11:19:28,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1975056.0, ans=0.1 2023-06-25 11:19:58,573 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. limit=15.0 2023-06-25 11:20:15,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-25 11:20:16,266 INFO [train.py:996] (1/4) Epoch 11, batch 24250, loss[loss=0.2112, simple_loss=0.3115, pruned_loss=0.05551, over 21754.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3179, pruned_loss=0.07985, over 4278638.79 frames. 
], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:20:21,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1975176.0, ans=0.125 2023-06-25 11:21:29,356 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1975356.0, ans=0.125 2023-06-25 11:21:55,669 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:22:04,809 INFO [train.py:996] (1/4) Epoch 11, batch 24300, loss[loss=0.1682, simple_loss=0.2365, pruned_loss=0.04995, over 21831.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3129, pruned_loss=0.07466, over 4270207.88 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:22:12,743 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.478e+02 1.137e+03 1.748e+03 3.902e+03, threshold=2.274e+03, percent-clipped=10.0 2023-06-25 11:23:04,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1975596.0, ans=0.125 2023-06-25 11:23:13,922 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1975656.0, ans=0.125 2023-06-25 11:23:18,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1975656.0, ans=0.2 2023-06-25 11:23:41,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1975716.0, ans=0.0 2023-06-25 11:23:46,303 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-25 11:23:49,733 INFO [train.py:996] (1/4) Epoch 11, batch 24350, loss[loss=0.2393, simple_loss=0.3087, pruned_loss=0.08488, over 21888.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3108, pruned_loss=0.07462, over 4270907.68 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:23:53,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1975776.0, ans=0.2 2023-06-25 11:24:19,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1975836.0, ans=0.125 2023-06-25 11:24:19,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1975836.0, ans=0.125 2023-06-25 11:24:45,968 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:25:24,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1976016.0, ans=0.0 2023-06-25 11:25:43,152 INFO [train.py:996] (1/4) Epoch 11, batch 24400, loss[loss=0.2236, simple_loss=0.2843, pruned_loss=0.08141, over 20204.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.312, pruned_loss=0.07721, over 4269471.43 frames. 
], batch size: 707, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:26:00,906 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+02 8.688e+02 1.209e+03 1.955e+03 3.228e+03, threshold=2.419e+03, percent-clipped=16.0 2023-06-25 11:26:03,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976076.0, ans=0.1 2023-06-25 11:26:44,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1976256.0, ans=0.0 2023-06-25 11:26:56,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1976256.0, ans=0.125 2023-06-25 11:27:19,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-25 11:27:20,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1976316.0, ans=0.125 2023-06-25 11:27:24,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976316.0, ans=0.1 2023-06-25 11:27:26,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976316.0, ans=0.1 2023-06-25 11:27:36,892 INFO [train.py:996] (1/4) Epoch 11, batch 24450, loss[loss=0.2193, simple_loss=0.299, pruned_loss=0.06978, over 21460.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3144, pruned_loss=0.07849, over 4267738.13 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:28:47,335 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:29:05,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1976616.0, ans=0.2 2023-06-25 11:29:07,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1976616.0, ans=0.125 2023-06-25 11:29:25,320 INFO [train.py:996] (1/4) Epoch 11, batch 24500, loss[loss=0.2135, simple_loss=0.3085, pruned_loss=0.05924, over 21680.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.315, pruned_loss=0.07907, over 4268072.57 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:29:29,622 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-25 11:29:34,928 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.294e+02 9.026e+02 1.332e+03 4.707e+03, threshold=1.805e+03, percent-clipped=7.0 2023-06-25 11:29:48,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1976736.0, ans=0.125 2023-06-25 11:31:02,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1976916.0, ans=0.125 2023-06-25 11:31:11,570 INFO [train.py:996] (1/4) Epoch 11, batch 24550, loss[loss=0.2478, simple_loss=0.3245, pruned_loss=0.08559, over 21809.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3181, pruned_loss=0.08104, over 4275616.19 frames. 
], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:31:35,405 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1977036.0, ans=0.125 2023-06-25 11:31:57,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1977096.0, ans=0.2 2023-06-25 11:32:02,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1977096.0, ans=0.0 2023-06-25 11:32:06,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1977096.0, ans=0.2 2023-06-25 11:33:02,827 INFO [train.py:996] (1/4) Epoch 11, batch 24600, loss[loss=0.1868, simple_loss=0.2545, pruned_loss=0.05958, over 21356.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3138, pruned_loss=0.08042, over 4279789.63 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:33:13,023 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.421e+02 1.303e+03 2.147e+03 3.735e+03, threshold=2.606e+03, percent-clipped=31.0 2023-06-25 11:33:20,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-25 11:33:47,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1977396.0, ans=0.0 2023-06-25 11:34:25,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977456.0, ans=0.1 2023-06-25 11:34:31,365 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-25 11:34:51,978 INFO [train.py:996] (1/4) Epoch 11, batch 24650, loss[loss=0.2199, simple_loss=0.2761, pruned_loss=0.08185, over 21274.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3072, pruned_loss=0.07877, over 4274265.78 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:35:08,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-25 11:35:14,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-25 11:36:17,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-25 11:36:36,911 INFO [train.py:996] (1/4) Epoch 11, batch 24700, loss[loss=0.253, simple_loss=0.306, pruned_loss=0.1, over 21364.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3033, pruned_loss=0.07781, over 4266800.16 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:36:46,544 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.759e+02 8.060e+02 1.267e+03 1.761e+03 3.816e+03, threshold=2.533e+03, percent-clipped=4.0 2023-06-25 11:37:17,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977996.0, ans=0.1 2023-06-25 11:38:17,806 INFO [train.py:996] (1/4) Epoch 11, batch 24750, loss[loss=0.1933, simple_loss=0.2586, pruned_loss=0.06398, over 21348.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.2969, pruned_loss=0.07596, over 4263167.73 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:38:39,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1978236.0, ans=0.1 2023-06-25 11:39:05,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1978296.0, ans=0.125 2023-06-25 11:39:57,347 INFO [train.py:996] (1/4) Epoch 11, batch 24800, loss[loss=0.2454, simple_loss=0.3241, pruned_loss=0.08339, over 21228.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2915, pruned_loss=0.07522, over 4264017.05 frames. ], batch size: 549, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 11:40:14,476 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 6.530e+02 8.946e+02 1.365e+03 3.586e+03, threshold=1.789e+03, percent-clipped=4.0 2023-06-25 11:40:16,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1978476.0, ans=0.125 2023-06-25 11:40:44,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1978596.0, ans=0.5 2023-06-25 11:41:30,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978716.0, ans=0.1 2023-06-25 11:41:48,343 INFO [train.py:996] (1/4) Epoch 11, batch 24850, loss[loss=0.2148, simple_loss=0.2781, pruned_loss=0.07569, over 21492.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2894, pruned_loss=0.07513, over 4262864.46 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:41:57,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1978776.0, ans=0.125 2023-06-25 11:42:00,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1978776.0, ans=0.125 2023-06-25 11:42:02,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1978776.0, ans=0.125 2023-06-25 11:42:23,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1978836.0, ans=0.0 2023-06-25 11:42:34,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1978896.0, ans=0.2 2023-06-25 11:43:34,978 INFO [train.py:996] (1/4) Epoch 11, batch 24900, loss[loss=0.2375, simple_loss=0.3166, pruned_loss=0.07921, over 21588.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2929, pruned_loss=0.07603, over 4273177.81 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:43:40,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1979076.0, ans=0.2 2023-06-25 11:43:47,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.897e+02 9.704e+02 1.419e+03 1.998e+03 4.449e+03, threshold=2.839e+03, percent-clipped=31.0 2023-06-25 11:44:44,578 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=15.0 2023-06-25 11:44:52,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979316.0, ans=0.1 2023-06-25 11:45:12,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1979316.0, ans=0.125 2023-06-25 11:45:15,044 INFO [train.py:996] (1/4) Epoch 11, batch 24950, loss[loss=0.2454, simple_loss=0.3167, pruned_loss=0.08708, over 21794.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3005, pruned_loss=0.07969, over 4274901.22 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:45:53,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1979436.0, ans=0.125 2023-06-25 11:45:55,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1979496.0, ans=0.125 2023-06-25 11:46:19,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979496.0, ans=0.1 2023-06-25 11:46:19,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1979496.0, ans=0.125 2023-06-25 11:46:45,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1979616.0, ans=0.0 2023-06-25 11:46:47,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1979616.0, ans=0.125 2023-06-25 11:46:58,072 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-25 11:47:03,413 INFO [train.py:996] (1/4) Epoch 11, batch 25000, loss[loss=0.2381, simple_loss=0.3024, pruned_loss=0.08684, over 21562.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3079, pruned_loss=0.08175, over 4279212.38 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:47:08,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979676.0, ans=0.1 2023-06-25 11:47:23,263 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 7.420e+02 9.724e+02 1.691e+03 3.300e+03, threshold=1.945e+03, percent-clipped=1.0 2023-06-25 11:47:36,741 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1979736.0, ans=0.125 2023-06-25 11:47:36,743 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1979736.0, ans=0.0 2023-06-25 11:48:43,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1979916.0, ans=0.125 2023-06-25 11:48:48,903 INFO [train.py:996] (1/4) Epoch 11, batch 25050, loss[loss=0.2466, simple_loss=0.2963, pruned_loss=0.09844, over 21219.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3026, pruned_loss=0.08082, over 4285699.61 frames. 
], batch size: 471, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:49:49,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1980096.0, ans=0.0 2023-06-25 11:49:51,451 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-25 11:49:56,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1980096.0, ans=0.5 2023-06-25 11:50:18,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1980156.0, ans=0.0 2023-06-25 11:50:37,649 INFO [train.py:996] (1/4) Epoch 11, batch 25100, loss[loss=0.218, simple_loss=0.3009, pruned_loss=0.06755, over 21583.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2986, pruned_loss=0.07991, over 4285526.27 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:50:58,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.064e+02 8.337e+02 1.105e+03 1.657e+03 3.761e+03, threshold=2.211e+03, percent-clipped=17.0 2023-06-25 11:51:05,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1980336.0, ans=0.09899494936611666 2023-06-25 11:51:08,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1980336.0, ans=0.0 2023-06-25 11:51:38,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1980396.0, ans=0.0 2023-06-25 11:52:07,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.12 vs. limit=15.0 2023-06-25 11:52:22,296 INFO [train.py:996] (1/4) Epoch 11, batch 25150, loss[loss=0.1937, simple_loss=0.2814, pruned_loss=0.05301, over 21658.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3008, pruned_loss=0.07782, over 4287018.72 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:52:26,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980576.0, ans=0.1 2023-06-25 11:53:18,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1980696.0, ans=0.0 2023-06-25 11:53:36,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1980756.0, ans=0.0 2023-06-25 11:53:39,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980756.0, ans=0.1 2023-06-25 11:54:00,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1980816.0, ans=0.125 2023-06-25 11:54:08,597 INFO [train.py:996] (1/4) Epoch 11, batch 25200, loss[loss=0.1979, simple_loss=0.2922, pruned_loss=0.05187, over 21690.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.301, pruned_loss=0.07637, over 4277781.70 frames. 
], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:54:20,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980876.0, ans=0.1 2023-06-25 11:54:21,787 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.466e+02 1.183e+03 1.682e+03 4.504e+03, threshold=2.365e+03, percent-clipped=14.0 2023-06-25 11:54:56,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1980996.0, ans=0.2 2023-06-25 11:55:19,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1981056.0, ans=0.125 2023-06-25 11:55:56,179 INFO [train.py:996] (1/4) Epoch 11, batch 25250, loss[loss=0.1752, simple_loss=0.2563, pruned_loss=0.04701, over 21589.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.299, pruned_loss=0.07458, over 4271434.04 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:55:58,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1981176.0, ans=0.125 2023-06-25 11:56:01,869 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1981176.0, ans=0.125 2023-06-25 11:56:08,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1981176.0, ans=0.05 2023-06-25 11:56:12,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1981236.0, ans=0.0 2023-06-25 11:56:25,500 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1981236.0, ans=0.125 2023-06-25 11:57:40,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1981416.0, ans=0.125 2023-06-25 11:57:44,440 INFO [train.py:996] (1/4) Epoch 11, batch 25300, loss[loss=0.1936, simple_loss=0.252, pruned_loss=0.0676, over 20735.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2981, pruned_loss=0.0743, over 4274933.46 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:57:56,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1981476.0, ans=0.125 2023-06-25 11:57:57,600 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.052e+02 7.897e+02 1.317e+03 1.738e+03 3.362e+03, threshold=2.634e+03, percent-clipped=11.0 2023-06-25 11:58:16,326 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1981536.0, ans=0.0 2023-06-25 11:58:59,693 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-25 11:59:07,931 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-25 11:59:32,074 INFO [train.py:996] (1/4) Epoch 11, batch 25350, loss[loss=0.2316, simple_loss=0.3164, pruned_loss=0.07342, over 20691.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2988, pruned_loss=0.07363, over 4261977.83 frames. 
], batch size: 607, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:00:34,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1981896.0, ans=0.0 2023-06-25 12:01:17,480 INFO [train.py:996] (1/4) Epoch 11, batch 25400, loss[loss=0.2184, simple_loss=0.276, pruned_loss=0.08041, over 21207.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2971, pruned_loss=0.07364, over 4263934.74 frames. ], batch size: 144, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:01:29,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1982076.0, ans=0.125 2023-06-25 12:01:37,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 9.282e+02 1.307e+03 1.888e+03 3.568e+03, threshold=2.613e+03, percent-clipped=8.0 2023-06-25 12:01:40,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982136.0, ans=0.125 2023-06-25 12:01:49,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1982136.0, ans=0.125 2023-06-25 12:02:04,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1982196.0, ans=0.1 2023-06-25 12:02:45,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1982316.0, ans=0.09899494936611666 2023-06-25 12:02:54,914 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1982316.0, ans=0.0 2023-06-25 12:03:02,689 INFO [train.py:996] (1/4) Epoch 11, batch 25450, loss[loss=0.2175, simple_loss=0.3094, pruned_loss=0.06284, over 21385.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2986, pruned_loss=0.0749, over 4259940.55 frames. ], batch size: 194, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:03:46,974 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1982496.0, ans=0.125 2023-06-25 12:04:49,953 INFO [train.py:996] (1/4) Epoch 11, batch 25500, loss[loss=0.2192, simple_loss=0.2931, pruned_loss=0.07261, over 19963.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2981, pruned_loss=0.07196, over 4252316.66 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:05:10,404 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.694e+02 1.169e+03 1.712e+03 3.614e+03, threshold=2.338e+03, percent-clipped=5.0 2023-06-25 12:05:26,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1982736.0, ans=0.125 2023-06-25 12:06:18,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1982916.0, ans=0.1 2023-06-25 12:06:34,938 INFO [train.py:996] (1/4) Epoch 11, batch 25550, loss[loss=0.2158, simple_loss=0.3004, pruned_loss=0.06562, over 21199.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.305, pruned_loss=0.07273, over 4229981.64 frames. 
], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:07:13,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1983036.0, ans=0.125 2023-06-25 12:07:27,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1983096.0, ans=0.125 2023-06-25 12:07:39,723 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-25 12:07:41,537 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-25 12:08:38,091 INFO [train.py:996] (1/4) Epoch 11, batch 25600, loss[loss=0.276, simple_loss=0.3469, pruned_loss=0.1026, over 21805.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3097, pruned_loss=0.07386, over 4243693.57 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:08:44,143 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-25 12:08:52,925 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 7.738e+02 1.030e+03 1.718e+03 3.511e+03, threshold=2.059e+03, percent-clipped=11.0 2023-06-25 12:09:01,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1983336.0, ans=0.125 2023-06-25 12:09:07,580 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1983336.0, ans=0.125 2023-06-25 12:09:48,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1983456.0, ans=0.0 2023-06-25 12:10:24,087 INFO [train.py:996] (1/4) Epoch 11, batch 25650, loss[loss=0.2156, simple_loss=0.2786, pruned_loss=0.0763, over 21620.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3104, pruned_loss=0.07611, over 4243704.31 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:10:31,252 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1983576.0, ans=0.125 2023-06-25 12:12:11,598 INFO [train.py:996] (1/4) Epoch 11, batch 25700, loss[loss=0.2421, simple_loss=0.3066, pruned_loss=0.08879, over 21854.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3061, pruned_loss=0.07654, over 4250580.57 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:12:38,814 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.851e+02 8.245e+02 1.134e+03 1.562e+03 3.915e+03, threshold=2.269e+03, percent-clipped=11.0 2023-06-25 12:13:12,585 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1984056.0, ans=0.125 2023-06-25 12:14:01,420 INFO [train.py:996] (1/4) Epoch 11, batch 25750, loss[loss=0.3875, simple_loss=0.4382, pruned_loss=0.1684, over 21371.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3114, pruned_loss=0.07991, over 4258480.87 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:14:43,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. 
limit=15.0 2023-06-25 12:14:53,304 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.32 vs. limit=12.0 2023-06-25 12:15:56,825 INFO [train.py:996] (1/4) Epoch 11, batch 25800, loss[loss=0.2317, simple_loss=0.2923, pruned_loss=0.08551, over 20147.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3233, pruned_loss=0.08474, over 4266465.28 frames. ], batch size: 707, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:15:59,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-25 12:16:12,252 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.965e+02 1.490e+03 2.590e+03 4.866e+03, threshold=2.981e+03, percent-clipped=29.0 2023-06-25 12:17:30,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1984716.0, ans=0.0 2023-06-25 12:17:45,337 INFO [train.py:996] (1/4) Epoch 11, batch 25850, loss[loss=0.2545, simple_loss=0.3166, pruned_loss=0.09616, over 21826.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3247, pruned_loss=0.0839, over 4276716.31 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:18:10,376 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-25 12:19:28,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1985016.0, ans=0.1 2023-06-25 12:19:33,844 INFO [train.py:996] (1/4) Epoch 11, batch 25900, loss[loss=0.2741, simple_loss=0.3689, pruned_loss=0.08961, over 21828.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3262, pruned_loss=0.08413, over 4286334.94 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:19:54,641 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.473e+02 1.223e+03 1.634e+03 2.981e+03, threshold=2.447e+03, percent-clipped=0.0 2023-06-25 12:19:56,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1985136.0, ans=0.0 2023-06-25 12:20:33,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1985196.0, ans=0.09899494936611666 2023-06-25 12:20:48,965 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-25 12:21:09,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1985316.0, ans=0.125 2023-06-25 12:21:21,891 INFO [train.py:996] (1/4) Epoch 11, batch 25950, loss[loss=0.3193, simple_loss=0.381, pruned_loss=0.1288, over 21418.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3305, pruned_loss=0.08672, over 4284238.88 frames. 
], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:22:12,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1985436.0, ans=10.0 2023-06-25 12:22:52,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1985556.0, ans=0.0 2023-06-25 12:23:08,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-25 12:23:18,155 INFO [train.py:996] (1/4) Epoch 11, batch 26000, loss[loss=0.2215, simple_loss=0.3205, pruned_loss=0.06131, over 21798.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3291, pruned_loss=0.08474, over 4274504.81 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:23:40,959 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.867e+02 1.001e+03 1.506e+03 3.925e+03, threshold=2.003e+03, percent-clipped=6.0 2023-06-25 12:24:12,302 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1985796.0, ans=0.015 2023-06-25 12:24:50,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-25 12:24:56,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1985916.0, ans=0.1 2023-06-25 12:25:03,162 INFO [train.py:996] (1/4) Epoch 11, batch 26050, loss[loss=0.2313, simple_loss=0.3017, pruned_loss=0.08048, over 21872.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3297, pruned_loss=0.08651, over 4280913.94 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:25:51,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-25 12:26:00,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1986096.0, ans=0.0 2023-06-25 12:26:05,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1986096.0, ans=0.04949747468305833 2023-06-25 12:26:12,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-25 12:26:21,847 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1986156.0, ans=0.125 2023-06-25 12:26:30,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.20 vs. limit=15.0 2023-06-25 12:26:47,697 INFO [train.py:996] (1/4) Epoch 11, batch 26100, loss[loss=0.255, simple_loss=0.3145, pruned_loss=0.09774, over 21913.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3249, pruned_loss=0.08658, over 4289045.99 frames. 
], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:27:09,757 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 7.535e+02 1.106e+03 1.701e+03 2.759e+03, threshold=2.213e+03, percent-clipped=15.0 2023-06-25 12:27:25,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1986336.0, ans=0.125 2023-06-25 12:27:48,395 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-25 12:27:52,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1986396.0, ans=0.125 2023-06-25 12:28:20,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.78 vs. limit=6.0 2023-06-25 12:28:39,866 INFO [train.py:996] (1/4) Epoch 11, batch 26150, loss[loss=0.2262, simple_loss=0.2944, pruned_loss=0.07905, over 17513.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3216, pruned_loss=0.08636, over 4285862.16 frames. ], batch size: 61, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:28:45,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1986576.0, ans=0.5 2023-06-25 12:28:47,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1986576.0, ans=0.125 2023-06-25 12:28:49,209 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1986576.0, ans=0.125 2023-06-25 12:28:49,224 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1986576.0, ans=0.2 2023-06-25 12:29:10,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1986636.0, ans=0.2 2023-06-25 12:30:00,379 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:30:26,230 INFO [train.py:996] (1/4) Epoch 11, batch 26200, loss[loss=0.2423, simple_loss=0.3651, pruned_loss=0.0597, over 20868.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3226, pruned_loss=0.08425, over 4288137.35 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:30:51,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1986876.0, ans=0.0 2023-06-25 12:30:53,732 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 7.785e+02 1.042e+03 1.454e+03 3.867e+03, threshold=2.084e+03, percent-clipped=10.0 2023-06-25 12:31:13,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1986996.0, ans=0.0 2023-06-25 12:31:56,289 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1987116.0, ans=0.125 2023-06-25 12:32:10,622 INFO [train.py:996] (1/4) Epoch 11, batch 26250, loss[loss=0.2095, simple_loss=0.2867, pruned_loss=0.06617, over 21486.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3239, pruned_loss=0.08253, over 4285463.33 frames. 
], batch size: 194, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:32:20,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1987176.0, ans=0.2 2023-06-25 12:32:43,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1987236.0, ans=0.1 2023-06-25 12:33:08,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1987296.0, ans=0.125 2023-06-25 12:33:13,935 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-25 12:33:17,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.16 vs. limit=15.0 2023-06-25 12:33:57,626 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-25 12:34:04,636 INFO [train.py:996] (1/4) Epoch 11, batch 26300, loss[loss=0.228, simple_loss=0.2992, pruned_loss=0.07842, over 21968.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.322, pruned_loss=0.08309, over 4283690.12 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:34:26,039 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.433e+02 7.825e+02 1.057e+03 1.626e+03 4.026e+03, threshold=2.114e+03, percent-clipped=11.0 2023-06-25 12:34:31,502 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1987536.0, ans=0.0 2023-06-25 12:34:48,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1987596.0, ans=0.125 2023-06-25 12:34:49,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1987596.0, ans=0.07 2023-06-25 12:35:26,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1987716.0, ans=0.125 2023-06-25 12:35:32,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1987716.0, ans=0.1 2023-06-25 12:35:43,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1987716.0, ans=0.125 2023-06-25 12:35:48,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1987776.0, ans=0.0 2023-06-25 12:35:49,292 INFO [train.py:996] (1/4) Epoch 11, batch 26350, loss[loss=0.2482, simple_loss=0.3108, pruned_loss=0.09283, over 19945.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3188, pruned_loss=0.08313, over 4282835.46 frames. 
], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:36:12,836 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1987836.0, ans=0.125 2023-06-25 12:36:22,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1987896.0, ans=0.125 2023-06-25 12:36:41,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1987896.0, ans=0.125 2023-06-25 12:37:31,610 INFO [train.py:996] (1/4) Epoch 11, batch 26400, loss[loss=0.2353, simple_loss=0.29, pruned_loss=0.09026, over 21820.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.315, pruned_loss=0.08443, over 4274004.70 frames. ], batch size: 98, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:37:50,231 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 8.065e+02 9.903e+02 1.362e+03 2.931e+03, threshold=1.981e+03, percent-clipped=5.0 2023-06-25 12:37:59,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1988136.0, ans=0.125 2023-06-25 12:38:03,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1988136.0, ans=0.2 2023-06-25 12:38:07,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1988196.0, ans=0.2 2023-06-25 12:38:26,544 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988196.0, ans=0.1 2023-06-25 12:38:27,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-25 12:39:22,505 INFO [train.py:996] (1/4) Epoch 11, batch 26450, loss[loss=0.2418, simple_loss=0.3494, pruned_loss=0.06705, over 21744.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3146, pruned_loss=0.08379, over 4261205.11 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:40:27,454 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-25 12:40:56,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1988616.0, ans=0.1 2023-06-25 12:41:04,407 INFO [train.py:996] (1/4) Epoch 11, batch 26500, loss[loss=0.2068, simple_loss=0.2597, pruned_loss=0.07695, over 20040.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3171, pruned_loss=0.08228, over 4262454.44 frames. ], batch size: 704, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:41:10,923 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. 
limit=6.0 2023-06-25 12:41:34,322 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.135e+02 9.452e+02 1.417e+03 2.268e+03 5.584e+03, threshold=2.834e+03, percent-clipped=34.0 2023-06-25 12:42:11,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1988796.0, ans=0.125 2023-06-25 12:42:34,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1988916.0, ans=0.125 2023-06-25 12:42:45,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1988916.0, ans=0.125 2023-06-25 12:43:02,825 INFO [train.py:996] (1/4) Epoch 11, batch 26550, loss[loss=0.1863, simple_loss=0.2681, pruned_loss=0.05229, over 21617.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3152, pruned_loss=0.07967, over 4261586.84 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:44:55,006 INFO [train.py:996] (1/4) Epoch 11, batch 26600, loss[loss=0.1972, simple_loss=0.2734, pruned_loss=0.06052, over 21473.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.313, pruned_loss=0.07706, over 4261945.67 frames. ], batch size: 230, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:45:18,706 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.580e+02 8.378e+02 1.280e+03 1.887e+03 4.610e+03, threshold=2.560e+03, percent-clipped=7.0 2023-06-25 12:45:19,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1989336.0, ans=0.025 2023-06-25 12:46:41,121 INFO [train.py:996] (1/4) Epoch 11, batch 26650, loss[loss=0.1765, simple_loss=0.2596, pruned_loss=0.0467, over 21657.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3066, pruned_loss=0.07628, over 4265571.13 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:47:19,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1989696.0, ans=0.125 2023-06-25 12:48:26,227 INFO [train.py:996] (1/4) Epoch 11, batch 26700, loss[loss=0.2801, simple_loss=0.3438, pruned_loss=0.1081, over 21859.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2992, pruned_loss=0.07273, over 4276867.62 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:48:38,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1989876.0, ans=0.2 2023-06-25 12:48:49,310 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=15.0 2023-06-25 12:48:49,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.698e+02 1.295e+03 2.536e+03, threshold=1.740e+03, percent-clipped=0.0 2023-06-25 12:49:03,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1989996.0, ans=0.125 2023-06-25 12:49:13,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1989996.0, ans=0.125 2023-06-25 12:49:53,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1990116.0, ans=0.125 2023-06-25 12:50:00,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1990116.0, ans=0.125 2023-06-25 12:50:02,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1990116.0, ans=0.125 2023-06-25 12:50:02,996 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-25 12:50:13,228 INFO [train.py:996] (1/4) Epoch 11, batch 26750, loss[loss=0.1935, simple_loss=0.2657, pruned_loss=0.06062, over 22017.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2991, pruned_loss=0.07227, over 4283999.77 frames. ], batch size: 103, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:50:30,906 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-25 12:50:32,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.26 vs. limit=22.5 2023-06-25 12:51:18,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1990356.0, ans=0.0 2023-06-25 12:51:30,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1990356.0, ans=0.125 2023-06-25 12:51:47,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1990416.0, ans=0.0 2023-06-25 12:51:54,746 INFO [train.py:996] (1/4) Epoch 11, batch 26800, loss[loss=0.3158, simple_loss=0.3699, pruned_loss=0.1308, over 21308.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3072, pruned_loss=0.07705, over 4280551.39 frames. 
], batch size: 507, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:51:55,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1990476.0, ans=0.0 2023-06-25 12:52:09,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1990476.0, ans=0.125 2023-06-25 12:52:15,293 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.736e+02 1.158e+03 1.774e+03 3.470e+03, threshold=2.315e+03, percent-clipped=25.0 2023-06-25 12:53:08,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1990656.0, ans=0.125 2023-06-25 12:53:23,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1990656.0, ans=0.125 2023-06-25 12:53:42,539 INFO [train.py:996] (1/4) Epoch 11, batch 26850, loss[loss=0.2275, simple_loss=0.2806, pruned_loss=0.08721, over 21395.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3089, pruned_loss=0.0798, over 4275058.55 frames. ], batch size: 473, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:54:15,062 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1990836.0, ans=0.1 2023-06-25 12:54:17,034 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-25 12:54:27,170 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-25 12:55:27,807 INFO [train.py:996] (1/4) Epoch 11, batch 26900, loss[loss=0.2079, simple_loss=0.2672, pruned_loss=0.07426, over 21132.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3001, pruned_loss=0.07831, over 4257052.18 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:55:36,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991076.0, ans=0.1 2023-06-25 12:55:42,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1991136.0, ans=0.125 2023-06-25 12:55:47,034 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 7.282e+02 8.869e+02 1.344e+03 2.683e+03, threshold=1.774e+03, percent-clipped=1.0 2023-06-25 12:55:55,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1991136.0, ans=0.07 2023-06-25 12:56:38,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1991256.0, ans=0.125 2023-06-25 12:56:40,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1991256.0, ans=0.035 2023-06-25 12:56:55,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-25 12:57:13,456 INFO [train.py:996] (1/4) Epoch 11, batch 26950, loss[loss=0.2459, simple_loss=0.3346, pruned_loss=0.07863, over 21712.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2977, pruned_loss=0.07732, over 4258345.77 frames. 
], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:57:17,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1991376.0, ans=0.07 2023-06-25 12:57:22,114 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:57:30,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1991436.0, ans=0.025 2023-06-25 12:57:48,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1991436.0, ans=0.125 2023-06-25 12:57:52,086 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1991496.0, ans=0.0 2023-06-25 12:58:21,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-25 12:58:59,441 INFO [train.py:996] (1/4) Epoch 11, batch 27000, loss[loss=0.1994, simple_loss=0.3078, pruned_loss=0.04556, over 20757.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2988, pruned_loss=0.075, over 4266776.65 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 12:58:59,442 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 12:59:15,222 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([3.9925, 3.5383, 3.6522, 3.8063], device='cuda:1') 2023-06-25 12:59:16,966 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.235, simple_loss=0.334, pruned_loss=0.06803, over 1796401.00 frames. 2023-06-25 12:59:16,967 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 12:59:55,910 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 9.015e+02 1.282e+03 1.827e+03 4.662e+03, threshold=2.565e+03, percent-clipped=27.0 2023-06-25 13:00:42,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1991856.0, ans=0.2 2023-06-25 13:01:06,449 INFO [train.py:996] (1/4) Epoch 11, batch 27050, loss[loss=0.2285, simple_loss=0.3076, pruned_loss=0.07467, over 21462.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2999, pruned_loss=0.0718, over 4262612.00 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:01:58,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1992096.0, ans=0.125 2023-06-25 13:02:38,509 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1992216.0, ans=0.125 2023-06-25 13:02:53,341 INFO [train.py:996] (1/4) Epoch 11, batch 27100, loss[loss=0.229, simple_loss=0.3104, pruned_loss=0.07377, over 21193.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3017, pruned_loss=0.07323, over 4269194.67 frames. 
], batch size: 143, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:03:06,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1992276.0, ans=0.125 2023-06-25 13:03:24,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992336.0, ans=0.1 2023-06-25 13:03:27,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.097e+02 9.568e+02 1.359e+03 2.016e+03 3.804e+03, threshold=2.717e+03, percent-clipped=7.0 2023-06-25 13:03:51,516 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-25 13:04:30,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1992516.0, ans=0.0 2023-06-25 13:04:35,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1992516.0, ans=0.0 2023-06-25 13:04:41,025 INFO [train.py:996] (1/4) Epoch 11, batch 27150, loss[loss=0.2622, simple_loss=0.3529, pruned_loss=0.08576, over 21729.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3151, pruned_loss=0.07674, over 4267457.29 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:05:10,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992636.0, ans=0.1 2023-06-25 13:05:25,744 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-25 13:05:31,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1992696.0, ans=0.125 2023-06-25 13:05:48,352 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1992756.0, ans=0.125 2023-06-25 13:05:52,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-25 13:06:27,139 INFO [train.py:996] (1/4) Epoch 11, batch 27200, loss[loss=0.235, simple_loss=0.3186, pruned_loss=0.07569, over 21379.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3221, pruned_loss=0.07903, over 4275630.44 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:06:48,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1992876.0, ans=0.025 2023-06-25 13:07:01,206 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.854e+02 1.107e+03 1.912e+03 4.473e+03, threshold=2.214e+03, percent-clipped=8.0 2023-06-25 13:07:09,757 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 13:07:22,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. 
limit=15.0 2023-06-25 13:07:55,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1993116.0, ans=0.125 2023-06-25 13:08:27,029 INFO [train.py:996] (1/4) Epoch 11, batch 27250, loss[loss=0.2818, simple_loss=0.3521, pruned_loss=0.1057, over 21770.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.325, pruned_loss=0.08235, over 4278392.65 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:08:37,570 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-25 13:08:38,898 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:08:52,624 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1993236.0, ans=0.0 2023-06-25 13:08:53,434 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-06-25 13:08:59,824 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-25 13:09:01,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1993296.0, ans=0.125 2023-06-25 13:09:47,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1993356.0, ans=0.0 2023-06-25 13:09:54,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993356.0, ans=0.1 2023-06-25 13:10:01,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1993416.0, ans=0.0 2023-06-25 13:10:15,679 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1993416.0, ans=0.05 2023-06-25 13:10:17,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1993476.0, ans=0.125 2023-06-25 13:10:18,703 INFO [train.py:996] (1/4) Epoch 11, batch 27300, loss[loss=0.2121, simple_loss=0.3225, pruned_loss=0.05088, over 21749.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3262, pruned_loss=0.08287, over 4276284.56 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:10:26,605 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-25 13:10:32,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1993476.0, ans=0.025 2023-06-25 13:10:41,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1993536.0, ans=0.125 2023-06-25 13:10:46,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.500e+02 8.064e+02 1.048e+03 1.560e+03 3.072e+03, threshold=2.097e+03, percent-clipped=8.0 2023-06-25 13:10:58,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. 
limit=15.0 2023-06-25 13:11:54,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1993716.0, ans=0.125 2023-06-25 13:12:04,738 INFO [train.py:996] (1/4) Epoch 11, batch 27350, loss[loss=0.2606, simple_loss=0.336, pruned_loss=0.09255, over 21544.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3294, pruned_loss=0.08433, over 4277362.76 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:12:05,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1993776.0, ans=0.05 2023-06-25 13:12:15,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1993776.0, ans=0.125 2023-06-25 13:13:17,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993956.0, ans=0.1 2023-06-25 13:13:20,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993956.0, ans=0.1 2023-06-25 13:13:34,989 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-25 13:13:50,500 INFO [train.py:996] (1/4) Epoch 11, batch 27400, loss[loss=0.2435, simple_loss=0.3056, pruned_loss=0.09072, over 21698.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3256, pruned_loss=0.08399, over 4281021.59 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:14:08,574 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1994076.0, ans=0.0 2023-06-25 13:14:17,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.617e+02 1.033e+03 1.386e+03 3.217e+03, threshold=2.066e+03, percent-clipped=9.0 2023-06-25 13:14:25,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1994136.0, ans=0.125 2023-06-25 13:15:02,613 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1994256.0, ans=0.0 2023-06-25 13:15:11,241 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1994256.0, ans=0.125 2023-06-25 13:15:26,423 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1994316.0, ans=0.2 2023-06-25 13:15:35,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1994316.0, ans=0.0 2023-06-25 13:15:37,987 INFO [train.py:996] (1/4) Epoch 11, batch 27450, loss[loss=0.2843, simple_loss=0.3561, pruned_loss=0.1062, over 21363.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3191, pruned_loss=0.08255, over 4283811.90 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:16:51,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1994556.0, ans=0.025 2023-06-25 13:16:59,080 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.56 vs. 
limit=22.5 2023-06-25 13:17:23,735 INFO [train.py:996] (1/4) Epoch 11, batch 27500, loss[loss=0.2196, simple_loss=0.2861, pruned_loss=0.07661, over 21836.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3181, pruned_loss=0.08276, over 4287398.80 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:17:25,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1994676.0, ans=0.0 2023-06-25 13:17:47,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1994736.0, ans=0.125 2023-06-25 13:17:51,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.259e+02 1.005e+03 1.389e+03 2.816e+03, threshold=2.010e+03, percent-clipped=4.0 2023-06-25 13:18:53,416 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1994916.0, ans=0.125 2023-06-25 13:18:53,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1994916.0, ans=0.125 2023-06-25 13:19:07,759 INFO [train.py:996] (1/4) Epoch 11, batch 27550, loss[loss=0.1971, simple_loss=0.2756, pruned_loss=0.05924, over 21704.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3138, pruned_loss=0.08076, over 4284942.97 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:20:21,499 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.58 vs. limit=6.0 2023-06-25 13:20:30,680 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1995216.0, ans=0.0 2023-06-25 13:20:34,729 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-25 13:20:54,534 INFO [train.py:996] (1/4) Epoch 11, batch 27600, loss[loss=0.2912, simple_loss=0.3742, pruned_loss=0.1041, over 19892.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.309, pruned_loss=0.0801, over 4276714.85 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:21:03,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1995276.0, ans=0.0 2023-06-25 13:21:17,411 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.292e+02 9.329e+02 1.469e+03 1.993e+03 3.791e+03, threshold=2.938e+03, percent-clipped=25.0 2023-06-25 13:21:39,636 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1995396.0, ans=0.125 2023-06-25 13:21:42,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1995396.0, ans=0.0 2023-06-25 13:21:52,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1995396.0, ans=0.125 2023-06-25 13:22:14,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. 
limit=6.0 2023-06-25 13:22:18,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1995516.0, ans=0.125 2023-06-25 13:22:23,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1995516.0, ans=0.2 2023-06-25 13:22:27,936 INFO [train.py:996] (1/4) Epoch 11, batch 27650, loss[loss=0.2168, simple_loss=0.2744, pruned_loss=0.07959, over 21556.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3029, pruned_loss=0.07951, over 4274234.24 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:23:08,860 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-25 13:23:38,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-25 13:24:10,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1995816.0, ans=0.125 2023-06-25 13:24:19,720 INFO [train.py:996] (1/4) Epoch 11, batch 27700, loss[loss=0.1864, simple_loss=0.2774, pruned_loss=0.04773, over 21727.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3027, pruned_loss=0.07753, over 4278503.31 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:24:39,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1995936.0, ans=0.0 2023-06-25 13:24:43,468 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.277e+02 1.271e+03 1.738e+03 3.564e+03, threshold=2.542e+03, percent-clipped=2.0 2023-06-25 13:26:03,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1996116.0, ans=0.125 2023-06-25 13:26:05,579 INFO [train.py:996] (1/4) Epoch 11, batch 27750, loss[loss=0.1926, simple_loss=0.2716, pruned_loss=0.05682, over 21268.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3066, pruned_loss=0.07784, over 4277694.96 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:26:36,387 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-25 13:27:14,451 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1996356.0, ans=0.1 2023-06-25 13:27:43,151 INFO [train.py:996] (1/4) Epoch 11, batch 27800, loss[loss=0.2311, simple_loss=0.3017, pruned_loss=0.08026, over 21441.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3053, pruned_loss=0.07834, over 4285511.76 frames. ], batch size: 144, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:28:10,893 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+02 7.249e+02 9.541e+02 1.506e+03 2.955e+03, threshold=1.908e+03, percent-clipped=10.0 2023-06-25 13:28:21,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1996536.0, ans=6.0 2023-06-25 13:28:22,897 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=22.5 2023-06-25 13:28:27,104 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1996596.0, ans=0.1 2023-06-25 13:28:47,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1996656.0, ans=0.125 2023-06-25 13:28:57,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1996656.0, ans=0.0 2023-06-25 13:29:02,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1996656.0, ans=0.125 2023-06-25 13:29:27,201 INFO [train.py:996] (1/4) Epoch 11, batch 27850, loss[loss=0.2111, simple_loss=0.2873, pruned_loss=0.06748, over 21852.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3045, pruned_loss=0.07912, over 4294493.56 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:30:31,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1996896.0, ans=0.125 2023-06-25 13:31:09,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1997016.0, ans=0.125 2023-06-25 13:31:14,002 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-25 13:31:17,912 INFO [train.py:996] (1/4) Epoch 11, batch 27900, loss[loss=0.2467, simple_loss=0.3398, pruned_loss=0.07674, over 21747.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3141, pruned_loss=0.08003, over 4292378.78 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:31:47,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1997136.0, ans=0.0 2023-06-25 13:31:50,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1997136.0, ans=0.125 2023-06-25 13:31:53,162 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 7.594e+02 1.073e+03 1.549e+03 3.110e+03, threshold=2.145e+03, percent-clipped=9.0 2023-06-25 13:32:38,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1997256.0, ans=0.125 2023-06-25 13:32:47,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-25 13:33:12,408 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.86 vs. limit=15.0 2023-06-25 13:33:12,664 INFO [train.py:996] (1/4) Epoch 11, batch 27950, loss[loss=0.2139, simple_loss=0.3232, pruned_loss=0.05226, over 21213.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.316, pruned_loss=0.07735, over 4290236.37 frames. ], batch size: 549, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:33:52,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1997496.0, ans=0.1 2023-06-25 13:34:27,663 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-06-25 13:34:56,457 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1997676.0, ans=0.2 2023-06-25 13:34:57,437 INFO [train.py:996] (1/4) Epoch 11, batch 28000, loss[loss=0.2004, simple_loss=0.297, pruned_loss=0.05192, over 21410.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3131, pruned_loss=0.07425, over 4286126.56 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:35:25,421 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.615e+02 8.690e+02 1.335e+03 1.864e+03 4.176e+03, threshold=2.670e+03, percent-clipped=16.0 2023-06-25 13:35:31,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1997736.0, ans=0.125 2023-06-25 13:35:33,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1997736.0, ans=0.0 2023-06-25 13:35:54,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1997796.0, ans=0.125 2023-06-25 13:35:54,595 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1997796.0, ans=0.125 2023-06-25 13:36:08,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-25 13:36:14,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1997916.0, ans=0.2 2023-06-25 13:36:22,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1997916.0, ans=0.125 2023-06-25 13:36:49,694 INFO [train.py:996] (1/4) Epoch 11, batch 28050, loss[loss=0.2364, simple_loss=0.3355, pruned_loss=0.06871, over 20816.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3104, pruned_loss=0.07551, over 4289070.07 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:36:55,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1997976.0, ans=0.0 2023-06-25 13:37:05,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1998036.0, ans=0.0 2023-06-25 13:38:15,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-25 13:38:37,705 INFO [train.py:996] (1/4) Epoch 11, batch 28100, loss[loss=0.1812, simple_loss=0.245, pruned_loss=0.05872, over 21578.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3055, pruned_loss=0.075, over 4281749.34 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:39:01,527 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 8.283e+02 1.257e+03 1.912e+03 3.792e+03, threshold=2.513e+03, percent-clipped=5.0 2023-06-25 13:39:25,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.00 vs. 
limit=22.5 2023-06-25 13:39:56,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1998456.0, ans=0.125 2023-06-25 13:40:22,785 INFO [train.py:996] (1/4) Epoch 11, batch 28150, loss[loss=0.2309, simple_loss=0.2844, pruned_loss=0.08866, over 21170.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3001, pruned_loss=0.07569, over 4275473.12 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:41:42,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1998756.0, ans=0.125 2023-06-25 13:41:42,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1998756.0, ans=0.1 2023-06-25 13:42:10,486 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-25 13:42:10,830 INFO [train.py:996] (1/4) Epoch 11, batch 28200, loss[loss=0.2549, simple_loss=0.3214, pruned_loss=0.09423, over 21526.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.299, pruned_loss=0.07688, over 4270939.88 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:42:16,792 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1998876.0, ans=0.0 2023-06-25 13:42:42,182 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.056e+02 7.864e+02 1.049e+03 1.647e+03 3.891e+03, threshold=2.099e+03, percent-clipped=11.0 2023-06-25 13:42:47,173 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.23 vs. limit=10.0 2023-06-25 13:43:57,356 INFO [train.py:996] (1/4) Epoch 11, batch 28250, loss[loss=0.2206, simple_loss=0.2827, pruned_loss=0.07924, over 16403.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3016, pruned_loss=0.0793, over 4260586.51 frames. ], batch size: 60, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:44:32,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1999236.0, ans=0.0 2023-06-25 13:45:45,850 INFO [train.py:996] (1/4) Epoch 11, batch 28300, loss[loss=0.1723, simple_loss=0.2704, pruned_loss=0.03712, over 21730.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2986, pruned_loss=0.07659, over 4269279.56 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:46:21,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1999536.0, ans=0.125 2023-06-25 13:46:24,393 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 7.571e+02 1.027e+03 1.599e+03 2.949e+03, threshold=2.054e+03, percent-clipped=6.0 2023-06-25 13:46:31,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1999596.0, ans=0.0 2023-06-25 13:46:42,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-25 13:47:38,015 INFO [train.py:996] (1/4) Epoch 11, batch 28350, loss[loss=0.2273, simple_loss=0.2922, pruned_loss=0.08116, over 21479.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.296, pruned_loss=0.07157, over 4273657.92 frames. 
], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:47:47,237 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1999776.0, ans=0.125 2023-06-25 13:49:24,429 INFO [train.py:996] (1/4) Epoch 11, batch 28400, loss[loss=0.2431, simple_loss=0.3076, pruned_loss=0.08932, over 21656.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2946, pruned_loss=0.07061, over 4266121.10 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:49:57,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.013e+02 9.043e+02 1.474e+03 1.977e+03 3.910e+03, threshold=2.949e+03, percent-clipped=21.0 2023-06-25 13:50:12,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 13:50:48,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2000316.0, ans=0.04949747468305833 2023-06-25 13:51:02,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2000316.0, ans=0.2 2023-06-25 13:51:09,815 INFO [train.py:996] (1/4) Epoch 11, batch 28450, loss[loss=0.2809, simple_loss=0.3412, pruned_loss=0.1103, over 21542.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2986, pruned_loss=0.07397, over 4259023.18 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:51:28,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-25 13:51:30,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2000376.0, ans=0.1 2023-06-25 13:52:51,474 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2000616.0, ans=0.125 2023-06-25 13:53:03,153 INFO [train.py:996] (1/4) Epoch 11, batch 28500, loss[loss=0.266, simple_loss=0.3371, pruned_loss=0.09742, over 21780.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3015, pruned_loss=0.07735, over 4271583.95 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:53:10,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2000676.0, ans=0.125 2023-06-25 13:53:17,946 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-25 13:53:19,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2000736.0, ans=0.0 2023-06-25 13:53:46,228 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 7.652e+02 9.900e+02 1.430e+03 3.378e+03, threshold=1.980e+03, percent-clipped=2.0 2023-06-25 13:54:02,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. 
limit=10.0 2023-06-25 13:54:19,233 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:54:45,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2000916.0, ans=0.1 2023-06-25 13:54:51,352 INFO [train.py:996] (1/4) Epoch 11, batch 28550, loss[loss=0.2049, simple_loss=0.3096, pruned_loss=0.05014, over 20016.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3092, pruned_loss=0.07985, over 4270820.66 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 4.0 2023-06-25 13:55:13,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2001036.0, ans=0.05 2023-06-25 13:55:56,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-25 13:56:06,889 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.37 vs. limit=15.0 2023-06-25 13:56:44,025 INFO [train.py:996] (1/4) Epoch 11, batch 28600, loss[loss=0.2477, simple_loss=0.3093, pruned_loss=0.09302, over 20060.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3156, pruned_loss=0.08207, over 4274214.17 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:57:07,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2001336.0, ans=0.0 2023-06-25 13:57:16,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2001336.0, ans=0.125 2023-06-25 13:57:18,471 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.619e+02 9.941e+02 1.518e+03 3.528e+03, threshold=1.988e+03, percent-clipped=12.0 2023-06-25 13:57:50,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2001456.0, ans=0.125 2023-06-25 13:58:00,533 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-25 13:58:18,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2001516.0, ans=0.1 2023-06-25 13:58:28,357 INFO [train.py:996] (1/4) Epoch 11, batch 28650, loss[loss=0.2239, simple_loss=0.2816, pruned_loss=0.08308, over 21228.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3102, pruned_loss=0.08211, over 4271657.40 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:59:17,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2001696.0, ans=0.2 2023-06-25 13:59:23,083 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-25 14:00:08,196 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-25 14:00:20,291 INFO [train.py:996] (1/4) Epoch 11, batch 28700, loss[loss=0.256, simple_loss=0.3262, pruned_loss=0.09289, over 21357.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3081, pruned_loss=0.08261, over 4275182.24 frames. 
], batch size: 549, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:00:55,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 7.291e+02 9.817e+02 1.860e+03 4.444e+03, threshold=1.963e+03, percent-clipped=16.0 2023-06-25 14:02:03,142 INFO [train.py:996] (1/4) Epoch 11, batch 28750, loss[loss=0.2196, simple_loss=0.3062, pruned_loss=0.0665, over 21633.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3079, pruned_loss=0.08298, over 4283269.08 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:03:37,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2002416.0, ans=0.0 2023-06-25 14:03:48,830 INFO [train.py:996] (1/4) Epoch 11, batch 28800, loss[loss=0.2656, simple_loss=0.3329, pruned_loss=0.09916, over 21614.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.31, pruned_loss=0.08237, over 4286188.13 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:03:58,090 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=15.0 2023-06-25 14:04:29,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.612e+02 1.083e+03 1.492e+03 3.378e+03, threshold=2.166e+03, percent-clipped=11.0 2023-06-25 14:04:43,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-25 14:05:18,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2002716.0, ans=0.125 2023-06-25 14:05:22,353 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.75 vs. limit=10.0 2023-06-25 14:05:29,396 INFO [train.py:996] (1/4) Epoch 11, batch 28850, loss[loss=0.2024, simple_loss=0.2739, pruned_loss=0.06548, over 20962.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3126, pruned_loss=0.0843, over 4284295.78 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:05:35,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2002776.0, ans=0.125 2023-06-25 14:06:18,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2002896.0, ans=0.2 2023-06-25 14:06:22,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2002896.0, ans=0.125 2023-06-25 14:06:25,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=2002896.0, ans=0.2 2023-06-25 14:06:58,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2002956.0, ans=0.0 2023-06-25 14:07:14,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2003016.0, ans=0.0 2023-06-25 14:07:22,656 INFO [train.py:996] (1/4) Epoch 11, batch 28900, loss[loss=0.2528, simple_loss=0.3222, pruned_loss=0.09171, over 21763.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3171, pruned_loss=0.08721, over 4283619.22 frames. 
], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:07:45,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2003136.0, ans=0.0 2023-06-25 14:07:57,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2003136.0, ans=0.125 2023-06-25 14:08:00,120 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.503e+02 7.472e+02 9.917e+02 1.436e+03 2.913e+03, threshold=1.983e+03, percent-clipped=5.0 2023-06-25 14:08:03,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2003196.0, ans=0.0 2023-06-25 14:08:15,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2003196.0, ans=0.0 2023-06-25 14:08:30,247 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2003256.0, ans=0.125 2023-06-25 14:08:44,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2003256.0, ans=0.04949747468305833 2023-06-25 14:09:08,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2003316.0, ans=0.1 2023-06-25 14:09:17,157 INFO [train.py:996] (1/4) Epoch 11, batch 28950, loss[loss=0.2259, simple_loss=0.3115, pruned_loss=0.07014, over 21851.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.318, pruned_loss=0.08584, over 4277418.88 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:09:48,774 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2003436.0, ans=0.2 2023-06-25 14:09:50,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2003436.0, ans=0.1 2023-06-25 14:10:16,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2003496.0, ans=0.125 2023-06-25 14:10:40,111 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2003556.0, ans=0.0 2023-06-25 14:11:06,006 INFO [train.py:996] (1/4) Epoch 11, batch 29000, loss[loss=0.2629, simple_loss=0.3397, pruned_loss=0.09307, over 21788.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.32, pruned_loss=0.08443, over 4270786.59 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:11:12,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-25 14:11:28,639 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2003736.0, ans=0.0 2023-06-25 14:11:46,496 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.59 vs. 
limit=12.0 2023-06-25 14:11:48,366 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.594e+02 1.350e+03 2.116e+03 4.440e+03, threshold=2.700e+03, percent-clipped=27.0 2023-06-25 14:11:58,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2003796.0, ans=0.0 2023-06-25 14:12:52,701 INFO [train.py:996] (1/4) Epoch 11, batch 29050, loss[loss=0.2058, simple_loss=0.2711, pruned_loss=0.07029, over 21168.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3196, pruned_loss=0.08573, over 4277445.23 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:13:10,386 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-25 14:13:11,402 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2003976.0, ans=0.1 2023-06-25 14:13:49,972 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=22.5 2023-06-25 14:14:27,584 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.48 vs. limit=6.0 2023-06-25 14:14:30,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2004216.0, ans=0.2 2023-06-25 14:14:31,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2004216.0, ans=0.125 2023-06-25 14:14:37,995 INFO [train.py:996] (1/4) Epoch 11, batch 29100, loss[loss=0.2086, simple_loss=0.2667, pruned_loss=0.07525, over 21487.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3111, pruned_loss=0.08347, over 4284022.67 frames. ], batch size: 476, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:15:07,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2004336.0, ans=0.125 2023-06-25 14:15:19,694 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 7.507e+02 9.912e+02 1.574e+03 3.418e+03, threshold=1.982e+03, percent-clipped=5.0 2023-06-25 14:15:24,106 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-25 14:15:50,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2004456.0, ans=0.1 2023-06-25 14:15:51,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2004456.0, ans=0.125 2023-06-25 14:16:24,467 INFO [train.py:996] (1/4) Epoch 11, batch 29150, loss[loss=0.2124, simple_loss=0.2854, pruned_loss=0.06973, over 21209.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3106, pruned_loss=0.08228, over 4275320.29 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:16:52,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2004636.0, ans=0.125 2023-06-25 14:17:32,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2004756.0, ans=0.125 2023-06-25 14:17:58,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2004816.0, ans=0.0 2023-06-25 14:18:08,698 INFO [train.py:996] (1/4) Epoch 11, batch 29200, loss[loss=0.2272, simple_loss=0.2855, pruned_loss=0.08446, over 21386.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3065, pruned_loss=0.08165, over 4277067.93 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:18:33,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2004876.0, ans=0.0 2023-06-25 14:18:49,100 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.207e+02 1.113e+03 1.658e+03 3.096e+03, threshold=2.226e+03, percent-clipped=9.0 2023-06-25 14:18:53,271 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2004996.0, ans=0.0 2023-06-25 14:19:01,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2004996.0, ans=0.0 2023-06-25 14:19:17,535 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-25 14:19:22,675 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-25 14:20:00,631 INFO [train.py:996] (1/4) Epoch 11, batch 29250, loss[loss=0.1932, simple_loss=0.2678, pruned_loss=0.05931, over 21723.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3057, pruned_loss=0.07912, over 4270453.43 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:20:17,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-25 14:21:47,252 INFO [train.py:996] (1/4) Epoch 11, batch 29300, loss[loss=0.2375, simple_loss=0.3078, pruned_loss=0.0836, over 21259.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3072, pruned_loss=0.07804, over 4265470.59 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:22:22,867 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2005536.0, ans=0.125 2023-06-25 14:22:25,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.346e+02 1.272e+03 1.765e+03 3.710e+03, threshold=2.544e+03, percent-clipped=11.0 2023-06-25 14:23:03,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2005656.0, ans=0.125 2023-06-25 14:23:12,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. 
limit=22.5 2023-06-25 14:23:21,667 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:23:38,601 INFO [train.py:996] (1/4) Epoch 11, batch 29350, loss[loss=0.2148, simple_loss=0.316, pruned_loss=0.0568, over 21716.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3016, pruned_loss=0.07663, over 4255817.95 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:23:54,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2005836.0, ans=0.025 2023-06-25 14:24:01,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2005836.0, ans=0.0 2023-06-25 14:24:48,806 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2005956.0, ans=0.0 2023-06-25 14:25:03,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2006016.0, ans=0.0 2023-06-25 14:25:26,844 INFO [train.py:996] (1/4) Epoch 11, batch 29400, loss[loss=0.2674, simple_loss=0.344, pruned_loss=0.09539, over 21430.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3012, pruned_loss=0.0746, over 4261311.16 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:25:35,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2006076.0, ans=0.125 2023-06-25 14:26:04,466 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.606e+02 1.280e+03 1.886e+03 3.409e+03, threshold=2.560e+03, percent-clipped=11.0 2023-06-25 14:26:47,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2006256.0, ans=0.0 2023-06-25 14:27:15,005 INFO [train.py:996] (1/4) Epoch 11, batch 29450, loss[loss=0.254, simple_loss=0.3236, pruned_loss=0.09219, over 21688.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2994, pruned_loss=0.07413, over 4266382.58 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:27:22,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2006376.0, ans=0.1 2023-06-25 14:27:40,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006436.0, ans=0.1 2023-06-25 14:28:05,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2006496.0, ans=0.125 2023-06-25 14:29:00,496 INFO [train.py:996] (1/4) Epoch 11, batch 29500, loss[loss=0.245, simple_loss=0.3074, pruned_loss=0.09131, over 21865.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.304, pruned_loss=0.07765, over 4265614.93 frames. ], batch size: 124, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:29:12,202 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=15.0 2023-06-25 14:29:26,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2006736.0, ans=0.2 2023-06-25 14:29:33,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2006736.0, ans=0.07 2023-06-25 14:29:42,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2006736.0, ans=0.2 2023-06-25 14:29:44,979 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.728e+02 1.187e+03 1.757e+03 3.879e+03, threshold=2.373e+03, percent-clipped=3.0 2023-06-25 14:29:45,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-25 14:30:32,915 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-25 14:30:48,753 INFO [train.py:996] (1/4) Epoch 11, batch 29550, loss[loss=0.2376, simple_loss=0.3045, pruned_loss=0.08535, over 21332.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3031, pruned_loss=0.079, over 4274210.51 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:31:35,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-25 14:32:00,281 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2007156.0, ans=0.1 2023-06-25 14:32:03,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2007156.0, ans=0.0 2023-06-25 14:32:39,634 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:32:42,233 INFO [train.py:996] (1/4) Epoch 11, batch 29600, loss[loss=0.2525, simple_loss=0.3364, pruned_loss=0.08435, over 21444.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.309, pruned_loss=0.08078, over 4278261.53 frames. ], batch size: 194, lr: 2.59e-03, grad_scale: 32.0 2023-06-25 14:32:56,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2007276.0, ans=0.2 2023-06-25 14:33:22,207 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.467e+02 8.361e+02 1.294e+03 2.319e+03 6.850e+03, threshold=2.587e+03, percent-clipped=23.0 2023-06-25 14:33:56,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2007456.0, ans=0.2 2023-06-25 14:33:59,258 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2007456.0, ans=0.125 2023-06-25 14:34:06,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2007516.0, ans=0.125 2023-06-25 14:34:09,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2007516.0, ans=0.0 2023-06-25 14:34:27,163 INFO [train.py:996] (1/4) Epoch 11, batch 29650, loss[loss=0.2094, simple_loss=0.2788, pruned_loss=0.06996, over 21573.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3089, pruned_loss=0.07836, over 4274654.83 frames. 
], batch size: 212, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:34:54,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2007636.0, ans=0.125 2023-06-25 14:34:55,502 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 14:35:24,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2007696.0, ans=0.95 2023-06-25 14:35:58,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-25 14:36:16,287 INFO [train.py:996] (1/4) Epoch 11, batch 29700, loss[loss=0.2264, simple_loss=0.3278, pruned_loss=0.06246, over 21423.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.07818, over 4284522.11 frames. ], batch size: 211, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:37:02,470 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.011e+02 9.154e+02 1.304e+03 2.529e+03 6.535e+03, threshold=2.607e+03, percent-clipped=22.0 2023-06-25 14:37:27,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008056.0, ans=0.1 2023-06-25 14:37:36,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2008056.0, ans=0.125 2023-06-25 14:37:40,520 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2008116.0, ans=0.125 2023-06-25 14:37:42,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2008116.0, ans=0.125 2023-06-25 14:38:01,631 INFO [train.py:996] (1/4) Epoch 11, batch 29750, loss[loss=0.1915, simple_loss=0.2761, pruned_loss=0.05347, over 16241.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3144, pruned_loss=0.07772, over 4271965.28 frames. ], batch size: 60, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:39:10,000 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-25 14:39:45,401 INFO [train.py:996] (1/4) Epoch 11, batch 29800, loss[loss=0.2095, simple_loss=0.294, pruned_loss=0.06244, over 21890.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3154, pruned_loss=0.07845, over 4280376.75 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:40:21,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2008536.0, ans=0.125 2023-06-25 14:40:30,983 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.324e+02 8.347e+02 1.266e+03 1.868e+03 3.431e+03, threshold=2.532e+03, percent-clipped=7.0 2023-06-25 14:40:33,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2008596.0, ans=0.0 2023-06-25 14:40:58,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2008656.0, ans=22.5 2023-06-25 14:41:30,275 INFO [train.py:996] (1/4) Epoch 11, batch 29850, loss[loss=0.1803, simple_loss=0.2674, pruned_loss=0.04663, over 20949.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3109, pruned_loss=0.07637, over 4280527.55 frames. ], batch size: 608, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:42:20,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2008896.0, ans=0.125 2023-06-25 14:42:22,292 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2008896.0, ans=0.0 2023-06-25 14:42:59,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2009016.0, ans=0.125 2023-06-25 14:43:16,127 INFO [train.py:996] (1/4) Epoch 11, batch 29900, loss[loss=0.2871, simple_loss=0.3511, pruned_loss=0.1116, over 21647.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3104, pruned_loss=0.07806, over 4285250.24 frames. ], batch size: 389, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:43:36,364 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2009076.0, ans=0.125 2023-06-25 14:43:59,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2009196.0, ans=0.1 2023-06-25 14:44:02,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.086e+02 7.684e+02 1.155e+03 1.725e+03 4.466e+03, threshold=2.311e+03, percent-clipped=10.0 2023-06-25 14:44:14,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2009196.0, ans=22.5 2023-06-25 14:44:50,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2009316.0, ans=0.125 2023-06-25 14:45:08,303 INFO [train.py:996] (1/4) Epoch 11, batch 29950, loss[loss=0.253, simple_loss=0.3186, pruned_loss=0.0937, over 21536.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3149, pruned_loss=0.08215, over 4290165.19 frames. ], batch size: 211, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:46:48,628 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-25 14:46:55,266 INFO [train.py:996] (1/4) Epoch 11, batch 30000, loss[loss=0.2288, simple_loss=0.3235, pruned_loss=0.0671, over 21686.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3155, pruned_loss=0.08139, over 4289126.14 frames. ], batch size: 414, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:46:55,266 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 14:47:11,290 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8844, 3.4772, 2.2847, 1.7879], device='cuda:1') 2023-06-25 14:47:14,794 INFO [train.py:1028] (1/4) Epoch 11, validation: loss=0.2475, simple_loss=0.3451, pruned_loss=0.07497, over 1796401.00 frames. 
2023-06-25 14:47:14,795 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 14:47:30,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2009676.0, ans=0.125 2023-06-25 14:47:48,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2009736.0, ans=0.0 2023-06-25 14:47:51,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2009736.0, ans=0.2 2023-06-25 14:48:03,296 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 8.420e+02 1.340e+03 1.867e+03 3.638e+03, threshold=2.681e+03, percent-clipped=9.0 2023-06-25 14:48:27,714 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2009856.0, ans=0.125 2023-06-25 14:49:03,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2009916.0, ans=0.125 2023-06-25 14:49:16,230 INFO [train.py:996] (1/4) Epoch 11, batch 30050, loss[loss=0.2085, simple_loss=0.3287, pruned_loss=0.0441, over 19807.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3197, pruned_loss=0.07953, over 4274734.74 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:49:27,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2009976.0, ans=0.1 2023-06-25 14:49:49,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010036.0, ans=0.1 2023-06-25 14:49:58,012 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-25 14:49:59,872 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-25 14:50:09,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2010096.0, ans=0.0 2023-06-25 14:50:32,171 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2010156.0, ans=0.0 2023-06-25 14:50:38,992 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-25 14:51:01,291 INFO [train.py:996] (1/4) Epoch 11, batch 30100, loss[loss=0.2356, simple_loss=0.3039, pruned_loss=0.08361, over 21600.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3192, pruned_loss=0.07953, over 4275040.41 frames. 
], batch size: 332, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:51:44,632 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.182e+02 9.788e+02 1.555e+03 2.396e+03 5.388e+03, threshold=3.111e+03, percent-clipped=17.0 2023-06-25 14:51:46,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2010396.0, ans=0.125 2023-06-25 14:52:25,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2010516.0, ans=0.125 2023-06-25 14:52:49,223 INFO [train.py:996] (1/4) Epoch 11, batch 30150, loss[loss=0.3023, simple_loss=0.3548, pruned_loss=0.1249, over 21285.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3148, pruned_loss=0.081, over 4272271.69 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:52:56,991 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-25 14:53:20,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010636.0, ans=0.1 2023-06-25 14:54:16,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2010756.0, ans=0.125 2023-06-25 14:54:17,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2010756.0, ans=0.09899494936611666 2023-06-25 14:54:46,737 INFO [train.py:996] (1/4) Epoch 11, batch 30200, loss[loss=0.2213, simple_loss=0.2969, pruned_loss=0.07286, over 21398.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3165, pruned_loss=0.08006, over 4274165.70 frames. ], batch size: 194, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:55:34,492 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 8.137e+02 1.157e+03 1.769e+03 3.974e+03, threshold=2.314e+03, percent-clipped=2.0 2023-06-25 14:55:52,157 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-25 14:56:22,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2011116.0, ans=0.2 2023-06-25 14:56:30,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2011116.0, ans=0.0 2023-06-25 14:56:34,990 INFO [train.py:996] (1/4) Epoch 11, batch 30250, loss[loss=0.3616, simple_loss=0.4486, pruned_loss=0.1373, over 21412.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3236, pruned_loss=0.08174, over 4275406.50 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:56:51,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.49 vs. limit=10.0 2023-06-25 14:57:31,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2011296.0, ans=0.025 2023-06-25 14:57:32,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.03 vs. 
limit=15.0 2023-06-25 14:57:37,804 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2011296.0, ans=0.2 2023-06-25 14:58:20,276 INFO [train.py:996] (1/4) Epoch 11, batch 30300, loss[loss=0.25, simple_loss=0.3305, pruned_loss=0.08478, over 21686.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3191, pruned_loss=0.08087, over 4268914.40 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:58:20,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2011476.0, ans=0.125 2023-06-25 14:58:45,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2011476.0, ans=0.125 2023-06-25 14:58:45,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2011476.0, ans=0.125 2023-06-25 14:59:05,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2011536.0, ans=22.5 2023-06-25 14:59:14,785 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 1.032e+03 1.380e+03 1.875e+03 4.556e+03, threshold=2.761e+03, percent-clipped=17.0 2023-06-25 14:59:18,819 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2011596.0, ans=0.125 2023-06-25 14:59:46,609 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-25 14:59:55,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2011716.0, ans=0.0 2023-06-25 15:00:21,516 INFO [train.py:996] (1/4) Epoch 11, batch 30350, loss[loss=0.244, simple_loss=0.3376, pruned_loss=0.07518, over 21635.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3187, pruned_loss=0.08168, over 4266729.84 frames. ], batch size: 389, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:00:25,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2011776.0, ans=0.0 2023-06-25 15:00:59,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2011896.0, ans=0.0 2023-06-25 15:01:03,889 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2011896.0, ans=0.125 2023-06-25 15:01:43,087 INFO [train.py:996] (1/4) Epoch 11, batch 30400, loss[loss=0.2208, simple_loss=0.2722, pruned_loss=0.08472, over 20401.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3124, pruned_loss=0.07978, over 4258111.44 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 15:02:24,038 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 1.116e+03 1.633e+03 2.614e+03 1.022e+04, threshold=3.266e+03, percent-clipped=19.0 2023-06-25 15:02:55,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2012316.0, ans=0.1 2023-06-25 15:02:59,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.34 vs. 
limit=12.0 2023-06-25 15:03:11,929 INFO [train.py:996] (1/4) Epoch 11, batch 30450, loss[loss=0.2681, simple_loss=0.3769, pruned_loss=0.0796, over 19917.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3136, pruned_loss=0.07873, over 4199665.66 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:03:16,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2012376.0, ans=0.125 2023-06-25 15:04:03,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2012556.0, ans=0.125 2023-06-25 15:04:06,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2012556.0, ans=0.0 2023-06-25 15:04:12,645 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2012556.0, ans=0.125 2023-06-25 15:06:14,780 INFO [train.py:996] (1/4) Epoch 12, batch 0, loss[loss=0.2399, simple_loss=0.3062, pruned_loss=0.08685, over 21934.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3062, pruned_loss=0.08685, over 21934.00 frames. ], batch size: 103, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:06:14,780 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 15:06:38,471 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.246, simple_loss=0.3509, pruned_loss=0.07057, over 1796401.00 frames. 2023-06-25 15:06:38,472 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 15:06:42,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2012646.0, ans=0.1 2023-06-25 15:06:46,180 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-25 15:07:11,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2012706.0, ans=0.0 2023-06-25 15:07:29,958 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.737e+02 2.108e+03 3.291e+03 4.750e+03 1.246e+04, threshold=6.583e+03, percent-clipped=51.0 2023-06-25 15:07:39,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2012826.0, ans=0.0 2023-06-25 15:07:45,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2012826.0, ans=0.2 2023-06-25 15:08:24,011 INFO [train.py:996] (1/4) Epoch 12, batch 50, loss[loss=0.2689, simple_loss=0.3646, pruned_loss=0.0866, over 21682.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3221, pruned_loss=0.07964, over 967573.05 frames. 
], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:08:26,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2012946.0, ans=0.125 2023-06-25 15:09:08,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2013066.0, ans=0.07 2023-06-25 15:09:33,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2013126.0, ans=0.125 2023-06-25 15:09:44,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2013186.0, ans=0.0 2023-06-25 15:09:58,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2013186.0, ans=0.125 2023-06-25 15:10:07,042 INFO [train.py:996] (1/4) Epoch 12, batch 100, loss[loss=0.2879, simple_loss=0.3711, pruned_loss=0.1023, over 21489.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3358, pruned_loss=0.08178, over 1686440.64 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:10:11,371 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-25 15:10:21,746 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2013306.0, ans=0.125 2023-06-25 15:10:23,881 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-25 15:10:39,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2013306.0, ans=0.0 2023-06-25 15:10:55,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2013366.0, ans=0.125 2023-06-25 15:11:03,472 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.950e+02 8.744e+02 1.296e+03 2.082e+03 4.002e+03, threshold=2.593e+03, percent-clipped=0.0 2023-06-25 15:11:11,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2013426.0, ans=0.125 2023-06-25 15:11:43,853 INFO [train.py:996] (1/4) Epoch 12, batch 150, loss[loss=0.2013, simple_loss=0.28, pruned_loss=0.06133, over 21202.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3325, pruned_loss=0.081, over 2261579.86 frames. 
], batch size: 159, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:11:55,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2013546.0, ans=0.2 2023-06-25 15:12:00,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2013546.0, ans=0.125 2023-06-25 15:12:03,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2013546.0, ans=0.0 2023-06-25 15:12:11,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2013606.0, ans=0.0 2023-06-25 15:13:03,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2013726.0, ans=0.125 2023-06-25 15:13:32,359 INFO [train.py:996] (1/4) Epoch 12, batch 200, loss[loss=0.2701, simple_loss=0.365, pruned_loss=0.08759, over 20710.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3301, pruned_loss=0.07954, over 2704262.53 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:14:16,012 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.608e-02 2023-06-25 15:14:31,668 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 8.933e+02 1.301e+03 1.792e+03 3.949e+03, threshold=2.602e+03, percent-clipped=5.0 2023-06-25 15:15:02,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2014086.0, ans=0.125 2023-06-25 15:15:20,324 INFO [train.py:996] (1/4) Epoch 12, batch 250, loss[loss=0.2465, simple_loss=0.3188, pruned_loss=0.08712, over 21820.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3255, pruned_loss=0.07965, over 3058345.61 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:15:22,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2014146.0, ans=0.0 2023-06-25 15:16:13,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2014266.0, ans=0.0 2023-06-25 15:16:23,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2014326.0, ans=0.0 2023-06-25 15:16:54,332 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2014386.0, ans=0.125 2023-06-25 15:17:00,566 INFO [train.py:996] (1/4) Epoch 12, batch 300, loss[loss=0.2799, simple_loss=0.3423, pruned_loss=0.1088, over 21795.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3197, pruned_loss=0.07915, over 3326307.71 frames. ], batch size: 391, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:17:21,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2014446.0, ans=0.125 2023-06-25 15:17:45,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2014566.0, ans=0.2 2023-06-25 15:17:56,401 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. 
limit=15.0 2023-06-25 15:18:01,870 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 8.285e+02 1.102e+03 1.636e+03 4.756e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-25 15:18:41,562 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-25 15:18:49,628 INFO [train.py:996] (1/4) Epoch 12, batch 350, loss[loss=0.2264, simple_loss=0.3025, pruned_loss=0.07511, over 21624.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3143, pruned_loss=0.07833, over 3532837.91 frames. ], batch size: 415, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:19:27,503 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.73 vs. limit=15.0 2023-06-25 15:19:57,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2014926.0, ans=0.0 2023-06-25 15:20:09,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2014926.0, ans=0.0 2023-06-25 15:20:37,709 INFO [train.py:996] (1/4) Epoch 12, batch 400, loss[loss=0.2004, simple_loss=0.2692, pruned_loss=0.06575, over 21706.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3061, pruned_loss=0.07725, over 3697835.02 frames. ], batch size: 333, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:21:02,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2015106.0, ans=0.125 2023-06-25 15:21:37,425 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.810e+02 9.305e+02 1.280e+03 1.857e+03 4.239e+03, threshold=2.560e+03, percent-clipped=17.0 2023-06-25 15:22:08,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2015286.0, ans=10.0 2023-06-25 15:22:24,711 INFO [train.py:996] (1/4) Epoch 12, batch 450, loss[loss=0.3159, simple_loss=0.4138, pruned_loss=0.1089, over 21798.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.304, pruned_loss=0.07649, over 3827842.21 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:22:32,010 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2015346.0, ans=0.125 2023-06-25 15:22:47,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2015406.0, ans=0.0 2023-06-25 15:23:06,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2015406.0, ans=0.0 2023-06-25 15:23:42,635 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=22.5 2023-06-25 15:23:45,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-25 15:24:15,438 INFO [train.py:996] (1/4) Epoch 12, batch 500, loss[loss=0.2738, simple_loss=0.3762, pruned_loss=0.08571, over 21772.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3072, pruned_loss=0.07594, over 3931454.61 frames. 
], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:24:22,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2015646.0, ans=0.125 2023-06-25 15:25:03,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2015766.0, ans=0.125 2023-06-25 15:25:14,865 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.486e+02 9.539e+02 1.426e+03 2.120e+03 6.298e+03, threshold=2.852e+03, percent-clipped=19.0 2023-06-25 15:26:02,514 INFO [train.py:996] (1/4) Epoch 12, batch 550, loss[loss=0.2472, simple_loss=0.3117, pruned_loss=0.09133, over 21879.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3123, pruned_loss=0.07598, over 4015018.84 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:26:05,259 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-25 15:26:42,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2016066.0, ans=0.2 2023-06-25 15:26:44,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-06-25 15:27:27,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2016186.0, ans=0.125 2023-06-25 15:27:49,019 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=22.5 2023-06-25 15:27:49,320 INFO [train.py:996] (1/4) Epoch 12, batch 600, loss[loss=0.2658, simple_loss=0.3794, pruned_loss=0.07608, over 21252.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.316, pruned_loss=0.07653, over 4071713.02 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:28:49,903 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.802e+02 1.034e+03 1.598e+03 2.241e+03 5.970e+03, threshold=3.196e+03, percent-clipped=11.0 2023-06-25 15:29:38,783 INFO [train.py:996] (1/4) Epoch 12, batch 650, loss[loss=0.2573, simple_loss=0.344, pruned_loss=0.08527, over 21719.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3159, pruned_loss=0.07761, over 4125668.45 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:29:53,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2016546.0, ans=0.0 2023-06-25 15:30:18,366 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:30:36,634 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2016666.0, ans=0.1 2023-06-25 15:30:47,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-25 15:31:28,089 INFO [train.py:996] (1/4) Epoch 12, batch 700, loss[loss=0.3408, simple_loss=0.4223, pruned_loss=0.1296, over 21513.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3137, pruned_loss=0.07767, over 4167117.23 frames. 
], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:31:35,228 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2016846.0, ans=0.0 2023-06-25 15:32:25,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2016966.0, ans=0.0 2023-06-25 15:32:28,837 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 8.196e+02 1.234e+03 2.056e+03 5.759e+03, threshold=2.467e+03, percent-clipped=11.0 2023-06-25 15:32:40,300 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-25 15:33:12,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2017086.0, ans=0.0 2023-06-25 15:33:16,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=12.0 2023-06-25 15:33:16,920 INFO [train.py:996] (1/4) Epoch 12, batch 750, loss[loss=0.1929, simple_loss=0.3273, pruned_loss=0.02927, over 19848.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3138, pruned_loss=0.07879, over 4199042.09 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:34:19,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2017326.0, ans=0.0 2023-06-25 15:34:24,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2017326.0, ans=0.04949747468305833 2023-06-25 15:34:34,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2017326.0, ans=0.125 2023-06-25 15:34:49,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2017386.0, ans=0.125 2023-06-25 15:35:08,433 INFO [train.py:996] (1/4) Epoch 12, batch 800, loss[loss=0.1937, simple_loss=0.2591, pruned_loss=0.06416, over 16232.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.31, pruned_loss=0.07885, over 4206075.49 frames. ], batch size: 60, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:35:41,113 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2017506.0, ans=0.125 2023-06-25 15:35:45,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2017566.0, ans=0.07 2023-06-25 15:36:10,822 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.338e+02 9.265e+02 1.322e+03 1.960e+03 3.991e+03, threshold=2.645e+03, percent-clipped=16.0 2023-06-25 15:36:58,035 INFO [train.py:996] (1/4) Epoch 12, batch 850, loss[loss=0.2386, simple_loss=0.2979, pruned_loss=0.08965, over 21247.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3074, pruned_loss=0.0792, over 4224720.22 frames. 
], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:37:24,510 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2017806.0, ans=0.0 2023-06-25 15:37:34,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017806.0, ans=0.125 2023-06-25 15:37:51,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2017866.0, ans=0.125 2023-06-25 15:37:56,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2017866.0, ans=0.125 2023-06-25 15:38:05,747 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2017926.0, ans=0.0 2023-06-25 15:38:46,359 INFO [train.py:996] (1/4) Epoch 12, batch 900, loss[loss=0.2288, simple_loss=0.3187, pruned_loss=0.06946, over 21796.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3067, pruned_loss=0.07887, over 4239225.33 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:38:49,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-25 15:38:50,586 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2018046.0, ans=0.2 2023-06-25 15:39:49,736 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.898e+02 1.048e+03 1.588e+03 2.681e+03 4.714e+03, threshold=3.177e+03, percent-clipped=25.0 2023-06-25 15:40:09,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018226.0, ans=0.1 2023-06-25 15:40:30,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2018286.0, ans=0.07 2023-06-25 15:40:34,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2018286.0, ans=0.0 2023-06-25 15:40:37,422 INFO [train.py:996] (1/4) Epoch 12, batch 950, loss[loss=0.2406, simple_loss=0.3068, pruned_loss=0.08717, over 21696.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3043, pruned_loss=0.07725, over 4250980.97 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:41:09,413 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2018406.0, ans=0.5 2023-06-25 15:41:21,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2018466.0, ans=0.0 2023-06-25 15:41:30,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2018466.0, ans=0.0 2023-06-25 15:41:41,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2018526.0, ans=0.125 2023-06-25 15:41:44,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.36 vs. limit=10.0 2023-06-25 15:42:25,957 INFO [train.py:996] (1/4) Epoch 12, batch 1000, loss[loss=0.2292, simple_loss=0.3125, pruned_loss=0.07292, over 21899.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3061, pruned_loss=0.07768, over 4263164.99 frames. 
], batch size: 316, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:42:57,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2018706.0, ans=0.125 2023-06-25 15:43:03,978 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018706.0, ans=0.1 2023-06-25 15:43:06,286 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-25 15:43:29,123 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.331e+02 1.352e+03 1.940e+03 3.326e+03, threshold=2.703e+03, percent-clipped=1.0 2023-06-25 15:43:30,448 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-25 15:43:42,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2018826.0, ans=0.0 2023-06-25 15:43:57,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2018826.0, ans=0.025 2023-06-25 15:43:57,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2018826.0, ans=0.125 2023-06-25 15:44:21,949 INFO [train.py:996] (1/4) Epoch 12, batch 1050, loss[loss=0.3714, simple_loss=0.4111, pruned_loss=0.1659, over 21420.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3057, pruned_loss=0.07706, over 4267862.90 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:44:52,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2019006.0, ans=0.125 2023-06-25 15:45:20,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2019126.0, ans=0.125 2023-06-25 15:45:50,758 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:46:02,193 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.22 vs. limit=15.0 2023-06-25 15:46:13,564 INFO [train.py:996] (1/4) Epoch 12, batch 1100, loss[loss=0.2006, simple_loss=0.2777, pruned_loss=0.06175, over 21255.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.305, pruned_loss=0.07618, over 4272242.51 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:46:35,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2019246.0, ans=0.125 2023-06-25 15:46:57,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019366.0, ans=0.1 2023-06-25 15:47:06,606 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.39 vs. 
limit=10.0 2023-06-25 15:47:12,025 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.833e+02 8.143e+02 1.241e+03 1.792e+03 5.093e+03, threshold=2.482e+03, percent-clipped=8.0 2023-06-25 15:47:59,976 INFO [train.py:996] (1/4) Epoch 12, batch 1150, loss[loss=0.2394, simple_loss=0.3073, pruned_loss=0.08577, over 21258.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3042, pruned_loss=0.07588, over 4272599.73 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:48:11,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019546.0, ans=0.1 2023-06-25 15:49:30,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2019726.0, ans=0.0 2023-06-25 15:49:43,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-25 15:49:56,070 INFO [train.py:996] (1/4) Epoch 12, batch 1200, loss[loss=0.2486, simple_loss=0.3295, pruned_loss=0.08387, over 21950.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3054, pruned_loss=0.07641, over 4280241.77 frames. ], batch size: 372, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:50:07,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-25 15:50:25,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2019906.0, ans=0.0 2023-06-25 15:51:00,444 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.676e+02 8.295e+02 1.207e+03 1.708e+03 3.534e+03, threshold=2.414e+03, percent-clipped=4.0 2023-06-25 15:51:47,230 INFO [train.py:996] (1/4) Epoch 12, batch 1250, loss[loss=0.2493, simple_loss=0.3175, pruned_loss=0.09054, over 21784.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3098, pruned_loss=0.07824, over 4288451.31 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:52:32,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-25 15:53:36,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2020386.0, ans=0.125 2023-06-25 15:53:38,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020386.0, ans=0.1 2023-06-25 15:53:41,563 INFO [train.py:996] (1/4) Epoch 12, batch 1300, loss[loss=0.2102, simple_loss=0.2862, pruned_loss=0.06709, over 21692.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3123, pruned_loss=0.07959, over 4296261.88 frames. 
], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:54:40,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2020566.0, ans=0.125 2023-06-25 15:54:42,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2020566.0, ans=0.0 2023-06-25 15:54:46,924 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.616e+02 1.419e+03 1.872e+03 2.553e+03 5.619e+03, threshold=3.744e+03, percent-clipped=29.0 2023-06-25 15:55:08,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2020686.0, ans=0.2 2023-06-25 15:55:30,066 INFO [train.py:996] (1/4) Epoch 12, batch 1350, loss[loss=0.2842, simple_loss=0.364, pruned_loss=0.1022, over 21504.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3135, pruned_loss=0.08005, over 4293694.45 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:56:50,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2020926.0, ans=0.2 2023-06-25 15:57:00,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2020986.0, ans=0.0 2023-06-25 15:57:18,663 INFO [train.py:996] (1/4) Epoch 12, batch 1400, loss[loss=0.2257, simple_loss=0.2937, pruned_loss=0.07891, over 21483.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3105, pruned_loss=0.07956, over 4290914.95 frames. ], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:57:29,887 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-25 15:57:43,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2021106.0, ans=0.2 2023-06-25 15:58:13,059 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 15:58:16,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2021166.0, ans=0.0 2023-06-25 15:58:29,200 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-25 15:58:29,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2021226.0, ans=0.0 2023-06-25 15:58:31,011 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 6.812e+02 9.835e+02 1.527e+03 2.832e+03, threshold=1.967e+03, percent-clipped=0.0 2023-06-25 15:58:48,900 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-25 15:59:07,937 INFO [train.py:996] (1/4) Epoch 12, batch 1450, loss[loss=0.2482, simple_loss=0.3192, pruned_loss=0.08861, over 21824.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3093, pruned_loss=0.07933, over 4286333.40 frames. 
], batch size: 282, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:59:17,148 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2021346.0, ans=0.125 2023-06-25 15:59:26,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2021406.0, ans=0.125 2023-06-25 15:59:34,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 15:59:45,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=8.0 2023-06-25 16:00:45,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2021586.0, ans=0.125 2023-06-25 16:00:56,921 INFO [train.py:996] (1/4) Epoch 12, batch 1500, loss[loss=0.2354, simple_loss=0.3047, pruned_loss=0.08309, over 17330.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3111, pruned_loss=0.0809, over 4281156.47 frames. ], batch size: 60, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:01:10,318 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=22.5 2023-06-25 16:02:04,723 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.651e+02 8.922e+02 1.288e+03 1.827e+03 4.851e+03, threshold=2.577e+03, percent-clipped=21.0 2023-06-25 16:02:43,239 INFO [train.py:996] (1/4) Epoch 12, batch 1550, loss[loss=0.2127, simple_loss=0.2928, pruned_loss=0.06627, over 20992.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3111, pruned_loss=0.08096, over 4283647.70 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:02:51,350 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2021946.0, ans=0.125 2023-06-25 16:03:43,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2022066.0, ans=0.125 2023-06-25 16:04:06,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2022126.0, ans=0.0 2023-06-25 16:04:27,581 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2022186.0, ans=0.125 2023-06-25 16:04:32,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2022186.0, ans=0.0 2023-06-25 16:04:36,333 INFO [train.py:996] (1/4) Epoch 12, batch 1600, loss[loss=0.2469, simple_loss=0.3254, pruned_loss=0.08426, over 21624.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3097, pruned_loss=0.07934, over 4280012.81 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:06:00,545 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.608e+02 1.362e+03 1.874e+03 5.231e+03, threshold=2.724e+03, percent-clipped=11.0 2023-06-25 16:06:29,241 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-25 16:06:38,408 INFO [train.py:996] (1/4) Epoch 12, batch 1650, loss[loss=0.209, simple_loss=0.2869, pruned_loss=0.06552, over 21212.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3063, pruned_loss=0.07742, over 4276971.96 frames. 
], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:07:10,778 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:07:10,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2022606.0, ans=0.125 2023-06-25 16:07:53,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=15.0 2023-06-25 16:08:31,822 INFO [train.py:996] (1/4) Epoch 12, batch 1700, loss[loss=0.2854, simple_loss=0.3691, pruned_loss=0.1009, over 21855.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3098, pruned_loss=0.07896, over 4284122.94 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:08:57,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2022906.0, ans=0.0 2023-06-25 16:09:46,201 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.72 vs. limit=15.0 2023-06-25 16:09:48,462 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 9.466e+02 1.193e+03 1.792e+03 3.263e+03, threshold=2.387e+03, percent-clipped=3.0 2023-06-25 16:10:26,081 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:10:32,430 INFO [train.py:996] (1/4) Epoch 12, batch 1750, loss[loss=0.244, simple_loss=0.3366, pruned_loss=0.07571, over 21454.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3108, pruned_loss=0.0782, over 4283788.70 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:10:47,307 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:11:33,191 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2023266.0, ans=0.125 2023-06-25 16:11:41,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023326.0, ans=0.1 2023-06-25 16:12:29,643 INFO [train.py:996] (1/4) Epoch 12, batch 1800, loss[loss=0.2124, simple_loss=0.3077, pruned_loss=0.05854, over 21359.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3083, pruned_loss=0.07495, over 4272960.14 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:13:09,749 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2023566.0, ans=0.125 2023-06-25 16:13:34,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2023626.0, ans=0.07 2023-06-25 16:13:40,953 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.675e+02 9.606e+02 1.409e+03 2.077e+03 5.009e+03, threshold=2.818e+03, percent-clipped=17.0 2023-06-25 16:14:21,237 INFO [train.py:996] (1/4) Epoch 12, batch 1850, loss[loss=0.2574, simple_loss=0.3301, pruned_loss=0.09236, over 21600.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3092, pruned_loss=0.0733, over 4271732.10 frames. 
], batch size: 263, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:14:42,532 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2023746.0, ans=0.025 2023-06-25 16:14:56,768 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2023806.0, ans=0.0 2023-06-25 16:14:56,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2023806.0, ans=0.035 2023-06-25 16:15:09,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2023866.0, ans=0.2 2023-06-25 16:16:02,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2023986.0, ans=0.09899494936611666 2023-06-25 16:16:21,198 INFO [train.py:996] (1/4) Epoch 12, batch 1900, loss[loss=0.2892, simple_loss=0.3936, pruned_loss=0.09237, over 20842.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3115, pruned_loss=0.07424, over 4276090.92 frames. ], batch size: 607, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:16:22,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-25 16:16:32,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2024046.0, ans=0.125 2023-06-25 16:17:11,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2024166.0, ans=0.125 2023-06-25 16:17:30,679 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 9.175e+02 1.448e+03 2.003e+03 3.751e+03, threshold=2.896e+03, percent-clipped=10.0 2023-06-25 16:17:32,035 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=15.0 2023-06-25 16:17:50,431 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2024286.0, ans=0.2 2023-06-25 16:18:05,283 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:18:12,828 INFO [train.py:996] (1/4) Epoch 12, batch 1950, loss[loss=0.1889, simple_loss=0.2578, pruned_loss=0.05998, over 21655.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3076, pruned_loss=0.07413, over 4280282.41 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:18:18,843 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2024346.0, ans=0.2 2023-06-25 16:19:08,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2024466.0, ans=0.0 2023-06-25 16:20:00,041 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-25 16:20:06,014 INFO [train.py:996] (1/4) Epoch 12, batch 2000, loss[loss=0.1831, simple_loss=0.2484, pruned_loss=0.05886, over 21329.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3042, pruned_loss=0.0728, over 4280518.47 frames. 
], batch size: 131, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:20:48,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2024766.0, ans=0.2 2023-06-25 16:20:53,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2024766.0, ans=0.0 2023-06-25 16:21:15,644 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.911e+02 1.522e+03 2.173e+03 4.229e+03, threshold=3.044e+03, percent-clipped=10.0 2023-06-25 16:21:46,198 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-25 16:21:57,004 INFO [train.py:996] (1/4) Epoch 12, batch 2050, loss[loss=0.1841, simple_loss=0.2521, pruned_loss=0.05802, over 21274.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3073, pruned_loss=0.07473, over 4278064.75 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:22:00,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2024946.0, ans=0.0 2023-06-25 16:22:10,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2024946.0, ans=0.125 2023-06-25 16:22:37,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.50 vs. limit=10.0 2023-06-25 16:22:42,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2025066.0, ans=0.0 2023-06-25 16:23:27,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2025186.0, ans=0.2 2023-06-25 16:23:44,378 INFO [train.py:996] (1/4) Epoch 12, batch 2100, loss[loss=0.2033, simple_loss=0.2885, pruned_loss=0.05909, over 21836.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.31, pruned_loss=0.07584, over 4285559.86 frames. ], batch size: 102, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:23:50,393 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2025246.0, ans=0.1 2023-06-25 16:23:55,888 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:23:59,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2025246.0, ans=0.125 2023-06-25 16:24:57,641 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.106e+02 8.786e+02 1.413e+03 2.170e+03 3.783e+03, threshold=2.827e+03, percent-clipped=9.0 2023-06-25 16:25:06,446 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2025426.0, ans=0.1 2023-06-25 16:25:38,367 INFO [train.py:996] (1/4) Epoch 12, batch 2150, loss[loss=0.2117, simple_loss=0.2786, pruned_loss=0.07242, over 21582.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3097, pruned_loss=0.0774, over 4283004.16 frames. 
], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:25:58,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2025606.0, ans=0.125 2023-06-25 16:26:26,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025666.0, ans=0.1 2023-06-25 16:26:29,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-25 16:26:32,106 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2025666.0, ans=0.125 2023-06-25 16:27:01,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2025726.0, ans=0.1 2023-06-25 16:27:31,516 INFO [train.py:996] (1/4) Epoch 12, batch 2200, loss[loss=0.1815, simple_loss=0.2608, pruned_loss=0.05111, over 21395.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3098, pruned_loss=0.07696, over 4282540.95 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:27:37,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2025846.0, ans=0.0 2023-06-25 16:28:22,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-25 16:28:46,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.633e+02 9.347e+02 1.379e+03 2.152e+03 4.543e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 16:29:23,290 INFO [train.py:996] (1/4) Epoch 12, batch 2250, loss[loss=0.2536, simple_loss=0.3182, pruned_loss=0.09449, over 21664.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3072, pruned_loss=0.07562, over 4280927.31 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:29:58,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2026206.0, ans=0.2 2023-06-25 16:30:13,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2026266.0, ans=0.125 2023-06-25 16:30:26,333 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2026266.0, ans=0.125 2023-06-25 16:30:43,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2026326.0, ans=0.125 2023-06-25 16:30:43,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2026326.0, ans=0.0 2023-06-25 16:31:01,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2026386.0, ans=0.125 2023-06-25 16:31:06,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2026386.0, ans=0.125 2023-06-25 16:31:15,001 INFO [train.py:996] (1/4) Epoch 12, batch 2300, loss[loss=0.1883, simple_loss=0.2514, pruned_loss=0.06259, over 21228.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3016, pruned_loss=0.07484, over 4284349.04 frames. 
], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:31:31,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2026506.0, ans=0.125 2023-06-25 16:31:32,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2026506.0, ans=0.0 2023-06-25 16:31:32,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2026506.0, ans=0.125 2023-06-25 16:32:31,110 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.645e+02 1.249e+03 1.812e+03 4.519e+03, threshold=2.497e+03, percent-clipped=11.0 2023-06-25 16:33:07,256 INFO [train.py:996] (1/4) Epoch 12, batch 2350, loss[loss=0.2491, simple_loss=0.3138, pruned_loss=0.09222, over 21887.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.298, pruned_loss=0.07576, over 4282231.85 frames. ], batch size: 317, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:34:53,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2026986.0, ans=0.0 2023-06-25 16:34:59,684 INFO [train.py:996] (1/4) Epoch 12, batch 2400, loss[loss=0.2763, simple_loss=0.3479, pruned_loss=0.1023, over 21307.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3032, pruned_loss=0.0787, over 4268687.36 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:35:24,459 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2027106.0, ans=0.0 2023-06-25 16:35:24,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2027106.0, ans=0.0 2023-06-25 16:35:52,312 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2027166.0, ans=0.2 2023-06-25 16:36:23,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2027226.0, ans=0.125 2023-06-25 16:36:28,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.583e+02 1.300e+03 1.930e+03 5.128e+03, threshold=2.600e+03, percent-clipped=13.0 2023-06-25 16:36:33,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2027226.0, ans=0.125 2023-06-25 16:36:48,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=12.0 2023-06-25 16:37:04,200 INFO [train.py:996] (1/4) Epoch 12, batch 2450, loss[loss=0.211, simple_loss=0.2777, pruned_loss=0.07216, over 21639.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3095, pruned_loss=0.08087, over 4265387.86 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:37:10,158 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-25 16:38:24,365 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2027526.0, ans=0.125 2023-06-25 16:38:26,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2027526.0, ans=10.0 2023-06-25 16:38:26,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2027526.0, ans=0.125 2023-06-25 16:38:28,266 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.18 vs. limit=15.0 2023-06-25 16:38:36,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2027586.0, ans=0.125 2023-06-25 16:38:54,301 INFO [train.py:996] (1/4) Epoch 12, batch 2500, loss[loss=0.2238, simple_loss=0.2896, pruned_loss=0.07903, over 21867.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3067, pruned_loss=0.0795, over 4271287.88 frames. ], batch size: 98, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:39:34,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2027766.0, ans=0.125 2023-06-25 16:40:11,315 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.584e+02 1.091e+03 1.592e+03 2.289e+03 5.240e+03, threshold=3.184e+03, percent-clipped=19.0 2023-06-25 16:40:44,931 INFO [train.py:996] (1/4) Epoch 12, batch 2550, loss[loss=0.1878, simple_loss=0.2661, pruned_loss=0.05471, over 21371.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3041, pruned_loss=0.0782, over 4264598.12 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:40:52,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2027946.0, ans=0.0 2023-06-25 16:41:07,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2028006.0, ans=0.5 2023-06-25 16:41:33,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2028066.0, ans=0.0 2023-06-25 16:41:52,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2028126.0, ans=0.2 2023-06-25 16:42:17,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2028186.0, ans=0.125 2023-06-25 16:42:26,415 INFO [train.py:996] (1/4) Epoch 12, batch 2600, loss[loss=0.1825, simple_loss=0.2497, pruned_loss=0.05764, over 21623.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3054, pruned_loss=0.07963, over 4263803.15 frames. 
], batch size: 264, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:42:36,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2028246.0, ans=0.125 2023-06-25 16:42:46,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2028246.0, ans=0.0 2023-06-25 16:43:00,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2028306.0, ans=0.2 2023-06-25 16:43:45,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2028426.0, ans=0.2 2023-06-25 16:43:45,661 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 16:43:49,908 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.618e+02 9.833e+02 1.370e+03 2.330e+03 4.697e+03, threshold=2.739e+03, percent-clipped=12.0 2023-06-25 16:43:52,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2028426.0, ans=0.1 2023-06-25 16:44:11,918 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-25 16:44:25,801 INFO [train.py:996] (1/4) Epoch 12, batch 2650, loss[loss=0.2709, simple_loss=0.3475, pruned_loss=0.0971, over 21835.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3068, pruned_loss=0.08129, over 4270078.69 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:44:26,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2028546.0, ans=0.0 2023-06-25 16:44:39,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2028546.0, ans=0.2 2023-06-25 16:46:02,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2028786.0, ans=0.125 2023-06-25 16:46:14,948 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-25 16:46:18,789 INFO [train.py:996] (1/4) Epoch 12, batch 2700, loss[loss=0.2007, simple_loss=0.2713, pruned_loss=0.06504, over 21763.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.304, pruned_loss=0.07971, over 4279869.08 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:46:23,461 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-25 16:46:34,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2028906.0, ans=0.5 2023-06-25 16:46:39,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2028906.0, ans=0.125 2023-06-25 16:47:35,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.499e+02 8.174e+02 1.316e+03 1.880e+03 3.948e+03, threshold=2.631e+03, percent-clipped=11.0 2023-06-25 16:48:09,328 INFO [train.py:996] (1/4) Epoch 12, batch 2750, loss[loss=0.2515, simple_loss=0.3253, pruned_loss=0.08889, over 21770.00 frames. 
], tot_loss[loss=0.2307, simple_loss=0.3035, pruned_loss=0.07897, over 4283636.61 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:48:31,629 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2029206.0, ans=0.2 2023-06-25 16:49:18,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2029326.0, ans=0.125 2023-06-25 16:49:54,781 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-25 16:50:00,323 INFO [train.py:996] (1/4) Epoch 12, batch 2800, loss[loss=0.2226, simple_loss=0.2948, pruned_loss=0.07515, over 21583.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3092, pruned_loss=0.07957, over 4288427.31 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:50:46,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2029506.0, ans=0.0 2023-06-25 16:51:26,714 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.005e+03 1.362e+03 2.253e+03 4.999e+03, threshold=2.724e+03, percent-clipped=18.0 2023-06-25 16:51:48,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2029686.0, ans=0.1 2023-06-25 16:51:53,167 INFO [train.py:996] (1/4) Epoch 12, batch 2850, loss[loss=0.2184, simple_loss=0.2989, pruned_loss=0.06894, over 21602.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3128, pruned_loss=0.08131, over 4280240.55 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:51:54,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-25 16:53:00,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2029926.0, ans=0.0 2023-06-25 16:53:07,728 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-25 16:53:38,307 INFO [train.py:996] (1/4) Epoch 12, batch 2900, loss[loss=0.2687, simple_loss=0.3661, pruned_loss=0.08564, over 21364.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3087, pruned_loss=0.08002, over 4280969.89 frames. 
], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:53:44,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2030046.0, ans=0.125 2023-06-25 16:54:29,182 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030166.0, ans=0.1 2023-06-25 16:54:30,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030166.0, ans=0.1 2023-06-25 16:54:56,898 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.061e+02 9.423e+02 1.341e+03 2.226e+03 4.607e+03, threshold=2.681e+03, percent-clipped=12.0 2023-06-25 16:54:57,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2030226.0, ans=0.0 2023-06-25 16:55:28,745 INFO [train.py:996] (1/4) Epoch 12, batch 2950, loss[loss=0.256, simple_loss=0.3489, pruned_loss=0.08156, over 21883.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3104, pruned_loss=0.08005, over 4291019.22 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:57:14,435 INFO [train.py:996] (1/4) Epoch 12, batch 3000, loss[loss=0.2679, simple_loss=0.3466, pruned_loss=0.09459, over 21602.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3152, pruned_loss=0.08082, over 4293273.09 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:57:14,436 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 16:57:41,085 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2513, simple_loss=0.3439, pruned_loss=0.07939, over 1796401.00 frames. 2023-06-25 16:57:41,085 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 16:57:52,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2030646.0, ans=0.04949747468305833 2023-06-25 16:57:59,838 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2030646.0, ans=0.125 2023-06-25 16:58:52,814 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.014e+02 9.167e+02 1.270e+03 1.811e+03 4.329e+03, threshold=2.541e+03, percent-clipped=6.0 2023-06-25 16:59:15,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2030886.0, ans=0.0 2023-06-25 16:59:25,860 INFO [train.py:996] (1/4) Epoch 12, batch 3050, loss[loss=0.2017, simple_loss=0.2788, pruned_loss=0.0623, over 20823.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3148, pruned_loss=0.0791, over 4289078.02 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:00:28,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2031126.0, ans=0.125 2023-06-25 17:00:29,112 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-25 17:01:13,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031186.0, ans=0.1 2023-06-25 17:01:18,216 INFO [train.py:996] (1/4) Epoch 12, batch 3100, loss[loss=0.2059, simple_loss=0.2796, pruned_loss=0.06614, over 21371.00 frames. 
], tot_loss[loss=0.2357, simple_loss=0.3148, pruned_loss=0.07832, over 4282605.54 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:01:30,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031246.0, ans=0.1 2023-06-25 17:01:39,337 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2031306.0, ans=0.0 2023-06-25 17:02:07,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2031366.0, ans=0.125 2023-06-25 17:02:17,478 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2031426.0, ans=0.04949747468305833 2023-06-25 17:02:17,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2031426.0, ans=0.05 2023-06-25 17:02:25,538 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.476e+02 8.628e+02 1.556e+03 2.270e+03 3.749e+03, threshold=3.112e+03, percent-clipped=16.0 2023-06-25 17:03:06,706 INFO [train.py:996] (1/4) Epoch 12, batch 3150, loss[loss=0.3033, simple_loss=0.3734, pruned_loss=0.1166, over 21866.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3156, pruned_loss=0.07848, over 4281174.81 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:03:17,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2031546.0, ans=0.2 2023-06-25 17:05:01,259 INFO [train.py:996] (1/4) Epoch 12, batch 3200, loss[loss=0.2098, simple_loss=0.306, pruned_loss=0.05679, over 21911.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3165, pruned_loss=0.07847, over 4282029.99 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:05:30,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2031906.0, ans=0.0 2023-06-25 17:05:37,930 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:06:03,932 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2031966.0, ans=0.125 2023-06-25 17:06:21,716 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.588e+02 9.417e+02 1.305e+03 2.002e+03 3.314e+03, threshold=2.610e+03, percent-clipped=4.0 2023-06-25 17:06:45,607 INFO [train.py:996] (1/4) Epoch 12, batch 3250, loss[loss=0.2926, simple_loss=0.3323, pruned_loss=0.1265, over 21450.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3179, pruned_loss=0.08022, over 4275085.20 frames. ], batch size: 510, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:08:13,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2032386.0, ans=0.125 2023-06-25 17:08:28,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2032386.0, ans=0.2 2023-06-25 17:08:38,545 INFO [train.py:996] (1/4) Epoch 12, batch 3300, loss[loss=0.2087, simple_loss=0.2726, pruned_loss=0.07244, over 15571.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3122, pruned_loss=0.0798, over 4274341.11 frames. 
], batch size: 60, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:08:51,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2032446.0, ans=0.125 2023-06-25 17:09:26,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2032566.0, ans=0.125 2023-06-25 17:09:56,061 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.857e+02 1.384e+03 2.131e+03 4.581e+03, threshold=2.768e+03, percent-clipped=14.0 2023-06-25 17:10:26,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2032746.0, ans=0.05 2023-06-25 17:10:28,068 INFO [train.py:996] (1/4) Epoch 12, batch 3350, loss[loss=0.2191, simple_loss=0.2974, pruned_loss=0.07042, over 21658.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3146, pruned_loss=0.07997, over 4272880.31 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:10:32,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032746.0, ans=0.1 2023-06-25 17:10:32,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2032746.0, ans=0.125 2023-06-25 17:11:59,835 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2032986.0, ans=0.025 2023-06-25 17:12:17,380 INFO [train.py:996] (1/4) Epoch 12, batch 3400, loss[loss=0.2329, simple_loss=0.3267, pruned_loss=0.06953, over 21832.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.315, pruned_loss=0.08084, over 4281769.30 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:13:48,690 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.099e+02 1.349e+03 1.796e+03 3.997e+03, threshold=2.698e+03, percent-clipped=5.0 2023-06-25 17:14:13,576 INFO [train.py:996] (1/4) Epoch 12, batch 3450, loss[loss=0.2653, simple_loss=0.338, pruned_loss=0.09632, over 21499.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3105, pruned_loss=0.08051, over 4280509.33 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:15:05,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2033466.0, ans=0.125 2023-06-25 17:15:11,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2033466.0, ans=0.125 2023-06-25 17:15:48,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2033586.0, ans=0.1 2023-06-25 17:15:56,770 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-25 17:16:03,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2033646.0, ans=0.0 2023-06-25 17:16:10,323 INFO [train.py:996] (1/4) Epoch 12, batch 3500, loss[loss=0.2884, simple_loss=0.3582, pruned_loss=0.1093, over 21379.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3192, pruned_loss=0.08419, over 4279480.99 frames. 
], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:16:38,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2033646.0, ans=0.125 2023-06-25 17:17:06,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2033766.0, ans=0.2 2023-06-25 17:17:22,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2033826.0, ans=0.2 2023-06-25 17:17:32,086 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.249e+02 1.085e+03 1.477e+03 2.105e+03 4.608e+03, threshold=2.953e+03, percent-clipped=10.0 2023-06-25 17:17:37,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2033886.0, ans=0.125 2023-06-25 17:17:39,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2033886.0, ans=0.125 2023-06-25 17:17:41,552 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:18:17,208 INFO [train.py:996] (1/4) Epoch 12, batch 3550, loss[loss=0.2064, simple_loss=0.2752, pruned_loss=0.06877, over 21583.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3217, pruned_loss=0.08533, over 4277575.80 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:18:32,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2033946.0, ans=0.2 2023-06-25 17:18:44,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034006.0, ans=0.1 2023-06-25 17:18:44,671 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2034006.0, ans=0.125 2023-06-25 17:20:12,199 INFO [train.py:996] (1/4) Epoch 12, batch 3600, loss[loss=0.2604, simple_loss=0.3232, pruned_loss=0.09885, over 21705.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3172, pruned_loss=0.08523, over 4271719.44 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:20:12,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034246.0, ans=0.1 2023-06-25 17:20:14,958 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2034246.0, ans=0.125 2023-06-25 17:20:23,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2034246.0, ans=0.125 2023-06-25 17:20:48,957 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-25 17:21:20,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-25 17:21:27,852 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 9.639e+02 1.612e+03 2.381e+03 4.879e+03, threshold=3.225e+03, percent-clipped=14.0 2023-06-25 17:21:50,032 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. 
limit=15.0 2023-06-25 17:22:02,918 INFO [train.py:996] (1/4) Epoch 12, batch 3650, loss[loss=0.2872, simple_loss=0.3568, pruned_loss=0.1088, over 21650.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3171, pruned_loss=0.08593, over 4275888.51 frames. ], batch size: 508, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:23:04,799 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-25 17:23:18,039 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2034726.0, ans=0.2 2023-06-25 17:23:30,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2034786.0, ans=0.0 2023-06-25 17:23:40,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2034786.0, ans=0.0 2023-06-25 17:23:51,409 INFO [train.py:996] (1/4) Epoch 12, batch 3700, loss[loss=0.2256, simple_loss=0.2988, pruned_loss=0.0762, over 21463.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3154, pruned_loss=0.08499, over 4281599.96 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:24:23,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2034966.0, ans=0.125 2023-06-25 17:25:05,411 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.433e+02 1.159e+03 1.645e+03 2.871e+03, threshold=2.319e+03, percent-clipped=0.0 2023-06-25 17:25:07,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2035026.0, ans=0.0 2023-06-25 17:25:39,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=22.5 2023-06-25 17:25:39,738 INFO [train.py:996] (1/4) Epoch 12, batch 3750, loss[loss=0.2415, simple_loss=0.3368, pruned_loss=0.07311, over 21316.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3137, pruned_loss=0.08324, over 4285190.62 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:26:37,105 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:26:46,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2035326.0, ans=0.07 2023-06-25 17:27:07,232 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-25 17:27:11,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2035326.0, ans=0.0 2023-06-25 17:27:11,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2035326.0, ans=0.2 2023-06-25 17:27:31,419 INFO [train.py:996] (1/4) Epoch 12, batch 3800, loss[loss=0.2279, simple_loss=0.3086, pruned_loss=0.07362, over 21912.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3096, pruned_loss=0.08111, over 4275741.82 frames. 
], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:28:23,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2035566.0, ans=0.2 2023-06-25 17:29:02,407 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 9.532e+02 1.366e+03 2.202e+03 4.372e+03, threshold=2.732e+03, percent-clipped=24.0 2023-06-25 17:29:13,721 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-25 17:29:18,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2035686.0, ans=0.1 2023-06-25 17:29:24,668 INFO [train.py:996] (1/4) Epoch 12, batch 3850, loss[loss=0.2085, simple_loss=0.2685, pruned_loss=0.07426, over 21329.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3086, pruned_loss=0.08166, over 4272375.56 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:29:31,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-25 17:30:13,350 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=22.5 2023-06-25 17:30:30,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2035926.0, ans=0.0 2023-06-25 17:30:41,869 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-25 17:30:43,336 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:31:16,017 INFO [train.py:996] (1/4) Epoch 12, batch 3900, loss[loss=0.2292, simple_loss=0.3028, pruned_loss=0.07781, over 21859.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3051, pruned_loss=0.0813, over 4266303.41 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:31:41,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2036106.0, ans=0.0 2023-06-25 17:32:06,544 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:32:10,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2036166.0, ans=0.0 2023-06-25 17:32:25,025 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-25 17:32:35,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2036226.0, ans=0.125 2023-06-25 17:32:41,851 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.422e+02 7.358e+02 1.058e+03 1.629e+03 3.913e+03, threshold=2.115e+03, percent-clipped=2.0 2023-06-25 17:33:04,662 INFO [train.py:996] (1/4) Epoch 12, batch 3950, loss[loss=0.2343, simple_loss=0.2922, pruned_loss=0.08819, over 20836.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3058, pruned_loss=0.08017, over 4268725.63 frames. 
], batch size: 611, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:33:43,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-25 17:34:56,654 INFO [train.py:996] (1/4) Epoch 12, batch 4000, loss[loss=0.2178, simple_loss=0.2828, pruned_loss=0.07641, over 21439.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.2991, pruned_loss=0.07702, over 4273821.55 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:35:18,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2036646.0, ans=0.1 2023-06-25 17:35:37,593 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=22.5 2023-06-25 17:36:23,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.30 vs. limit=15.0 2023-06-25 17:36:31,376 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 9.183e+02 1.353e+03 2.384e+03 4.707e+03, threshold=2.707e+03, percent-clipped=29.0 2023-06-25 17:36:51,651 INFO [train.py:996] (1/4) Epoch 12, batch 4050, loss[loss=0.2158, simple_loss=0.3122, pruned_loss=0.05971, over 21687.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2973, pruned_loss=0.07513, over 4267914.60 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:36:53,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2036946.0, ans=0.0 2023-06-25 17:37:21,543 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-25 17:37:22,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2037006.0, ans=0.1 2023-06-25 17:38:44,041 INFO [train.py:996] (1/4) Epoch 12, batch 4100, loss[loss=0.2209, simple_loss=0.296, pruned_loss=0.07287, over 21410.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3, pruned_loss=0.07509, over 4270666.93 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:39:05,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2037246.0, ans=0.2 2023-06-25 17:39:10,352 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-25 17:40:15,662 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.136e+02 8.322e+02 1.179e+03 1.731e+03 4.243e+03, threshold=2.358e+03, percent-clipped=9.0 2023-06-25 17:40:20,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2037486.0, ans=0.2 2023-06-25 17:40:20,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037486.0, ans=0.1 2023-06-25 17:40:37,190 INFO [train.py:996] (1/4) Epoch 12, batch 4150, loss[loss=0.1737, simple_loss=0.2658, pruned_loss=0.0408, over 21485.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2998, pruned_loss=0.07168, over 4276274.67 frames. 
], batch size: 195, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:41:47,141 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-25 17:42:16,148 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:42:21,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037786.0, ans=0.1 2023-06-25 17:42:39,762 INFO [train.py:996] (1/4) Epoch 12, batch 4200, loss[loss=0.2475, simple_loss=0.3456, pruned_loss=0.07468, over 21226.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3017, pruned_loss=0.07239, over 4274756.52 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:42:53,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2037846.0, ans=0.1 2023-06-25 17:43:58,894 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.587e+02 1.298e+03 1.935e+03 2.585e+03 6.035e+03, threshold=3.870e+03, percent-clipped=37.0 2023-06-25 17:44:26,766 INFO [train.py:996] (1/4) Epoch 12, batch 4250, loss[loss=0.2389, simple_loss=0.3268, pruned_loss=0.07555, over 21363.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3056, pruned_loss=0.07358, over 4271782.49 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:45:32,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2038326.0, ans=0.125 2023-06-25 17:46:20,948 INFO [train.py:996] (1/4) Epoch 12, batch 4300, loss[loss=0.2354, simple_loss=0.3105, pruned_loss=0.0802, over 21303.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3121, pruned_loss=0.07545, over 4274292.66 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:47:46,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=2038626.0, ans=10.0 2023-06-25 17:47:49,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-25 17:47:49,731 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.415e+02 1.013e+03 1.560e+03 2.387e+03 5.571e+03, threshold=3.121e+03, percent-clipped=6.0 2023-06-25 17:48:21,422 INFO [train.py:996] (1/4) Epoch 12, batch 4350, loss[loss=0.1888, simple_loss=0.2561, pruned_loss=0.06078, over 21368.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3126, pruned_loss=0.07478, over 4277025.54 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:48:52,432 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-25 17:50:04,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-25 17:50:19,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2038986.0, ans=0.0 2023-06-25 17:50:21,654 INFO [train.py:996] (1/4) Epoch 12, batch 4400, loss[loss=0.2001, simple_loss=0.2735, pruned_loss=0.06339, over 21496.00 frames. 
], tot_loss[loss=0.23, simple_loss=0.31, pruned_loss=0.07498, over 4261995.92 frames. ], batch size: 195, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:50:24,980 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-25 17:50:54,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2039106.0, ans=0.0 2023-06-25 17:51:49,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.673e+02 1.616e+03 2.348e+03 4.188e+03, threshold=3.231e+03, percent-clipped=7.0 2023-06-25 17:52:03,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-25 17:52:06,145 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-25 17:52:17,338 INFO [train.py:996] (1/4) Epoch 12, batch 4450, loss[loss=0.255, simple_loss=0.3462, pruned_loss=0.08187, over 21603.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3142, pruned_loss=0.07536, over 4261889.74 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:52:52,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2039406.0, ans=0.125 2023-06-25 17:53:00,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2039466.0, ans=0.0 2023-06-25 17:53:24,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2039526.0, ans=0.125 2023-06-25 17:54:08,269 INFO [train.py:996] (1/4) Epoch 12, batch 4500, loss[loss=0.2325, simple_loss=0.3137, pruned_loss=0.07561, over 21687.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3146, pruned_loss=0.07743, over 4272262.91 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:54:12,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2039646.0, ans=0.07 2023-06-25 17:54:14,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2039646.0, ans=0.0 2023-06-25 17:54:16,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2039646.0, ans=0.125 2023-06-25 17:54:48,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2039706.0, ans=0.2 2023-06-25 17:55:32,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2039826.0, ans=0.125 2023-06-25 17:55:41,746 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.020e+02 9.450e+02 1.326e+03 2.168e+03 4.753e+03, threshold=2.653e+03, percent-clipped=7.0 2023-06-25 17:56:06,476 INFO [train.py:996] (1/4) Epoch 12, batch 4550, loss[loss=0.2271, simple_loss=0.3269, pruned_loss=0.06359, over 20778.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3183, pruned_loss=0.07821, over 4270695.12 frames. 
], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:57:02,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2040066.0, ans=0.2 2023-06-25 17:57:16,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2040066.0, ans=0.125 2023-06-25 17:57:59,504 INFO [train.py:996] (1/4) Epoch 12, batch 4600, loss[loss=0.2232, simple_loss=0.2965, pruned_loss=0.07496, over 21402.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3203, pruned_loss=0.07949, over 4272471.65 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:58:10,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-25 17:58:32,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2040306.0, ans=0.125 2023-06-25 17:59:10,654 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-25 17:59:23,626 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2040426.0, ans=0.1 2023-06-25 17:59:25,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2040426.0, ans=0.1 2023-06-25 17:59:34,147 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.831e+02 1.036e+03 1.391e+03 1.906e+03 3.939e+03, threshold=2.783e+03, percent-clipped=3.0 2023-06-25 17:59:36,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040486.0, ans=0.1 2023-06-25 17:59:45,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2040486.0, ans=0.025 2023-06-25 17:59:50,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040486.0, ans=0.1 2023-06-25 17:59:53,343 INFO [train.py:996] (1/4) Epoch 12, batch 4650, loss[loss=0.2379, simple_loss=0.3069, pruned_loss=0.08446, over 21396.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3149, pruned_loss=0.07803, over 4273117.51 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:59:55,429 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2040546.0, ans=0.0 2023-06-25 18:00:22,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2040606.0, ans=0.125 2023-06-25 18:00:39,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2040666.0, ans=0.2 2023-06-25 18:01:02,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2040666.0, ans=0.04949747468305833 2023-06-25 18:01:43,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=22.5 2023-06-25 18:01:47,037 INFO [train.py:996] (1/4) Epoch 12, batch 4700, loss[loss=0.1931, simple_loss=0.2587, pruned_loss=0.06379, over 21245.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3085, pruned_loss=0.07645, over 4271467.48 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:01:47,677 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2040846.0, ans=0.0 2023-06-25 18:03:11,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2041026.0, ans=0.125 2023-06-25 18:03:19,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.793e+02 9.702e+02 1.366e+03 2.356e+03 4.995e+03, threshold=2.732e+03, percent-clipped=18.0 2023-06-25 18:03:25,734 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2041086.0, ans=0.125 2023-06-25 18:03:39,174 INFO [train.py:996] (1/4) Epoch 12, batch 4750, loss[loss=0.2306, simple_loss=0.2958, pruned_loss=0.08271, over 21627.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3033, pruned_loss=0.07707, over 4272218.65 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:03:40,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-25 18:04:42,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2041266.0, ans=0.0 2023-06-25 18:04:53,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2041326.0, ans=0.2 2023-06-25 18:04:56,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2041326.0, ans=0.125 2023-06-25 18:05:29,043 INFO [train.py:996] (1/4) Epoch 12, batch 4800, loss[loss=0.2513, simple_loss=0.3177, pruned_loss=0.09246, over 21904.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3038, pruned_loss=0.07815, over 4283879.59 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 18:05:38,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2041446.0, ans=0.125 2023-06-25 18:06:40,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041626.0, ans=0.1 2023-06-25 18:06:56,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.862e+02 9.240e+02 1.237e+03 1.888e+03 3.806e+03, threshold=2.475e+03, percent-clipped=7.0 2023-06-25 18:07:13,714 INFO [train.py:996] (1/4) Epoch 12, batch 4850, loss[loss=0.2469, simple_loss=0.3296, pruned_loss=0.08214, over 21538.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3029, pruned_loss=0.07776, over 4278573.19 frames. 
], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:07:30,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041806.0, ans=0.1 2023-06-25 18:08:28,467 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2041926.0, ans=0.2 2023-06-25 18:09:04,403 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-25 18:09:06,947 INFO [train.py:996] (1/4) Epoch 12, batch 4900, loss[loss=0.2296, simple_loss=0.3008, pruned_loss=0.0792, over 21881.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3038, pruned_loss=0.07825, over 4286430.29 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:09:26,106 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:09:51,740 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2042106.0, ans=0.2 2023-06-25 18:09:53,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2042166.0, ans=0.125 2023-06-25 18:10:30,727 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-25 18:10:36,695 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 9.394e+02 1.371e+03 2.232e+03 4.474e+03, threshold=2.741e+03, percent-clipped=21.0 2023-06-25 18:10:55,763 INFO [train.py:996] (1/4) Epoch 12, batch 4950, loss[loss=0.1925, simple_loss=0.291, pruned_loss=0.04701, over 21680.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3095, pruned_loss=0.07778, over 4287087.34 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:11:18,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2042406.0, ans=0.0 2023-06-25 18:11:54,250 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.64 vs. limit=6.0 2023-06-25 18:12:38,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2042586.0, ans=0.125 2023-06-25 18:12:45,013 INFO [train.py:996] (1/4) Epoch 12, batch 5000, loss[loss=0.2478, simple_loss=0.3198, pruned_loss=0.08795, over 21852.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3088, pruned_loss=0.07423, over 4293979.23 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:12:58,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2042646.0, ans=0.0 2023-06-25 18:13:06,486 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2042706.0, ans=0.125 2023-06-25 18:13:26,692 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. 
limit=22.5 2023-06-25 18:13:39,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2042766.0, ans=0.125 2023-06-25 18:13:48,940 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2042766.0, ans=0.0 2023-06-25 18:14:19,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 9.354e+02 1.620e+03 2.183e+03 4.157e+03, threshold=3.240e+03, percent-clipped=13.0 2023-06-25 18:14:20,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2042886.0, ans=0.125 2023-06-25 18:14:34,562 INFO [train.py:996] (1/4) Epoch 12, batch 5050, loss[loss=0.2549, simple_loss=0.3209, pruned_loss=0.09442, over 21610.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3091, pruned_loss=0.0756, over 4298915.40 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:14:35,052 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2042946.0, ans=0.125 2023-06-25 18:14:42,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2042946.0, ans=0.1 2023-06-25 18:15:21,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2043066.0, ans=0.05 2023-06-25 18:15:50,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-25 18:16:06,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2043186.0, ans=0.2 2023-06-25 18:16:11,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2043186.0, ans=0.0 2023-06-25 18:16:16,449 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2043186.0, ans=10.0 2023-06-25 18:16:24,392 INFO [train.py:996] (1/4) Epoch 12, batch 5100, loss[loss=0.2432, simple_loss=0.3145, pruned_loss=0.086, over 21592.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3077, pruned_loss=0.07653, over 4292789.63 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:16:41,985 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:17:14,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2043366.0, ans=0.0 2023-06-25 18:18:00,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.858e+02 8.212e+02 1.178e+03 1.481e+03 2.753e+03, threshold=2.355e+03, percent-clipped=0.0 2023-06-25 18:18:04,398 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2043486.0, ans=0.125 2023-06-25 18:18:15,438 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 18:18:15,865 INFO [train.py:996] (1/4) Epoch 12, batch 5150, loss[loss=0.227, simple_loss=0.2901, pruned_loss=0.08193, over 21481.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3055, pruned_loss=0.07731, over 4293534.06 frames. 
], batch size: 194, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:18:38,102 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-25 18:19:01,885 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-25 18:19:24,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2043666.0, ans=0.125 2023-06-25 18:19:27,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2043726.0, ans=0.125 2023-06-25 18:20:10,121 INFO [train.py:996] (1/4) Epoch 12, batch 5200, loss[loss=0.211, simple_loss=0.2981, pruned_loss=0.06194, over 21167.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3067, pruned_loss=0.07758, over 4291960.66 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:20:47,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-25 18:20:53,375 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-25 18:21:12,336 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2043966.0, ans=0.125 2023-06-25 18:21:20,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2044026.0, ans=0.125 2023-06-25 18:21:44,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.871e+02 1.084e+03 1.828e+03 2.790e+03 5.969e+03, threshold=3.657e+03, percent-clipped=36.0 2023-06-25 18:22:00,092 INFO [train.py:996] (1/4) Epoch 12, batch 5250, loss[loss=0.3121, simple_loss=0.3893, pruned_loss=0.1175, over 21494.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.307, pruned_loss=0.07593, over 4284125.51 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:22:09,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2044146.0, ans=0.125 2023-06-25 18:22:50,646 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-25 18:22:51,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2044266.0, ans=0.125 2023-06-25 18:22:56,565 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2044266.0, ans=0.0 2023-06-25 18:23:34,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2044386.0, ans=0.125 2023-06-25 18:23:43,157 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2044386.0, ans=0.04949747468305833 2023-06-25 18:23:52,664 INFO [train.py:996] (1/4) Epoch 12, batch 5300, loss[loss=0.2363, simple_loss=0.3009, pruned_loss=0.08584, over 21784.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3052, pruned_loss=0.07563, over 4273028.19 frames. 
], batch size: 231, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:24:16,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 18:24:20,871 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:24:52,547 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.00 vs. limit=15.0 2023-06-25 18:25:02,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2044626.0, ans=0.125 2023-06-25 18:25:20,849 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.228e+02 8.444e+02 1.453e+03 2.194e+03 4.490e+03, threshold=2.906e+03, percent-clipped=2.0 2023-06-25 18:25:39,052 INFO [train.py:996] (1/4) Epoch 12, batch 5350, loss[loss=0.2383, simple_loss=0.2955, pruned_loss=0.09053, over 21375.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3041, pruned_loss=0.07685, over 4271974.45 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:25:53,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2044746.0, ans=0.125 2023-06-25 18:26:11,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2044806.0, ans=0.125 2023-06-25 18:26:59,411 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2044926.0, ans=0.125 2023-06-25 18:27:04,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2044986.0, ans=0.2 2023-06-25 18:27:28,469 INFO [train.py:996] (1/4) Epoch 12, batch 5400, loss[loss=0.2095, simple_loss=0.2788, pruned_loss=0.0701, over 21495.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3035, pruned_loss=0.07809, over 4281693.25 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:27:51,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-25 18:27:58,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2045106.0, ans=0.125 2023-06-25 18:28:39,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2045226.0, ans=0.125 2023-06-25 18:28:51,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2045226.0, ans=0.2 2023-06-25 18:29:03,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.524e+02 1.484e+03 2.155e+03 3.165e+03, threshold=2.968e+03, percent-clipped=5.0 2023-06-25 18:29:06,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2045286.0, ans=0.125 2023-06-25 18:29:17,840 INFO [train.py:996] (1/4) Epoch 12, batch 5450, loss[loss=0.2351, simple_loss=0.3283, pruned_loss=0.07093, over 21394.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3031, pruned_loss=0.07582, over 4278298.49 frames. 
], batch size: 211, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:30:19,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2045466.0, ans=0.0 2023-06-25 18:30:20,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.19 vs. limit=10.0 2023-06-25 18:31:20,598 INFO [train.py:996] (1/4) Epoch 12, batch 5500, loss[loss=0.2218, simple_loss=0.3355, pruned_loss=0.054, over 21192.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3088, pruned_loss=0.07263, over 4274050.07 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:32:32,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2045826.0, ans=0.1 2023-06-25 18:32:54,278 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.103e+02 8.816e+02 1.255e+03 2.207e+03 4.493e+03, threshold=2.511e+03, percent-clipped=9.0 2023-06-25 18:33:09,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2045886.0, ans=0.125 2023-06-25 18:33:15,533 INFO [train.py:996] (1/4) Epoch 12, batch 5550, loss[loss=0.1699, simple_loss=0.2508, pruned_loss=0.04448, over 21031.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3112, pruned_loss=0.07077, over 4269333.26 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:34:26,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2046126.0, ans=0.125 2023-06-25 18:34:34,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2046126.0, ans=0.125 2023-06-25 18:35:15,782 INFO [train.py:996] (1/4) Epoch 12, batch 5600, loss[loss=0.2608, simple_loss=0.3513, pruned_loss=0.08514, over 21770.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3124, pruned_loss=0.0692, over 4268738.86 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:35:23,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2046246.0, ans=0.125 2023-06-25 18:36:01,925 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2046366.0, ans=10.0 2023-06-25 18:36:36,442 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046426.0, ans=0.1 2023-06-25 18:36:43,823 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 9.096e+02 1.468e+03 2.250e+03 5.132e+03, threshold=2.936e+03, percent-clipped=21.0 2023-06-25 18:37:03,769 INFO [train.py:996] (1/4) Epoch 12, batch 5650, loss[loss=0.2508, simple_loss=0.3176, pruned_loss=0.09199, over 20003.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.313, pruned_loss=0.07129, over 4271738.84 frames. 
], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:37:28,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2046606.0, ans=0.0 2023-06-25 18:37:42,568 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2046666.0, ans=0.1 2023-06-25 18:38:18,428 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2046726.0, ans=0.125 2023-06-25 18:38:32,801 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-25 18:38:53,837 INFO [train.py:996] (1/4) Epoch 12, batch 5700, loss[loss=0.2354, simple_loss=0.3595, pruned_loss=0.05568, over 19776.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3129, pruned_loss=0.07335, over 4276438.67 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:39:07,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2046846.0, ans=0.2 2023-06-25 18:39:48,415 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-25 18:40:00,261 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.61 vs. limit=15.0 2023-06-25 18:40:01,552 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-25 18:40:34,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 7.754e+02 1.090e+03 1.653e+03 4.716e+03, threshold=2.180e+03, percent-clipped=6.0 2023-06-25 18:40:48,816 INFO [train.py:996] (1/4) Epoch 12, batch 5750, loss[loss=0.159, simple_loss=0.2506, pruned_loss=0.03373, over 21826.00 frames. ], tot_loss[loss=0.225, simple_loss=0.31, pruned_loss=0.06999, over 4262250.49 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:41:21,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2047206.0, ans=0.05 2023-06-25 18:42:46,129 INFO [train.py:996] (1/4) Epoch 12, batch 5800, loss[loss=0.1806, simple_loss=0.269, pruned_loss=0.04603, over 21292.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3092, pruned_loss=0.06882, over 4257622.66 frames. 
], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:43:07,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2047446.0, ans=0.125 2023-06-25 18:43:39,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2047566.0, ans=0.0 2023-06-25 18:43:41,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2047566.0, ans=0.125 2023-06-25 18:44:07,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2047626.0, ans=0.2 2023-06-25 18:44:17,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 1.046e+03 1.663e+03 2.546e+03 5.272e+03, threshold=3.326e+03, percent-clipped=34.0 2023-06-25 18:44:37,362 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 18:44:39,619 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2047686.0, ans=0.04949747468305833 2023-06-25 18:44:42,286 INFO [train.py:996] (1/4) Epoch 12, batch 5850, loss[loss=0.1932, simple_loss=0.3069, pruned_loss=0.03972, over 21785.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3081, pruned_loss=0.06577, over 4266492.59 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:45:05,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2047806.0, ans=0.1 2023-06-25 18:45:49,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2047926.0, ans=0.125 2023-06-25 18:46:05,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2047986.0, ans=0.125 2023-06-25 18:46:34,305 INFO [train.py:996] (1/4) Epoch 12, batch 5900, loss[loss=0.2165, simple_loss=0.3093, pruned_loss=0.06185, over 21453.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.3014, pruned_loss=0.06115, over 4272656.86 frames. ], batch size: 507, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:46:46,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2048046.0, ans=0.0 2023-06-25 18:47:12,704 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-25 18:47:22,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2048166.0, ans=0.025 2023-06-25 18:47:39,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-25 18:48:00,969 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 8.415e+02 1.286e+03 1.722e+03 5.284e+03, threshold=2.571e+03, percent-clipped=3.0 2023-06-25 18:48:22,664 INFO [train.py:996] (1/4) Epoch 12, batch 5950, loss[loss=0.2018, simple_loss=0.265, pruned_loss=0.06926, over 21675.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.3007, pruned_loss=0.06384, over 4276849.97 frames. 
], batch size: 231, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:49:07,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048466.0, ans=0.125 2023-06-25 18:49:10,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2048466.0, ans=0.125 2023-06-25 18:50:19,504 INFO [train.py:996] (1/4) Epoch 12, batch 6000, loss[loss=0.2039, simple_loss=0.2651, pruned_loss=0.07134, over 21739.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2966, pruned_loss=0.06716, over 4274200.53 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 32.0 2023-06-25 18:50:19,510 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 18:50:34,261 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.8228, 3.2030, 3.1452, 1.7486], device='cuda:1') 2023-06-25 18:50:36,760 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2561, simple_loss=0.3516, pruned_loss=0.08031, over 1796401.00 frames. 2023-06-25 18:50:36,761 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 18:51:20,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2048766.0, ans=0.125 2023-06-25 18:51:47,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2048826.0, ans=0.0 2023-06-25 18:51:49,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-25 18:52:09,310 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2048886.0, ans=0.0 2023-06-25 18:52:10,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2048886.0, ans=0.2 2023-06-25 18:52:13,644 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.382e+02 1.134e+03 1.580e+03 2.203e+03 4.281e+03, threshold=3.160e+03, percent-clipped=13.0 2023-06-25 18:52:25,705 INFO [train.py:996] (1/4) Epoch 12, batch 6050, loss[loss=0.2288, simple_loss=0.2896, pruned_loss=0.08399, over 21652.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2928, pruned_loss=0.06869, over 4266709.28 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:52:40,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2048946.0, ans=0.125 2023-06-25 18:52:49,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-25 18:53:05,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-25 18:54:14,014 INFO [train.py:996] (1/4) Epoch 12, batch 6100, loss[loss=0.2306, simple_loss=0.3093, pruned_loss=0.07595, over 21816.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2915, pruned_loss=0.06727, over 4270380.12 frames. 
], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:55:36,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2049426.0, ans=0.0 2023-06-25 18:55:46,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2049486.0, ans=0.125 2023-06-25 18:55:51,233 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 1.094e+03 1.641e+03 2.565e+03 7.490e+03, threshold=3.281e+03, percent-clipped=16.0 2023-06-25 18:56:03,373 INFO [train.py:996] (1/4) Epoch 12, batch 6150, loss[loss=0.22, simple_loss=0.2903, pruned_loss=0.0748, over 21800.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2942, pruned_loss=0.0702, over 4270572.38 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:57:56,229 INFO [train.py:996] (1/4) Epoch 12, batch 6200, loss[loss=0.2343, simple_loss=0.317, pruned_loss=0.07582, over 21624.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.07045, over 4273595.97 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:59:18,879 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2050026.0, ans=0.0 2023-06-25 18:59:28,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 1.039e+03 1.350e+03 2.072e+03 4.321e+03, threshold=2.700e+03, percent-clipped=2.0 2023-06-25 18:59:45,871 INFO [train.py:996] (1/4) Epoch 12, batch 6250, loss[loss=0.1951, simple_loss=0.3128, pruned_loss=0.03873, over 20848.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3007, pruned_loss=0.07006, over 4267653.89 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:00:11,073 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2050206.0, ans=0.125 2023-06-25 19:00:21,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2050206.0, ans=0.125 2023-06-25 19:00:40,831 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050266.0, ans=0.1 2023-06-25 19:01:34,303 INFO [train.py:996] (1/4) Epoch 12, batch 6300, loss[loss=0.2166, simple_loss=0.3049, pruned_loss=0.06412, over 21850.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3048, pruned_loss=0.06891, over 4278213.16 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:02:15,048 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-25 19:02:16,068 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2050506.0, ans=0.125 2023-06-25 19:02:33,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2050566.0, ans=0.125 2023-06-25 19:03:12,174 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 9.742e+02 1.547e+03 2.074e+03 3.988e+03, threshold=3.093e+03, percent-clipped=9.0 2023-06-25 19:03:22,556 INFO [train.py:996] (1/4) Epoch 12, batch 6350, loss[loss=0.2512, simple_loss=0.3208, pruned_loss=0.09078, over 21568.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3076, pruned_loss=0.07301, over 4285085.70 frames. 
], batch size: 230, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:03:23,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2050746.0, ans=0.125 2023-06-25 19:03:50,944 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=2050806.0, ans=0.2 2023-06-25 19:04:07,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-25 19:04:10,596 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2050866.0, ans=0.125 2023-06-25 19:05:06,321 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-25 19:05:15,857 INFO [train.py:996] (1/4) Epoch 12, batch 6400, loss[loss=0.1944, simple_loss=0.3181, pruned_loss=0.03529, over 19767.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3133, pruned_loss=0.07673, over 4273622.02 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:05:55,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2051106.0, ans=0.0 2023-06-25 19:06:23,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2051166.0, ans=0.05 2023-06-25 19:06:29,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2051226.0, ans=0.0 2023-06-25 19:06:41,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-25 19:06:54,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.078e+02 9.080e+02 1.264e+03 1.614e+03 4.055e+03, threshold=2.529e+03, percent-clipped=6.0 2023-06-25 19:06:58,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2051286.0, ans=0.2 2023-06-25 19:07:09,055 INFO [train.py:996] (1/4) Epoch 12, batch 6450, loss[loss=0.2739, simple_loss=0.3518, pruned_loss=0.09798, over 21601.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3149, pruned_loss=0.0769, over 4277161.53 frames. 
], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:07:21,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2051346.0, ans=0.1 2023-06-25 19:07:29,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2051346.0, ans=0.1 2023-06-25 19:07:43,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2051406.0, ans=0.125 2023-06-25 19:07:44,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2051406.0, ans=0.1 2023-06-25 19:08:05,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2051466.0, ans=0.0 2023-06-25 19:08:11,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2051466.0, ans=0.0 2023-06-25 19:08:24,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2051526.0, ans=0.05 2023-06-25 19:08:34,028 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-25 19:08:51,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2051586.0, ans=0.125 2023-06-25 19:08:52,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2051586.0, ans=0.125 2023-06-25 19:08:59,162 INFO [train.py:996] (1/4) Epoch 12, batch 6500, loss[loss=0.228, simple_loss=0.2863, pruned_loss=0.08489, over 21271.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3071, pruned_loss=0.07587, over 4262478.72 frames. ], batch size: 131, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:10:18,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2051826.0, ans=0.015 2023-06-25 19:10:35,313 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 7.750e+02 1.055e+03 1.738e+03 4.061e+03, threshold=2.109e+03, percent-clipped=10.0 2023-06-25 19:10:44,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2051886.0, ans=0.0 2023-06-25 19:10:52,977 INFO [train.py:996] (1/4) Epoch 12, batch 6550, loss[loss=0.2204, simple_loss=0.2978, pruned_loss=0.07149, over 21825.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3068, pruned_loss=0.07483, over 4267372.36 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:11:21,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2052006.0, ans=0.125 2023-06-25 19:11:41,876 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2052066.0, ans=0.0 2023-06-25 19:11:48,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2052066.0, ans=0.0 2023-06-25 19:12:06,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. 
limit=15.0 2023-06-25 19:12:09,190 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2052126.0, ans=0.0 2023-06-25 19:12:14,956 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-25 19:12:42,070 INFO [train.py:996] (1/4) Epoch 12, batch 6600, loss[loss=0.1835, simple_loss=0.2442, pruned_loss=0.0614, over 21242.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3015, pruned_loss=0.075, over 4275577.55 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:12:53,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2052246.0, ans=0.0 2023-06-25 19:13:00,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2052246.0, ans=0.1 2023-06-25 19:14:05,651 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2052426.0, ans=0.5 2023-06-25 19:14:07,109 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2052486.0, ans=0.2 2023-06-25 19:14:22,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 6.750e+02 1.048e+03 1.422e+03 4.566e+03, threshold=2.096e+03, percent-clipped=9.0 2023-06-25 19:14:36,289 INFO [train.py:996] (1/4) Epoch 12, batch 6650, loss[loss=0.2401, simple_loss=0.3007, pruned_loss=0.08975, over 21376.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.294, pruned_loss=0.07171, over 4274037.46 frames. ], batch size: 508, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:14:56,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2052606.0, ans=0.125 2023-06-25 19:15:28,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2052666.0, ans=0.125 2023-06-25 19:16:24,139 INFO [train.py:996] (1/4) Epoch 12, batch 6700, loss[loss=0.1985, simple_loss=0.2673, pruned_loss=0.06484, over 21735.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2907, pruned_loss=0.07258, over 4273617.88 frames. 
], batch size: 118, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:16:26,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2052846.0, ans=0.0 2023-06-25 19:17:22,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2053026.0, ans=10.0 2023-06-25 19:17:34,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2053026.0, ans=0.2 2023-06-25 19:17:46,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2053086.0, ans=0.09899494936611666 2023-06-25 19:17:56,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2053086.0, ans=0.125 2023-06-25 19:17:59,535 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.379e+02 9.118e+02 1.454e+03 2.169e+03 5.767e+03, threshold=2.907e+03, percent-clipped=27.0 2023-06-25 19:18:10,557 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2053086.0, ans=0.5 2023-06-25 19:18:13,658 INFO [train.py:996] (1/4) Epoch 12, batch 6750, loss[loss=0.2376, simple_loss=0.3048, pruned_loss=0.08521, over 21862.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2905, pruned_loss=0.07223, over 4269723.91 frames. ], batch size: 118, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:18:28,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2053206.0, ans=0.0 2023-06-25 19:18:35,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2053206.0, ans=0.125 2023-06-25 19:20:02,669 INFO [train.py:996] (1/4) Epoch 12, batch 6800, loss[loss=0.1909, simple_loss=0.2778, pruned_loss=0.05202, over 21122.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2923, pruned_loss=0.07473, over 4275151.83 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:21:22,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2053686.0, ans=0.1 2023-06-25 19:21:29,045 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.142e+02 1.053e+03 1.480e+03 2.142e+03 3.474e+03, threshold=2.960e+03, percent-clipped=10.0 2023-06-25 19:21:42,863 INFO [train.py:996] (1/4) Epoch 12, batch 6850, loss[loss=0.2161, simple_loss=0.2745, pruned_loss=0.0788, over 21422.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.29, pruned_loss=0.07614, over 4276775.05 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:22:02,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2053746.0, ans=0.125 2023-06-25 19:22:07,732 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2053806.0, ans=0.125 2023-06-25 19:22:12,776 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2053806.0, ans=0.125 2023-06-25 19:22:44,205 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-25 19:23:38,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2054046.0, ans=0.125 2023-06-25 19:23:39,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-25 19:23:39,673 INFO [train.py:996] (1/4) Epoch 12, batch 6900, loss[loss=0.1952, simple_loss=0.2782, pruned_loss=0.05608, over 21374.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2907, pruned_loss=0.07621, over 4278018.72 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:25:04,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2054286.0, ans=0.0 2023-06-25 19:25:18,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2054286.0, ans=0.125 2023-06-25 19:25:20,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.318e+02 8.019e+02 1.199e+03 1.523e+03 3.693e+03, threshold=2.398e+03, percent-clipped=1.0 2023-06-25 19:25:26,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=2054286.0, ans=0.025 2023-06-25 19:25:28,996 INFO [train.py:996] (1/4) Epoch 12, batch 6950, loss[loss=0.2676, simple_loss=0.3381, pruned_loss=0.09851, over 21851.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2929, pruned_loss=0.07326, over 4280569.53 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:25:43,265 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2054346.0, ans=0.125 2023-06-25 19:27:17,513 INFO [train.py:996] (1/4) Epoch 12, batch 7000, loss[loss=0.2076, simple_loss=0.273, pruned_loss=0.07112, over 21730.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2964, pruned_loss=0.07567, over 4275042.39 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:27:27,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2054646.0, ans=0.2 2023-06-25 19:28:14,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2054766.0, ans=0.0 2023-06-25 19:28:36,993 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.97 vs. limit=6.0 2023-06-25 19:28:43,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054826.0, ans=0.1 2023-06-25 19:28:56,673 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.600e+02 9.157e+02 1.451e+03 1.908e+03 5.399e+03, threshold=2.901e+03, percent-clipped=14.0 2023-06-25 19:29:05,352 INFO [train.py:996] (1/4) Epoch 12, batch 7050, loss[loss=0.2176, simple_loss=0.3157, pruned_loss=0.0597, over 21264.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2934, pruned_loss=0.07363, over 4267305.87 frames. 
], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:29:23,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2054946.0, ans=0.0 2023-06-25 19:29:59,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2055066.0, ans=0.04949747468305833 2023-06-25 19:30:58,067 INFO [train.py:996] (1/4) Epoch 12, batch 7100, loss[loss=0.1698, simple_loss=0.2427, pruned_loss=0.04839, over 21283.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2989, pruned_loss=0.07566, over 4268188.86 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:32:32,639 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.001e+02 9.554e+02 1.238e+03 1.810e+03 4.265e+03, threshold=2.476e+03, percent-clipped=5.0 2023-06-25 19:32:40,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055486.0, ans=0.1 2023-06-25 19:32:44,690 INFO [train.py:996] (1/4) Epoch 12, batch 7150, loss[loss=0.2985, simple_loss=0.356, pruned_loss=0.1205, over 21433.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2959, pruned_loss=0.07314, over 4270678.50 frames. ], batch size: 510, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:32:45,510 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:33:04,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2055606.0, ans=0.125 2023-06-25 19:33:18,788 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2055606.0, ans=0.025 2023-06-25 19:33:56,522 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2055726.0, ans=0.125 2023-06-25 19:34:01,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2055726.0, ans=0.2 2023-06-25 19:34:31,906 INFO [train.py:996] (1/4) Epoch 12, batch 7200, loss[loss=0.2041, simple_loss=0.2742, pruned_loss=0.06701, over 21793.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.299, pruned_loss=0.07588, over 4270721.11 frames. ], batch size: 317, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:34:56,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2055906.0, ans=0.0 2023-06-25 19:35:28,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2055966.0, ans=10.0 2023-06-25 19:35:44,031 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. 
limit=6.0 2023-06-25 19:36:05,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2056086.0, ans=0.0 2023-06-25 19:36:08,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2056086.0, ans=0.0 2023-06-25 19:36:16,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2056086.0, ans=0.0 2023-06-25 19:36:17,054 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.008e+03 1.554e+03 2.511e+03 5.348e+03, threshold=3.107e+03, percent-clipped=25.0 2023-06-25 19:36:21,746 INFO [train.py:996] (1/4) Epoch 12, batch 7250, loss[loss=0.1849, simple_loss=0.2491, pruned_loss=0.06036, over 21406.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2939, pruned_loss=0.07525, over 4265214.53 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:37:12,202 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2056266.0, ans=0.0 2023-06-25 19:37:45,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-25 19:38:09,316 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-25 19:38:09,795 INFO [train.py:996] (1/4) Epoch 12, batch 7300, loss[loss=0.2243, simple_loss=0.2781, pruned_loss=0.0853, over 21228.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2889, pruned_loss=0.07433, over 4270821.38 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:38:45,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2056506.0, ans=0.125 2023-06-25 19:39:04,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2056566.0, ans=0.125 2023-06-25 19:39:54,693 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.292e+02 1.311e+03 1.863e+03 3.675e+03, threshold=2.622e+03, percent-clipped=4.0 2023-06-25 19:39:59,537 INFO [train.py:996] (1/4) Epoch 12, batch 7350, loss[loss=0.2305, simple_loss=0.3027, pruned_loss=0.07911, over 21366.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2869, pruned_loss=0.07511, over 4277050.53 frames. ], batch size: 549, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:40:59,510 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-25 19:42:01,528 INFO [train.py:996] (1/4) Epoch 12, batch 7400, loss[loss=0.242, simple_loss=0.3354, pruned_loss=0.07429, over 21694.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2954, pruned_loss=0.07719, over 4275221.28 frames. ], batch size: 415, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:42:35,125 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-25 19:42:36,940 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=22.5 2023-06-25 19:43:43,651 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 8.354e+02 1.413e+03 2.372e+03 4.608e+03, threshold=2.826e+03, percent-clipped=17.0 2023-06-25 19:43:47,750 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:43:48,749 INFO [train.py:996] (1/4) Epoch 12, batch 7450, loss[loss=0.2439, simple_loss=0.3057, pruned_loss=0.09102, over 21875.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2957, pruned_loss=0.07539, over 4266122.32 frames. ], batch size: 373, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:45:22,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-25 19:45:39,205 INFO [train.py:996] (1/4) Epoch 12, batch 7500, loss[loss=0.2648, simple_loss=0.3556, pruned_loss=0.08698, over 21444.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3002, pruned_loss=0.0758, over 4269883.30 frames. ], batch size: 211, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:45:57,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2057646.0, ans=0.125 2023-06-25 19:46:12,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2057706.0, ans=0.05 2023-06-25 19:46:26,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2057766.0, ans=0.09899494936611666 2023-06-25 19:46:29,046 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=22.5 2023-06-25 19:47:23,850 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.023e+02 9.552e+02 1.495e+03 2.068e+03 4.249e+03, threshold=2.990e+03, percent-clipped=12.0 2023-06-25 19:47:36,198 INFO [train.py:996] (1/4) Epoch 12, batch 7550, loss[loss=0.2148, simple_loss=0.3065, pruned_loss=0.06154, over 21653.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3083, pruned_loss=0.07597, over 4265751.83 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:47:38,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2057946.0, ans=0.0 2023-06-25 19:48:59,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2058186.0, ans=0.2 2023-06-25 19:49:17,673 INFO [train.py:996] (1/4) Epoch 12, batch 7600, loss[loss=0.1932, simple_loss=0.2712, pruned_loss=0.05753, over 21856.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3081, pruned_loss=0.0751, over 4274881.22 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:50:06,366 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058366.0, ans=0.1 2023-06-25 19:51:01,882 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 8.115e+02 1.091e+03 1.750e+03 3.922e+03, threshold=2.181e+03, percent-clipped=5.0 2023-06-25 19:51:12,224 INFO [train.py:996] (1/4) Epoch 12, batch 7650, loss[loss=0.2305, simple_loss=0.296, pruned_loss=0.08256, over 21833.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3061, pruned_loss=0.07645, over 4286303.87 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:52:11,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2058726.0, ans=0.015 2023-06-25 19:52:45,563 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-25 19:53:04,309 INFO [train.py:996] (1/4) Epoch 12, batch 7700, loss[loss=0.247, simple_loss=0.3238, pruned_loss=0.08508, over 21576.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3087, pruned_loss=0.07935, over 4287392.26 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:53:12,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-25 19:53:28,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2058906.0, ans=0.125 2023-06-25 19:53:44,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2058966.0, ans=0.2 2023-06-25 19:53:45,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.97 vs. limit=12.0 2023-06-25 19:54:52,163 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.071e+02 8.849e+02 1.361e+03 2.010e+03 4.988e+03, threshold=2.722e+03, percent-clipped=19.0 2023-06-25 19:55:00,909 INFO [train.py:996] (1/4) Epoch 12, batch 7750, loss[loss=0.2265, simple_loss=0.3245, pruned_loss=0.06423, over 21426.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3119, pruned_loss=0.07936, over 4273969.55 frames. ], batch size: 211, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:55:08,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2059146.0, ans=0.0 2023-06-25 19:55:10,196 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059146.0, ans=0.125 2023-06-25 19:55:25,056 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-25 19:55:31,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059206.0, ans=0.125 2023-06-25 19:55:50,478 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-25 19:55:55,226 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2059266.0, ans=0.1 2023-06-25 19:55:58,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2059266.0, ans=0.125 2023-06-25 19:55:59,053 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-25 19:56:50,897 INFO [train.py:996] (1/4) Epoch 12, batch 7800, loss[loss=0.3054, simple_loss=0.4282, pruned_loss=0.09132, over 19845.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3132, pruned_loss=0.08026, over 4269859.27 frames. 
], batch size: 702, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:56:56,681 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2059446.0, ans=0.025 2023-06-25 19:57:36,672 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2059566.0, ans=0.125 2023-06-25 19:57:36,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2059566.0, ans=0.0 2023-06-25 19:57:38,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2059566.0, ans=0.125 2023-06-25 19:58:32,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.35 vs. limit=22.5 2023-06-25 19:58:38,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.116e+03 1.579e+03 2.542e+03 4.639e+03, threshold=3.158e+03, percent-clipped=18.0 2023-06-25 19:58:42,200 INFO [train.py:996] (1/4) Epoch 12, batch 7850, loss[loss=0.2287, simple_loss=0.2872, pruned_loss=0.08511, over 21737.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.308, pruned_loss=0.07943, over 4263953.10 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:58:52,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2059746.0, ans=0.2 2023-06-25 19:59:37,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2059866.0, ans=0.125 2023-06-25 19:59:44,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2059926.0, ans=0.2 2023-06-25 19:59:44,989 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2059926.0, ans=10.0 2023-06-25 20:00:06,998 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2059926.0, ans=0.125 2023-06-25 20:00:33,873 INFO [train.py:996] (1/4) Epoch 12, batch 7900, loss[loss=0.1974, simple_loss=0.2707, pruned_loss=0.06202, over 21700.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3029, pruned_loss=0.07817, over 4265252.81 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:01:02,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2060106.0, ans=0.125 2023-06-25 20:01:45,621 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:01:47,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2060226.0, ans=0.125 2023-06-25 20:02:23,975 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.116e+02 9.466e+02 1.589e+03 2.565e+03 7.022e+03, threshold=3.178e+03, percent-clipped=11.0 2023-06-25 20:02:27,427 INFO [train.py:996] (1/4) Epoch 12, batch 7950, loss[loss=0.2017, simple_loss=0.2834, pruned_loss=0.05997, over 21074.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3069, pruned_loss=0.07753, over 4253753.07 frames. 
], batch size: 143, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:03:19,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2060466.0, ans=0.125 2023-06-25 20:04:05,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2060586.0, ans=0.125 2023-06-25 20:04:07,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2060586.0, ans=0.0 2023-06-25 20:04:07,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2060586.0, ans=0.2 2023-06-25 20:04:21,430 INFO [train.py:996] (1/4) Epoch 12, batch 8000, loss[loss=0.2626, simple_loss=0.3317, pruned_loss=0.09672, over 21183.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3135, pruned_loss=0.07984, over 4255597.52 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:04:22,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2060646.0, ans=0.125 2023-06-25 20:04:54,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2060706.0, ans=0.035 2023-06-25 20:04:59,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-25 20:05:29,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2060766.0, ans=0.125 2023-06-25 20:05:34,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2060826.0, ans=0.0 2023-06-25 20:06:08,301 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2060886.0, ans=0.0 2023-06-25 20:06:16,552 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 1.106e+03 1.585e+03 2.816e+03 6.535e+03, threshold=3.170e+03, percent-clipped=17.0 2023-06-25 20:06:25,706 INFO [train.py:996] (1/4) Epoch 12, batch 8050, loss[loss=0.2754, simple_loss=0.3615, pruned_loss=0.09461, over 21756.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.317, pruned_loss=0.07985, over 4259553.87 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:06:29,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2060946.0, ans=0.1 2023-06-25 20:06:40,448 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2060946.0, ans=0.0 2023-06-25 20:07:00,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2061006.0, ans=0.0 2023-06-25 20:07:05,744 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2061066.0, ans=0.5 2023-06-25 20:08:01,579 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2061186.0, ans=0.0 2023-06-25 20:08:15,660 INFO [train.py:996] (1/4) Epoch 12, batch 8100, loss[loss=0.2092, simple_loss=0.284, pruned_loss=0.06726, over 21514.00 frames. ], tot_loss[loss=0.237, simple_loss=0.314, pruned_loss=0.08006, over 4270308.90 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:08:39,294 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-06-25 20:09:54,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2061486.0, ans=0.125 2023-06-25 20:10:11,168 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 1.217e+03 1.664e+03 2.613e+03 6.278e+03, threshold=3.327e+03, percent-clipped=14.0 2023-06-25 20:10:14,566 INFO [train.py:996] (1/4) Epoch 12, batch 8150, loss[loss=0.2355, simple_loss=0.3359, pruned_loss=0.06755, over 21677.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3229, pruned_loss=0.08132, over 4269762.16 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:11:32,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2061726.0, ans=0.05 2023-06-25 20:11:37,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2061726.0, ans=0.125 2023-06-25 20:12:03,281 INFO [train.py:996] (1/4) Epoch 12, batch 8200, loss[loss=0.196, simple_loss=0.2615, pruned_loss=0.06521, over 21303.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3163, pruned_loss=0.07986, over 4272648.71 frames. ], batch size: 551, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:12:38,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-25 20:13:50,890 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.075e+02 8.315e+02 1.225e+03 2.435e+03 5.897e+03, threshold=2.449e+03, percent-clipped=16.0 2023-06-25 20:13:54,421 INFO [train.py:996] (1/4) Epoch 12, batch 8250, loss[loss=0.229, simple_loss=0.3253, pruned_loss=0.06636, over 21693.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3128, pruned_loss=0.0798, over 4265283.92 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:14:58,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2062266.0, ans=0.0 2023-06-25 20:15:02,298 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2062326.0, ans=0.0 2023-06-25 20:15:03,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2062326.0, ans=0.125 2023-06-25 20:15:12,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2062326.0, ans=0.125 2023-06-25 20:15:44,057 INFO [train.py:996] (1/4) Epoch 12, batch 8300, loss[loss=0.233, simple_loss=0.3109, pruned_loss=0.07755, over 21650.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3114, pruned_loss=0.07732, over 4264995.67 frames. 
], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:16:49,469 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:16:59,911 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.581e-03 2023-06-25 20:17:19,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2062686.0, ans=0.1 2023-06-25 20:17:29,300 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.193e+02 1.531e+03 2.133e+03 6.656e+03, threshold=3.063e+03, percent-clipped=18.0 2023-06-25 20:17:32,754 INFO [train.py:996] (1/4) Epoch 12, batch 8350, loss[loss=0.202, simple_loss=0.2928, pruned_loss=0.05556, over 21560.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3106, pruned_loss=0.0754, over 4261996.59 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:17:43,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=2062746.0, ans=0.2 2023-06-25 20:17:59,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2062806.0, ans=0.125 2023-06-25 20:18:12,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2062806.0, ans=0.1 2023-06-25 20:18:19,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2062866.0, ans=0.125 2023-06-25 20:19:16,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2062986.0, ans=0.2 2023-06-25 20:19:26,207 INFO [train.py:996] (1/4) Epoch 12, batch 8400, loss[loss=0.1814, simple_loss=0.2793, pruned_loss=0.04178, over 21784.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3092, pruned_loss=0.07295, over 4269238.14 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 20:19:44,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2063046.0, ans=0.2 2023-06-25 20:19:48,820 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-25 20:19:51,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063106.0, ans=0.1 2023-06-25 20:19:53,293 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2063106.0, ans=0.0 2023-06-25 20:20:04,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2063106.0, ans=0.0 2023-06-25 20:20:09,849 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2063166.0, ans=0.125 2023-06-25 20:20:17,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063166.0, ans=0.1 2023-06-25 20:20:38,825 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-06-25 20:21:05,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2063286.0, ans=0.1 2023-06-25 20:21:14,096 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.481e+02 1.044e+03 1.681e+03 2.727e+03 5.790e+03, threshold=3.363e+03, percent-clipped=19.0 2023-06-25 20:21:14,127 INFO [train.py:996] (1/4) Epoch 12, batch 8450, loss[loss=0.265, simple_loss=0.3146, pruned_loss=0.1076, over 21788.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3076, pruned_loss=0.07302, over 4273374.73 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:21:31,476 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-25 20:21:58,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2063466.0, ans=0.07 2023-06-25 20:22:45,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2063586.0, ans=0.5 2023-06-25 20:22:58,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2063586.0, ans=0.025 2023-06-25 20:23:01,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 20:23:04,274 INFO [train.py:996] (1/4) Epoch 12, batch 8500, loss[loss=0.1962, simple_loss=0.2585, pruned_loss=0.06694, over 21473.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3038, pruned_loss=0.07423, over 4270674.84 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:24:10,806 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-25 20:24:13,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-25 20:24:48,532 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:24:52,480 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-25 20:24:58,602 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.731e+02 1.377e+03 2.109e+03 5.965e+03, threshold=2.755e+03, percent-clipped=8.0 2023-06-25 20:24:58,625 INFO [train.py:996] (1/4) Epoch 12, batch 8550, loss[loss=0.2507, simple_loss=0.3335, pruned_loss=0.08396, over 21828.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3081, pruned_loss=0.07667, over 4275153.60 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:25:15,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2063946.0, ans=0.125 2023-06-25 20:25:17,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2063946.0, ans=0.2 2023-06-25 20:25:39,569 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-06-25 20:26:16,185 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2064126.0, ans=0.125 2023-06-25 20:26:21,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2064126.0, ans=0.0 2023-06-25 20:26:33,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2064186.0, ans=0.0 2023-06-25 20:26:56,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2064246.0, ans=0.0 2023-06-25 20:26:57,768 INFO [train.py:996] (1/4) Epoch 12, batch 8600, loss[loss=0.2621, simple_loss=0.3341, pruned_loss=0.09501, over 21469.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3137, pruned_loss=0.07841, over 4272213.93 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:27:14,263 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2064246.0, ans=0.125 2023-06-25 20:27:17,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064246.0, ans=0.1 2023-06-25 20:28:08,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2064426.0, ans=15.0 2023-06-25 20:28:15,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2064426.0, ans=0.1 2023-06-25 20:28:36,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064486.0, ans=0.1 2023-06-25 20:28:53,067 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.484e+02 8.709e+02 1.194e+03 1.940e+03 4.638e+03, threshold=2.389e+03, percent-clipped=12.0 2023-06-25 20:28:53,098 INFO [train.py:996] (1/4) Epoch 12, batch 8650, loss[loss=0.1884, simple_loss=0.2399, pruned_loss=0.06845, over 20037.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3202, pruned_loss=0.08014, over 4275645.94 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:29:39,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2064666.0, ans=0.125 2023-06-25 20:29:51,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2064666.0, ans=0.2 2023-06-25 20:30:32,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 20:30:40,433 INFO [train.py:996] (1/4) Epoch 12, batch 8700, loss[loss=0.1992, simple_loss=0.2409, pruned_loss=0.07876, over 20119.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3095, pruned_loss=0.07639, over 4272463.47 frames. 
], batch size: 704, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:30:40,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2064846.0, ans=0.125 2023-06-25 20:31:07,205 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2064906.0, ans=0.125 2023-06-25 20:31:59,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2065086.0, ans=0.125 2023-06-25 20:32:18,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2065086.0, ans=0.2 2023-06-25 20:32:25,083 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 9.432e+02 1.371e+03 2.158e+03 4.053e+03, threshold=2.743e+03, percent-clipped=19.0 2023-06-25 20:32:25,122 INFO [train.py:996] (1/4) Epoch 12, batch 8750, loss[loss=0.2339, simple_loss=0.2963, pruned_loss=0.08574, over 21806.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3051, pruned_loss=0.07726, over 4278225.15 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:32:43,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 20:34:02,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2065386.0, ans=0.125 2023-06-25 20:34:21,918 INFO [train.py:996] (1/4) Epoch 12, batch 8800, loss[loss=0.1862, simple_loss=0.3011, pruned_loss=0.03568, over 20770.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3152, pruned_loss=0.08021, over 4277655.59 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:35:16,962 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-25 20:36:04,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065686.0, ans=0.1 2023-06-25 20:36:12,852 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.737e+02 1.070e+03 1.502e+03 2.029e+03 4.867e+03, threshold=3.004e+03, percent-clipped=9.0 2023-06-25 20:36:12,885 INFO [train.py:996] (1/4) Epoch 12, batch 8850, loss[loss=0.203, simple_loss=0.2859, pruned_loss=0.06006, over 21603.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.321, pruned_loss=0.0821, over 4281264.43 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:37:09,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2065866.0, ans=0.125 2023-06-25 20:37:44,394 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2065986.0, ans=0.125 2023-06-25 20:38:14,923 INFO [train.py:996] (1/4) Epoch 12, batch 8900, loss[loss=0.2042, simple_loss=0.2674, pruned_loss=0.07047, over 21323.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3162, pruned_loss=0.08078, over 4263933.74 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:38:51,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2066166.0, ans=0.0 2023-06-25 20:39:17,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2066166.0, ans=0.125 2023-06-25 20:39:21,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2066226.0, ans=0.1 2023-06-25 20:39:48,056 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2066286.0, ans=0.0 2023-06-25 20:40:09,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.490e+02 1.490e+03 1.972e+03 6.536e+03, threshold=2.979e+03, percent-clipped=12.0 2023-06-25 20:40:09,324 INFO [train.py:996] (1/4) Epoch 12, batch 8950, loss[loss=0.2618, simple_loss=0.3425, pruned_loss=0.09053, over 21732.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3177, pruned_loss=0.08006, over 4264075.09 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:12,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2066526.0, ans=0.125 2023-06-25 20:41:14,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2066526.0, ans=0.125 2023-06-25 20:41:42,584 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2066586.0, ans=0.125 2023-06-25 20:41:57,055 INFO [train.py:996] (1/4) Epoch 12, batch 9000, loss[loss=0.2208, simple_loss=0.2898, pruned_loss=0.07588, over 21334.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3117, pruned_loss=0.08013, over 4268688.64 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:57,056 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 20:42:15,053 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2658, simple_loss=0.3589, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-25 20:42:15,054 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 20:42:31,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2066646.0, ans=0.2 2023-06-25 20:42:45,633 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2066706.0, ans=0.125 2023-06-25 20:43:54,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2066886.0, ans=0.5 2023-06-25 20:44:02,541 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.409e+02 9.189e+02 1.168e+03 1.657e+03 3.184e+03, threshold=2.336e+03, percent-clipped=2.0 2023-06-25 20:44:02,574 INFO [train.py:996] (1/4) Epoch 12, batch 9050, loss[loss=0.2107, simple_loss=0.294, pruned_loss=0.06374, over 21812.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3056, pruned_loss=0.07681, over 4256482.82 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:45:59,870 INFO [train.py:996] (1/4) Epoch 12, batch 9100, loss[loss=0.2345, simple_loss=0.3308, pruned_loss=0.06908, over 21736.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.311, pruned_loss=0.07886, over 4259750.85 frames. 
], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:47:56,030 INFO [train.py:996] (1/4) Epoch 12, batch 9150, loss[loss=0.2359, simple_loss=0.3351, pruned_loss=0.06835, over 21736.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3144, pruned_loss=0.07633, over 4258630.07 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:47:57,732 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.614e+02 8.747e+02 1.495e+03 2.212e+03 5.275e+03, threshold=2.990e+03, percent-clipped=21.0 2023-06-25 20:47:58,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2067546.0, ans=0.125 2023-06-25 20:48:12,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2067606.0, ans=0.125 2023-06-25 20:48:51,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2067666.0, ans=0.125 2023-06-25 20:49:44,679 INFO [train.py:996] (1/4) Epoch 12, batch 9200, loss[loss=0.2536, simple_loss=0.322, pruned_loss=0.09264, over 21261.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3165, pruned_loss=0.07551, over 4251060.53 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:49:59,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2067846.0, ans=0.125 2023-06-25 20:49:59,545 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-25 20:51:31,772 INFO [train.py:996] (1/4) Epoch 12, batch 9250, loss[loss=0.2167, simple_loss=0.2791, pruned_loss=0.07712, over 21617.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3179, pruned_loss=0.07731, over 4255112.39 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:51:33,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.008e+02 1.206e+03 1.769e+03 2.365e+03 5.380e+03, threshold=3.537e+03, percent-clipped=9.0 2023-06-25 20:52:09,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2068206.0, ans=0.0 2023-06-25 20:52:12,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2068206.0, ans=0.125 2023-06-25 20:52:18,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2068266.0, ans=0.0 2023-06-25 20:52:20,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068266.0, ans=0.125 2023-06-25 20:52:35,209 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. limit=15.0 2023-06-25 20:53:22,069 INFO [train.py:996] (1/4) Epoch 12, batch 9300, loss[loss=0.2276, simple_loss=0.2854, pruned_loss=0.08494, over 21439.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3115, pruned_loss=0.07716, over 4261736.57 frames. 
], batch size: 475, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:53:52,283 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2068506.0, ans=0.0 2023-06-25 20:53:55,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068506.0, ans=0.1 2023-06-25 20:54:29,089 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2068566.0, ans=0.125 2023-06-25 20:55:06,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-25 20:55:18,055 INFO [train.py:996] (1/4) Epoch 12, batch 9350, loss[loss=0.2339, simple_loss=0.3193, pruned_loss=0.07429, over 21901.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3167, pruned_loss=0.07781, over 4262499.77 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:55:19,919 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.869e+02 1.391e+03 2.110e+03 3.228e+03 6.570e+03, threshold=4.220e+03, percent-clipped=18.0 2023-06-25 20:56:10,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2068866.0, ans=0.1 2023-06-25 20:56:21,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2068926.0, ans=0.125 2023-06-25 20:56:50,651 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:56:55,582 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2068986.0, ans=0.125 2023-06-25 20:57:09,366 INFO [train.py:996] (1/4) Epoch 12, batch 9400, loss[loss=0.2111, simple_loss=0.2808, pruned_loss=0.07074, over 21640.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.319, pruned_loss=0.07842, over 4265887.23 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:57:57,926 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-25 20:58:34,177 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2069286.0, ans=0.125 2023-06-25 20:58:57,587 INFO [train.py:996] (1/4) Epoch 12, batch 9450, loss[loss=0.204, simple_loss=0.2705, pruned_loss=0.0687, over 21668.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3118, pruned_loss=0.07729, over 4269304.20 frames. 
], batch size: 333, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:58:59,289 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.200e+02 9.646e+02 1.392e+03 2.055e+03 4.300e+03, threshold=2.785e+03, percent-clipped=2.0 2023-06-25 20:59:14,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2069406.0, ans=0.125 2023-06-25 20:59:43,488 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:59:51,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2069466.0, ans=0.125 2023-06-25 21:00:00,515 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-25 21:00:45,754 INFO [train.py:996] (1/4) Epoch 12, batch 9500, loss[loss=0.1939, simple_loss=0.2599, pruned_loss=0.06391, over 21488.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3061, pruned_loss=0.07553, over 4264706.85 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:01:20,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2069706.0, ans=0.125 2023-06-25 21:01:35,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2069766.0, ans=0.2 2023-06-25 21:01:50,421 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069826.0, ans=0.1 2023-06-25 21:02:37,415 INFO [train.py:996] (1/4) Epoch 12, batch 9550, loss[loss=0.2561, simple_loss=0.33, pruned_loss=0.09106, over 21797.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3076, pruned_loss=0.07693, over 4266073.51 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:02:40,608 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 1.229e+03 2.098e+03 2.991e+03 5.309e+03, threshold=4.197e+03, percent-clipped=32.0 2023-06-25 21:02:42,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2069946.0, ans=0.125 2023-06-25 21:02:46,373 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2069946.0, ans=0.125 2023-06-25 21:02:58,601 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2070006.0, ans=0.0 2023-06-25 21:03:24,432 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2070066.0, ans=0.0 2023-06-25 21:04:28,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070246.0, ans=0.1 2023-06-25 21:04:29,477 INFO [train.py:996] (1/4) Epoch 12, batch 9600, loss[loss=0.2073, simple_loss=0.2818, pruned_loss=0.06641, over 21415.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3105, pruned_loss=0.07917, over 4273490.49 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:04:58,171 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-06-25 21:05:29,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2070366.0, ans=0.125 2023-06-25 21:06:05,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-25 21:06:19,438 INFO [train.py:996] (1/4) Epoch 12, batch 9650, loss[loss=0.2177, simple_loss=0.2962, pruned_loss=0.0696, over 21760.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3089, pruned_loss=0.07854, over 4282294.71 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:06:23,084 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.038e+02 1.081e+03 1.726e+03 2.542e+03 4.912e+03, threshold=3.453e+03, percent-clipped=2.0 2023-06-25 21:06:38,695 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2070546.0, ans=0.125 2023-06-25 21:07:18,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2070666.0, ans=0.1 2023-06-25 21:07:19,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 21:08:03,550 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:08:13,745 INFO [train.py:996] (1/4) Epoch 12, batch 9700, loss[loss=0.2197, simple_loss=0.2873, pruned_loss=0.07609, over 21242.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3126, pruned_loss=0.07938, over 4282654.50 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:08:53,115 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-25 21:08:59,090 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2070966.0, ans=0.0 2023-06-25 21:09:15,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2071026.0, ans=0.1 2023-06-25 21:09:40,278 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2071086.0, ans=0.2 2023-06-25 21:09:45,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2071086.0, ans=0.5 2023-06-25 21:10:01,556 INFO [train.py:996] (1/4) Epoch 12, batch 9750, loss[loss=0.1974, simple_loss=0.2626, pruned_loss=0.0661, over 21715.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3084, pruned_loss=0.07847, over 4268090.10 frames. ], batch size: 299, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:10:04,511 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 1.223e+03 1.776e+03 2.509e+03 4.467e+03, threshold=3.552e+03, percent-clipped=5.0 2023-06-25 21:10:07,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071146.0, ans=0.1 2023-06-25 21:10:07,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-25 21:10:23,023 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2071206.0, ans=0.125 2023-06-25 21:10:31,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2071206.0, ans=0.0 2023-06-25 21:11:43,861 INFO [train.py:996] (1/4) Epoch 12, batch 9800, loss[loss=0.2353, simple_loss=0.3051, pruned_loss=0.08278, over 21903.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3078, pruned_loss=0.0791, over 4261880.45 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:12:24,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2071506.0, ans=0.125 2023-06-25 21:12:28,489 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:12:57,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071626.0, ans=0.1 2023-06-25 21:13:35,368 INFO [train.py:996] (1/4) Epoch 12, batch 9850, loss[loss=0.2287, simple_loss=0.3108, pruned_loss=0.07336, over 15732.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3043, pruned_loss=0.07879, over 4256462.29 frames. ], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:13:44,005 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.466e+02 9.389e+02 1.315e+03 1.721e+03 3.595e+03, threshold=2.631e+03, percent-clipped=2.0 2023-06-25 21:15:06,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-25 21:15:16,950 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2071986.0, ans=0.125 2023-06-25 21:15:19,988 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2071986.0, ans=0.0 2023-06-25 21:15:25,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2072046.0, ans=0.125 2023-06-25 21:15:31,668 INFO [train.py:996] (1/4) Epoch 12, batch 9900, loss[loss=0.2604, simple_loss=0.3432, pruned_loss=0.0888, over 16367.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3011, pruned_loss=0.07812, over 4256117.87 frames. 
], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:15:53,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2072106.0, ans=0.05 2023-06-25 21:16:19,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072166.0, ans=0.1 2023-06-25 21:16:21,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2072166.0, ans=0.125 2023-06-25 21:16:53,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2072226.0, ans=0.0 2023-06-25 21:16:53,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072226.0, ans=0.1 2023-06-25 21:17:07,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2072286.0, ans=0.0 2023-06-25 21:17:16,864 INFO [train.py:996] (1/4) Epoch 12, batch 9950, loss[loss=0.2003, simple_loss=0.2678, pruned_loss=0.06646, over 21608.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3021, pruned_loss=0.0801, over 4255641.19 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:17:25,541 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.617e+02 8.355e+02 1.275e+03 2.214e+03 4.972e+03, threshold=2.550e+03, percent-clipped=15.0 2023-06-25 21:18:04,753 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2072466.0, ans=0.125 2023-06-25 21:18:13,254 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2072466.0, ans=0.125 2023-06-25 21:18:42,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2072526.0, ans=0.0 2023-06-25 21:18:45,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2072526.0, ans=0.2 2023-06-25 21:18:52,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2072586.0, ans=0.2 2023-06-25 21:19:08,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2072586.0, ans=0.09899494936611666 2023-06-25 21:19:16,259 INFO [train.py:996] (1/4) Epoch 12, batch 10000, loss[loss=0.2144, simple_loss=0.2842, pruned_loss=0.07233, over 21759.00 frames. ], tot_loss[loss=0.228, simple_loss=0.2981, pruned_loss=0.07897, over 4252426.30 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 21:19:41,978 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:20:24,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-25 21:21:00,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2072886.0, ans=0.125 2023-06-25 21:21:04,427 INFO [train.py:996] (1/4) Epoch 12, batch 10050, loss[loss=0.2347, simple_loss=0.3131, pruned_loss=0.07818, over 21371.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2999, pruned_loss=0.07928, over 4262075.36 frames. 
], batch size: 549, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:21:16,494 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.782e+02 1.169e+03 1.851e+03 4.390e+03, threshold=2.338e+03, percent-clipped=10.0 2023-06-25 21:21:20,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072946.0, ans=0.1 2023-06-25 21:21:48,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2073066.0, ans=0.125 2023-06-25 21:22:16,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2073126.0, ans=0.2 2023-06-25 21:22:30,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2073186.0, ans=0.125 2023-06-25 21:23:02,171 INFO [train.py:996] (1/4) Epoch 12, batch 10100, loss[loss=0.2566, simple_loss=0.3249, pruned_loss=0.09414, over 21384.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2994, pruned_loss=0.0775, over 4263999.69 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:23:19,615 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 21:23:25,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2073306.0, ans=0.125 2023-06-25 21:24:52,555 INFO [train.py:996] (1/4) Epoch 12, batch 10150, loss[loss=0.2897, simple_loss=0.3552, pruned_loss=0.1122, over 21459.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3051, pruned_loss=0.07974, over 4270526.48 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:24:58,892 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.161e+02 9.851e+02 1.656e+03 2.564e+03 6.129e+03, threshold=3.312e+03, percent-clipped=27.0 2023-06-25 21:25:51,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2073666.0, ans=0.5 2023-06-25 21:26:41,529 INFO [train.py:996] (1/4) Epoch 12, batch 10200, loss[loss=0.2011, simple_loss=0.2691, pruned_loss=0.06657, over 21726.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3043, pruned_loss=0.07775, over 4273738.86 frames. 
], batch size: 112, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:26:43,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2073846.0, ans=0.125 2023-06-25 21:26:57,392 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2073906.0, ans=0.125 2023-06-25 21:27:30,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2073966.0, ans=0.0 2023-06-25 21:27:48,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2074026.0, ans=0.125 2023-06-25 21:27:59,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2074026.0, ans=0.0 2023-06-25 21:28:23,410 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2074086.0, ans=0.1 2023-06-25 21:28:31,298 INFO [train.py:996] (1/4) Epoch 12, batch 10250, loss[loss=0.1824, simple_loss=0.2763, pruned_loss=0.0442, over 21624.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2988, pruned_loss=0.07165, over 4277314.43 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:28:33,784 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2074146.0, ans=0.2 2023-06-25 21:28:38,276 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 7.759e+02 1.163e+03 1.703e+03 3.224e+03, threshold=2.326e+03, percent-clipped=0.0 2023-06-25 21:30:10,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2074386.0, ans=0.125 2023-06-25 21:30:19,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2074386.0, ans=0.0 2023-06-25 21:30:24,267 INFO [train.py:996] (1/4) Epoch 12, batch 10300, loss[loss=0.2499, simple_loss=0.3253, pruned_loss=0.08727, over 21374.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3017, pruned_loss=0.07299, over 4277445.63 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:30:55,447 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 21:31:34,155 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2074566.0, ans=0.125 2023-06-25 21:32:24,455 INFO [train.py:996] (1/4) Epoch 12, batch 10350, loss[loss=0.2015, simple_loss=0.2826, pruned_loss=0.06016, over 20805.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.304, pruned_loss=0.07242, over 4279142.98 frames. 
], batch size: 609, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:32:30,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2074746.0, ans=0.2 2023-06-25 21:32:31,428 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.446e+02 9.833e+02 1.578e+03 2.772e+03 4.867e+03, threshold=3.157e+03, percent-clipped=30.0 2023-06-25 21:33:26,977 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2074866.0, ans=0.0 2023-06-25 21:33:48,808 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:33:50,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2074986.0, ans=0.0 2023-06-25 21:34:19,099 INFO [train.py:996] (1/4) Epoch 12, batch 10400, loss[loss=0.1741, simple_loss=0.2374, pruned_loss=0.0554, over 21485.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2988, pruned_loss=0.07177, over 4266156.52 frames. ], batch size: 212, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:35:27,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075226.0, ans=0.1 2023-06-25 21:35:39,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2075226.0, ans=0.0 2023-06-25 21:35:58,973 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2075286.0, ans=0.015 2023-06-25 21:36:10,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2075286.0, ans=0.125 2023-06-25 21:36:14,881 INFO [train.py:996] (1/4) Epoch 12, batch 10450, loss[loss=0.3137, simple_loss=0.3886, pruned_loss=0.1194, over 21521.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3025, pruned_loss=0.07501, over 4261826.60 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:36:19,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2075346.0, ans=0.0 2023-06-25 21:36:22,631 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.406e+02 1.055e+03 1.797e+03 3.015e+03 7.446e+03, threshold=3.594e+03, percent-clipped=22.0 2023-06-25 21:36:46,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2075406.0, ans=0.125 2023-06-25 21:37:17,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2075466.0, ans=0.0 2023-06-25 21:38:00,983 INFO [train.py:996] (1/4) Epoch 12, batch 10500, loss[loss=0.2052, simple_loss=0.2669, pruned_loss=0.07172, over 21418.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3032, pruned_loss=0.07342, over 4259108.93 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:39:53,029 INFO [train.py:996] (1/4) Epoch 12, batch 10550, loss[loss=0.2086, simple_loss=0.2604, pruned_loss=0.07843, over 21148.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2978, pruned_loss=0.07331, over 4257133.27 frames. 
], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:40:05,046 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 1.039e+03 1.480e+03 2.237e+03 5.696e+03, threshold=2.960e+03, percent-clipped=5.0 2023-06-25 21:41:40,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2076186.0, ans=0.125 2023-06-25 21:41:44,551 INFO [train.py:996] (1/4) Epoch 12, batch 10600, loss[loss=0.1795, simple_loss=0.2539, pruned_loss=0.0526, over 21423.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2933, pruned_loss=0.07265, over 4255441.71 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:41:52,630 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.81 vs. limit=6.0 2023-06-25 21:42:08,275 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2076306.0, ans=0.2 2023-06-25 21:42:25,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2076306.0, ans=0.5 2023-06-25 21:43:06,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2076426.0, ans=0.2 2023-06-25 21:43:08,750 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-25 21:43:11,905 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:43:14,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2076426.0, ans=0.0 2023-06-25 21:43:41,769 INFO [train.py:996] (1/4) Epoch 12, batch 10650, loss[loss=0.1975, simple_loss=0.2887, pruned_loss=0.05316, over 21620.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2982, pruned_loss=0.07178, over 4258538.81 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:43:48,913 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.385e+02 8.599e+02 1.360e+03 2.205e+03 4.639e+03, threshold=2.719e+03, percent-clipped=13.0 2023-06-25 21:44:06,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2076606.0, ans=0.2 2023-06-25 21:44:51,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2076726.0, ans=0.125 2023-06-25 21:44:54,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2076726.0, ans=0.125 2023-06-25 21:45:08,517 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2076786.0, ans=0.0 2023-06-25 21:45:32,602 INFO [train.py:996] (1/4) Epoch 12, batch 10700, loss[loss=0.1817, simple_loss=0.2601, pruned_loss=0.05164, over 21631.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2969, pruned_loss=0.07259, over 4260894.06 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:45:48,821 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.75 vs. 
limit=22.5 2023-06-25 21:45:56,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-25 21:46:03,168 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=22.5 2023-06-25 21:46:45,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2077026.0, ans=0.0 2023-06-25 21:46:50,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2077026.0, ans=0.125 2023-06-25 21:46:52,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077026.0, ans=0.1 2023-06-25 21:47:29,866 INFO [train.py:996] (1/4) Epoch 12, batch 10750, loss[loss=0.2413, simple_loss=0.321, pruned_loss=0.08081, over 21379.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.308, pruned_loss=0.07669, over 4264574.30 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:47:36,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 8.331e+02 1.141e+03 1.485e+03 5.663e+03, threshold=2.282e+03, percent-clipped=4.0 2023-06-25 21:47:39,694 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-25 21:48:30,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077266.0, ans=0.1 2023-06-25 21:48:39,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077326.0, ans=0.125 2023-06-25 21:48:43,683 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2077326.0, ans=0.125 2023-06-25 21:48:50,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2077326.0, ans=0.0 2023-06-25 21:49:03,814 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2077386.0, ans=0.125 2023-06-25 21:49:20,817 INFO [train.py:996] (1/4) Epoch 12, batch 10800, loss[loss=0.2296, simple_loss=0.3083, pruned_loss=0.07538, over 21908.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.313, pruned_loss=0.07683, over 4272290.92 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 21:49:23,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2077446.0, ans=0.125 2023-06-25 21:49:43,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2077446.0, ans=0.125 2023-06-25 21:50:50,146 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2077626.0, ans=0.1 2023-06-25 21:51:15,797 INFO [train.py:996] (1/4) Epoch 12, batch 10850, loss[loss=0.2205, simple_loss=0.285, pruned_loss=0.07799, over 21295.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3146, pruned_loss=0.07715, over 4273179.08 frames. 
], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:51:31,521 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 1.105e+03 1.666e+03 2.535e+03 5.598e+03, threshold=3.333e+03, percent-clipped=30.0 2023-06-25 21:51:32,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2077746.0, ans=0.1 2023-06-25 21:51:59,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2077866.0, ans=0.125 2023-06-25 21:51:59,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=15.0 2023-06-25 21:52:10,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=15.0 2023-06-25 21:52:37,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2077926.0, ans=0.1 2023-06-25 21:53:04,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2077986.0, ans=0.0 2023-06-25 21:53:16,015 INFO [train.py:996] (1/4) Epoch 12, batch 10900, loss[loss=0.2287, simple_loss=0.3126, pruned_loss=0.07241, over 21659.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3067, pruned_loss=0.07502, over 4265817.11 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:53:26,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2078046.0, ans=0.035 2023-06-25 21:54:30,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2078226.0, ans=0.125 2023-06-25 21:55:00,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-25 21:55:00,778 INFO [train.py:996] (1/4) Epoch 12, batch 10950, loss[loss=0.237, simple_loss=0.3009, pruned_loss=0.08655, over 21470.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.303, pruned_loss=0.07339, over 4266655.53 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:55:05,134 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.67 vs. 
limit=15.0 2023-06-25 21:55:06,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2078346.0, ans=0.07 2023-06-25 21:55:18,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 8.397e+02 1.288e+03 2.017e+03 5.203e+03, threshold=2.576e+03, percent-clipped=6.0 2023-06-25 21:55:21,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2078346.0, ans=0.1 2023-06-25 21:55:27,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2078406.0, ans=0.125 2023-06-25 21:56:04,483 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2078526.0, ans=0.2 2023-06-25 21:56:49,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2078586.0, ans=0.125 2023-06-25 21:56:51,900 INFO [train.py:996] (1/4) Epoch 12, batch 11000, loss[loss=0.2111, simple_loss=0.2768, pruned_loss=0.0727, over 21581.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3024, pruned_loss=0.07303, over 4259753.74 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:57:08,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-25 21:57:11,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2078646.0, ans=0.0 2023-06-25 21:57:24,639 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. limit=15.0 2023-06-25 21:57:34,054 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2078706.0, ans=0.04949747468305833 2023-06-25 21:58:05,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-25 21:58:39,772 INFO [train.py:996] (1/4) Epoch 12, batch 11050, loss[loss=0.2385, simple_loss=0.2926, pruned_loss=0.09216, over 21841.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2999, pruned_loss=0.07451, over 4265197.52 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:58:57,596 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.227e+02 8.641e+02 1.192e+03 1.708e+03 4.413e+03, threshold=2.383e+03, percent-clipped=10.0 2023-06-25 21:59:32,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2079066.0, ans=0.125 2023-06-25 21:59:52,958 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-25 22:00:30,694 INFO [train.py:996] (1/4) Epoch 12, batch 11100, loss[loss=0.2435, simple_loss=0.3046, pruned_loss=0.09123, over 21888.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2984, pruned_loss=0.07525, over 4270351.07 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:01:47,747 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. 
limit=22.5 2023-06-25 22:01:48,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2079426.0, ans=0.04949747468305833 2023-06-25 22:02:04,537 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2079486.0, ans=0.125 2023-06-25 22:02:14,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2079486.0, ans=0.125 2023-06-25 22:02:21,211 INFO [train.py:996] (1/4) Epoch 12, batch 11150, loss[loss=0.2129, simple_loss=0.2869, pruned_loss=0.06947, over 21617.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2947, pruned_loss=0.07421, over 4259816.29 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:02:38,387 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.021e+02 8.746e+02 1.220e+03 1.810e+03 4.073e+03, threshold=2.441e+03, percent-clipped=6.0 2023-06-25 22:02:41,101 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2079546.0, ans=0.0 2023-06-25 22:04:09,074 INFO [train.py:996] (1/4) Epoch 12, batch 11200, loss[loss=0.2597, simple_loss=0.3026, pruned_loss=0.1084, over 21299.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2933, pruned_loss=0.0741, over 4264800.09 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:05:12,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2079966.0, ans=0.2 2023-06-25 22:05:56,529 INFO [train.py:996] (1/4) Epoch 12, batch 11250, loss[loss=0.2304, simple_loss=0.3196, pruned_loss=0.07057, over 21604.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2927, pruned_loss=0.07416, over 4254698.02 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:05:57,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2080146.0, ans=0.125 2023-06-25 22:06:00,727 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2080146.0, ans=0.0 2023-06-25 22:06:00,731 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2080146.0, ans=0.5 2023-06-25 22:06:05,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2080146.0, ans=0.125 2023-06-25 22:06:14,388 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 9.018e+02 1.214e+03 1.830e+03 3.568e+03, threshold=2.429e+03, percent-clipped=9.0 2023-06-25 22:06:24,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2080206.0, ans=0.0 2023-06-25 22:06:54,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2080266.0, ans=0.125 2023-06-25 22:07:17,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2080326.0, ans=0.1 2023-06-25 22:07:37,708 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2080386.0, ans=0.125 2023-06-25 22:07:46,557 INFO [train.py:996] (1/4) Epoch 12, batch 11300, loss[loss=0.2028, simple_loss=0.2849, pruned_loss=0.06031, over 21825.00 frames. 
], tot_loss[loss=0.2208, simple_loss=0.2938, pruned_loss=0.07391, over 4266471.35 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:08:14,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080506.0, ans=0.1 2023-06-25 22:08:49,560 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.80 vs. limit=12.0 2023-06-25 22:08:50,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2080566.0, ans=0.125 2023-06-25 22:09:28,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2080686.0, ans=0.125 2023-06-25 22:09:29,826 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2080686.0, ans=0.2 2023-06-25 22:09:31,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2080686.0, ans=0.2 2023-06-25 22:09:40,458 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2080746.0, ans=10.0 2023-06-25 22:09:41,338 INFO [train.py:996] (1/4) Epoch 12, batch 11350, loss[loss=0.2345, simple_loss=0.3197, pruned_loss=0.0746, over 21813.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2947, pruned_loss=0.07357, over 4266114.85 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:09:42,117 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080746.0, ans=0.1 2023-06-25 22:09:54,134 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 9.372e+02 1.256e+03 1.751e+03 3.011e+03, threshold=2.512e+03, percent-clipped=9.0 2023-06-25 22:10:48,355 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2080926.0, ans=0.125 2023-06-25 22:11:36,916 INFO [train.py:996] (1/4) Epoch 12, batch 11400, loss[loss=0.2489, simple_loss=0.3302, pruned_loss=0.08377, over 20113.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3008, pruned_loss=0.07603, over 4261666.06 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:12:03,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2081106.0, ans=0.125 2023-06-25 22:12:09,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2081106.0, ans=0.2 2023-06-25 22:12:15,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-25 22:12:25,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2081166.0, ans=0.0 2023-06-25 22:13:25,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081346.0, ans=0.1 2023-06-25 22:13:26,720 INFO [train.py:996] (1/4) Epoch 12, batch 11450, loss[loss=0.2215, simple_loss=0.2931, pruned_loss=0.07498, over 21254.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3025, pruned_loss=0.07512, over 4267988.09 frames. 
], batch size: 176, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:13:33,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081346.0, ans=0.1 2023-06-25 22:13:35,005 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2081346.0, ans=0.125 2023-06-25 22:13:46,260 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.691e+02 8.654e+02 1.291e+03 1.959e+03 4.523e+03, threshold=2.583e+03, percent-clipped=10.0 2023-06-25 22:14:34,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-25 22:14:35,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2023-06-25 22:14:39,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2081526.0, ans=0.0 2023-06-25 22:14:50,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2081526.0, ans=0.125 2023-06-25 22:14:59,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2081586.0, ans=0.05 2023-06-25 22:15:15,778 INFO [train.py:996] (1/4) Epoch 12, batch 11500, loss[loss=0.2759, simple_loss=0.3452, pruned_loss=0.1033, over 21414.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3049, pruned_loss=0.07625, over 4271215.86 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:15:21,846 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2081646.0, ans=0.2 2023-06-25 22:15:41,222 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-06-25 22:15:56,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2081706.0, ans=0.0 2023-06-25 22:17:10,044 INFO [train.py:996] (1/4) Epoch 12, batch 11550, loss[loss=0.3011, simple_loss=0.4047, pruned_loss=0.09873, over 21763.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3105, pruned_loss=0.07609, over 4272664.97 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:17:17,918 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2081946.0, ans=0.125 2023-06-25 22:17:30,009 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.895e+02 1.016e+03 1.404e+03 2.173e+03 5.477e+03, threshold=2.808e+03, percent-clipped=17.0 2023-06-25 22:17:55,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2082006.0, ans=0.0 2023-06-25 22:17:57,655 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.71 vs. 
limit=12.0 2023-06-25 22:18:12,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2082066.0, ans=0.125 2023-06-25 22:19:08,993 INFO [train.py:996] (1/4) Epoch 12, batch 11600, loss[loss=0.2476, simple_loss=0.3222, pruned_loss=0.0865, over 21842.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3266, pruned_loss=0.07849, over 4269317.87 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:20:01,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2082366.0, ans=0.125 2023-06-25 22:20:04,418 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:20:58,724 INFO [train.py:996] (1/4) Epoch 12, batch 11650, loss[loss=0.2739, simple_loss=0.3524, pruned_loss=0.09771, over 21850.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3332, pruned_loss=0.07953, over 4261112.45 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:21:11,361 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082546.0, ans=0.1 2023-06-25 22:21:17,676 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.996e+02 1.003e+03 1.550e+03 2.140e+03 4.406e+03, threshold=3.101e+03, percent-clipped=13.0 2023-06-25 22:21:54,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2082666.0, ans=0.0 2023-06-25 22:22:47,746 INFO [train.py:996] (1/4) Epoch 12, batch 11700, loss[loss=0.2194, simple_loss=0.282, pruned_loss=0.07841, over 21828.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3251, pruned_loss=0.07936, over 4258232.59 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:23:17,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082906.0, ans=0.1 2023-06-25 22:23:58,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2083026.0, ans=0.0 2023-06-25 22:24:38,055 INFO [train.py:996] (1/4) Epoch 12, batch 11750, loss[loss=0.2389, simple_loss=0.3004, pruned_loss=0.08872, over 21315.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3155, pruned_loss=0.07844, over 4259257.88 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:24:57,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.375e+02 1.055e+03 1.810e+03 2.598e+03 5.314e+03, threshold=3.620e+03, percent-clipped=16.0 2023-06-25 22:25:34,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-25 22:25:38,548 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:25:40,382 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2083326.0, ans=0.125 2023-06-25 22:26:32,984 INFO [train.py:996] (1/4) Epoch 12, batch 11800, loss[loss=0.2211, simple_loss=0.3243, pruned_loss=0.0589, over 21746.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3167, pruned_loss=0.08026, over 4269090.05 frames. 
], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:27:14,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2083566.0, ans=0.09899494936611666 2023-06-25 22:27:15,812 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2083566.0, ans=0.125 2023-06-25 22:27:19,108 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2083566.0, ans=0.125 2023-06-25 22:27:22,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2083566.0, ans=0.1 2023-06-25 22:27:50,027 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2083626.0, ans=0.0 2023-06-25 22:28:23,927 INFO [train.py:996] (1/4) Epoch 12, batch 11850, loss[loss=0.2289, simple_loss=0.3042, pruned_loss=0.07678, over 21919.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3161, pruned_loss=0.07962, over 4274635.83 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:28:27,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2083746.0, ans=0.1 2023-06-25 22:28:37,198 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.904e+02 9.109e+02 1.402e+03 2.284e+03 4.807e+03, threshold=2.803e+03, percent-clipped=4.0 2023-06-25 22:28:38,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2083746.0, ans=0.125 2023-06-25 22:28:50,725 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2083806.0, ans=0.125 2023-06-25 22:29:14,097 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.23 vs. limit=12.0 2023-06-25 22:29:21,872 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2083926.0, ans=0.125 2023-06-25 22:29:32,383 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2083926.0, ans=0.1 2023-06-25 22:30:05,987 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-25 22:30:13,549 INFO [train.py:996] (1/4) Epoch 12, batch 11900, loss[loss=0.2352, simple_loss=0.324, pruned_loss=0.07317, over 21602.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3169, pruned_loss=0.07818, over 4275048.07 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:30:19,965 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2084046.0, ans=0.2 2023-06-25 22:30:35,833 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-25 22:30:55,213 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=15.0 2023-06-25 22:31:03,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2084166.0, ans=0.125 2023-06-25 22:32:05,344 INFO [train.py:996] (1/4) Epoch 12, batch 11950, loss[loss=0.1843, simple_loss=0.2643, pruned_loss=0.05213, over 21592.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3154, pruned_loss=0.07466, over 4275200.69 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:32:16,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2084346.0, ans=0.125 2023-06-25 22:32:24,569 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.025e+02 8.288e+02 1.263e+03 1.955e+03 4.768e+03, threshold=2.526e+03, percent-clipped=7.0 2023-06-25 22:33:54,626 INFO [train.py:996] (1/4) Epoch 12, batch 12000, loss[loss=0.2314, simple_loss=0.298, pruned_loss=0.08243, over 21975.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3094, pruned_loss=0.07227, over 4269646.41 frames. ], batch size: 103, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 22:33:54,626 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-25 22:34:17,922 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2583, simple_loss=0.3504, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-25 22:34:17,923 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-25 22:34:18,342 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2084646.0, ans=0.0 2023-06-25 22:34:30,920 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:34:33,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2084706.0, ans=0.125 2023-06-25 22:34:34,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2084706.0, ans=0.2 2023-06-25 22:35:08,684 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2084766.0, ans=0.125 2023-06-25 22:35:20,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2084826.0, ans=0.0 2023-06-25 22:36:03,704 INFO [train.py:996] (1/4) Epoch 12, batch 12050, loss[loss=0.2502, simple_loss=0.3746, pruned_loss=0.06286, over 20737.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3061, pruned_loss=0.07367, over 4268705.45 frames. 
], batch size: 607, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:36:12,924 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2084946.0, ans=0.2 2023-06-25 22:36:17,772 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2084946.0, ans=0.125 2023-06-25 22:36:18,984 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.043e+02 1.290e+03 1.841e+03 5.321e+03, threshold=2.580e+03, percent-clipped=14.0 2023-06-25 22:36:46,025 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2085066.0, ans=0.125 2023-06-25 22:37:08,803 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085126.0, ans=0.0 2023-06-25 22:37:13,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-25 22:37:47,025 INFO [train.py:996] (1/4) Epoch 12, batch 12100, loss[loss=0.2401, simple_loss=0.3218, pruned_loss=0.0792, over 21893.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3108, pruned_loss=0.07788, over 4274436.00 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:38:50,674 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2085426.0, ans=0.1 2023-06-25 22:38:54,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2085426.0, ans=0.125 2023-06-25 22:39:17,886 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2085486.0, ans=0.07 2023-06-25 22:39:20,038 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2085486.0, ans=0.125 2023-06-25 22:39:33,570 INFO [train.py:996] (1/4) Epoch 12, batch 12150, loss[loss=0.2054, simple_loss=0.298, pruned_loss=0.05638, over 21582.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.315, pruned_loss=0.07803, over 4276368.35 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:40:00,401 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.100e+02 9.375e+02 1.398e+03 2.052e+03 3.851e+03, threshold=2.797e+03, percent-clipped=12.0 2023-06-25 22:40:47,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2085726.0, ans=0.125 2023-06-25 22:41:26,890 INFO [train.py:996] (1/4) Epoch 12, batch 12200, loss[loss=0.2398, simple_loss=0.2963, pruned_loss=0.09167, over 21635.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.312, pruned_loss=0.0767, over 4264013.31 frames. 
], batch size: 231, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:42:21,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2085966.0, ans=0.125 2023-06-25 22:42:24,996 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2085966.0, ans=0.0 2023-06-25 22:42:50,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2086086.0, ans=0.2 2023-06-25 22:42:54,206 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-25 22:43:08,648 INFO [train.py:996] (1/4) Epoch 12, batch 12250, loss[loss=0.1512, simple_loss=0.2344, pruned_loss=0.03397, over 21302.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3038, pruned_loss=0.07434, over 4267146.75 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:43:09,277 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2086146.0, ans=0.125 2023-06-25 22:43:24,138 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 9.164e+02 1.305e+03 1.876e+03 4.319e+03, threshold=2.609e+03, percent-clipped=7.0 2023-06-25 22:43:31,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2086206.0, ans=0.125 2023-06-25 22:44:23,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2086326.0, ans=0.0 2023-06-25 22:44:23,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2086326.0, ans=0.0 2023-06-25 22:44:56,866 INFO [train.py:996] (1/4) Epoch 12, batch 12300, loss[loss=0.2053, simple_loss=0.2996, pruned_loss=0.05548, over 21739.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2965, pruned_loss=0.06857, over 4264585.08 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:45:01,135 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2086446.0, ans=0.0 2023-06-25 22:45:02,548 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2086446.0, ans=0.0 2023-06-25 22:45:38,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2086566.0, ans=0.0 2023-06-25 22:46:39,957 INFO [train.py:996] (1/4) Epoch 12, batch 12350, loss[loss=0.2373, simple_loss=0.3208, pruned_loss=0.07692, over 21911.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2987, pruned_loss=0.06833, over 4268857.41 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:47:01,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2086806.0, ans=0.04949747468305833 2023-06-25 22:47:02,273 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.943e+02 9.460e+02 1.568e+03 2.170e+03 4.986e+03, threshold=3.136e+03, percent-clipped=17.0 2023-06-25 22:48:33,554 INFO [train.py:996] (1/4) Epoch 12, batch 12400, loss[loss=0.2869, simple_loss=0.406, pruned_loss=0.08388, over 19826.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3031, pruned_loss=0.07265, over 4280533.25 frames. 
], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:48:42,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2087046.0, ans=0.0 2023-06-25 22:49:15,536 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2087166.0, ans=0.1 2023-06-25 22:49:22,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2087166.0, ans=0.0 2023-06-25 22:50:21,725 INFO [train.py:996] (1/4) Epoch 12, batch 12450, loss[loss=0.2499, simple_loss=0.3247, pruned_loss=0.08749, over 21768.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.307, pruned_loss=0.07539, over 4285556.03 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:50:32,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2087346.0, ans=0.125 2023-06-25 22:50:45,397 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.188e+02 9.401e+02 1.561e+03 2.264e+03 4.297e+03, threshold=3.121e+03, percent-clipped=10.0 2023-06-25 22:51:05,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2087406.0, ans=0.125 2023-06-25 22:51:27,553 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2087526.0, ans=0.1 2023-06-25 22:51:55,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2087586.0, ans=0.125 2023-06-25 22:52:09,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2087586.0, ans=0.125 2023-06-25 22:52:16,029 INFO [train.py:996] (1/4) Epoch 12, batch 12500, loss[loss=0.2615, simple_loss=0.355, pruned_loss=0.08396, over 21301.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3171, pruned_loss=0.07839, over 4290473.66 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:52:23,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2087646.0, ans=0.125 2023-06-25 22:52:44,861 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.91 vs. limit=10.0 2023-06-25 22:53:14,120 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2087766.0, ans=0.09899494936611666 2023-06-25 22:53:35,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2087826.0, ans=0.125 2023-06-25 22:54:06,687 INFO [train.py:996] (1/4) Epoch 12, batch 12550, loss[loss=0.2531, simple_loss=0.3257, pruned_loss=0.09026, over 21970.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3224, pruned_loss=0.08111, over 4285671.08 frames. 
], batch size: 317, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:54:26,274 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2087946.0, ans=0.125 2023-06-25 22:54:32,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.421e+02 9.265e+02 1.193e+03 1.894e+03 3.118e+03, threshold=2.386e+03, percent-clipped=0.0 2023-06-25 22:55:09,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 22:55:11,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2088066.0, ans=0.0 2023-06-25 22:55:38,330 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.65 vs. limit=15.0 2023-06-25 22:56:01,631 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.62 vs. limit=6.0 2023-06-25 22:56:05,541 INFO [train.py:996] (1/4) Epoch 12, batch 12600, loss[loss=0.2621, simple_loss=0.3505, pruned_loss=0.08682, over 21587.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3227, pruned_loss=0.07917, over 4289991.09 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:56:12,030 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-25 22:56:14,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2088246.0, ans=0.2 2023-06-25 22:56:50,055 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 22:57:51,805 INFO [train.py:996] (1/4) Epoch 12, batch 12650, loss[loss=0.216, simple_loss=0.2744, pruned_loss=0.07878, over 17102.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3151, pruned_loss=0.07537, over 4282372.59 frames. ], batch size: 60, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:58:08,915 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 8.577e+02 1.170e+03 1.835e+03 4.573e+03, threshold=2.341e+03, percent-clipped=16.0 2023-06-25 22:58:24,921 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2088606.0, ans=0.125 2023-06-25 22:58:42,443 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 22:59:11,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2088726.0, ans=0.0 2023-06-25 22:59:13,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2088726.0, ans=0.0 2023-06-25 22:59:40,842 INFO [train.py:996] (1/4) Epoch 12, batch 12700, loss[loss=0.2537, simple_loss=0.327, pruned_loss=0.09016, over 21864.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3141, pruned_loss=0.07785, over 4285833.17 frames. 
], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:59:58,823 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2088846.0, ans=0.125 2023-06-25 22:59:59,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-25 23:00:01,159 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.62 vs. limit=22.5 2023-06-25 23:00:21,093 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 23:00:35,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2088966.0, ans=0.0 2023-06-25 23:01:19,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2089086.0, ans=0.125 2023-06-25 23:01:22,594 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2089086.0, ans=0.125 2023-06-25 23:01:30,765 INFO [train.py:996] (1/4) Epoch 12, batch 12750, loss[loss=0.2327, simple_loss=0.3073, pruned_loss=0.07906, over 21397.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3157, pruned_loss=0.07817, over 4283949.42 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:01:35,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2089146.0, ans=0.0 2023-06-25 23:01:51,857 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.146e+02 1.295e+03 2.167e+03 3.972e+03, threshold=2.590e+03, percent-clipped=20.0 2023-06-25 23:02:15,112 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2089266.0, ans=0.0 2023-06-25 23:02:48,444 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-25 23:02:49,358 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2089326.0, ans=0.2 2023-06-25 23:03:18,705 INFO [train.py:996] (1/4) Epoch 12, batch 12800, loss[loss=0.2543, simple_loss=0.3243, pruned_loss=0.09214, over 21841.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3147, pruned_loss=0.07921, over 4291459.87 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:03:22,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2089446.0, ans=0.0 2023-06-25 23:03:33,879 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-25 23:04:34,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-25 23:04:55,602 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2089686.0, ans=0.0 2023-06-25 23:05:15,823 INFO [train.py:996] (1/4) Epoch 12, batch 12850, loss[loss=0.1859, simple_loss=0.2868, pruned_loss=0.04246, over 21684.00 frames. 
], tot_loss[loss=0.2389, simple_loss=0.3165, pruned_loss=0.08068, over 4295004.80 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:05:23,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2089746.0, ans=0.0 2023-06-25 23:05:24,070 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2089746.0, ans=0.0 2023-06-25 23:05:41,771 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.120e+02 9.253e+02 1.228e+03 1.682e+03 4.174e+03, threshold=2.456e+03, percent-clipped=9.0 2023-06-25 23:06:00,843 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:06:06,684 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=15.0 2023-06-25 23:06:07,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2089866.0, ans=0.0 2023-06-25 23:06:07,929 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089866.0, ans=0.1 2023-06-25 23:06:09,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2089866.0, ans=0.2 2023-06-25 23:06:35,207 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-25 23:07:15,433 INFO [train.py:996] (1/4) Epoch 12, batch 12900, loss[loss=0.2174, simple_loss=0.3036, pruned_loss=0.06563, over 21682.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3126, pruned_loss=0.07695, over 4286967.44 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:07:45,322 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-25 23:09:05,336 INFO [train.py:996] (1/4) Epoch 12, batch 12950, loss[loss=0.2777, simple_loss=0.3526, pruned_loss=0.1014, over 21730.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3103, pruned_loss=0.0755, over 4281649.40 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:09:30,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 9.455e+02 1.379e+03 2.078e+03 4.999e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 23:09:38,211 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-25 23:09:48,370 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-25 23:10:14,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-25 23:10:23,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2090526.0, ans=0.125 2023-06-25 23:10:28,529 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. 
limit=10.0 2023-06-25 23:11:00,093 INFO [train.py:996] (1/4) Epoch 12, batch 13000, loss[loss=0.1822, simple_loss=0.2669, pruned_loss=0.0487, over 21627.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.31, pruned_loss=0.07556, over 4276732.34 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:11:39,218 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. limit=10.0 2023-06-25 23:11:45,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2090766.0, ans=0.0 2023-06-25 23:12:41,972 INFO [train.py:996] (1/4) Epoch 12, batch 13050, loss[loss=0.2189, simple_loss=0.2928, pruned_loss=0.07252, over 21306.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3051, pruned_loss=0.07336, over 4282051.26 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:13:06,373 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.723e+02 9.569e+02 1.242e+03 1.863e+03 3.264e+03, threshold=2.484e+03, percent-clipped=2.0 2023-06-25 23:14:38,572 INFO [train.py:996] (1/4) Epoch 12, batch 13100, loss[loss=0.2331, simple_loss=0.3137, pruned_loss=0.07619, over 21372.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3069, pruned_loss=0.07356, over 4284149.46 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:14:44,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2091246.0, ans=0.2 2023-06-25 23:15:35,068 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-25 23:15:54,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2091426.0, ans=0.125 2023-06-25 23:16:12,279 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2091486.0, ans=10.0 2023-06-25 23:16:24,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2091486.0, ans=0.2 2023-06-25 23:16:31,069 INFO [train.py:996] (1/4) Epoch 12, batch 13150, loss[loss=0.2328, simple_loss=0.3071, pruned_loss=0.07926, over 21846.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3089, pruned_loss=0.07571, over 4276647.44 frames. ], batch size: 372, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:16:42,238 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2091546.0, ans=0.125 2023-06-25 23:16:55,952 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 8.745e+02 1.410e+03 2.124e+03 5.219e+03, threshold=2.820e+03, percent-clipped=16.0 2023-06-25 23:17:33,475 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2091666.0, ans=0.125 2023-06-25 23:17:36,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2091666.0, ans=0.125 2023-06-25 23:18:28,218 INFO [train.py:996] (1/4) Epoch 12, batch 13200, loss[loss=0.2328, simple_loss=0.3113, pruned_loss=0.07713, over 21733.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.309, pruned_loss=0.07577, over 4273337.07 frames. 
], batch size: 332, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:19:01,662 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2091906.0, ans=0.0 2023-06-25 23:19:15,220 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:19:37,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2092026.0, ans=0.125 2023-06-25 23:20:10,011 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 23:20:10,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092086.0, ans=0.1 2023-06-25 23:20:15,616 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2092146.0, ans=0.125 2023-06-25 23:20:16,893 INFO [train.py:996] (1/4) Epoch 12, batch 13250, loss[loss=0.2994, simple_loss=0.4448, pruned_loss=0.07699, over 19633.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3121, pruned_loss=0.07771, over 4261984.35 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:20:45,175 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.712e+02 9.106e+02 1.656e+03 2.561e+03 5.361e+03, threshold=3.313e+03, percent-clipped=20.0 2023-06-25 23:20:45,724 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2092206.0, ans=0.125 2023-06-25 23:20:57,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2092206.0, ans=0.2 2023-06-25 23:21:02,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2092266.0, ans=0.1 2023-06-25 23:21:26,945 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2092326.0, ans=0.125 2023-06-25 23:21:32,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-25 23:22:12,986 INFO [train.py:996] (1/4) Epoch 12, batch 13300, loss[loss=0.2637, simple_loss=0.3512, pruned_loss=0.08814, over 21859.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3143, pruned_loss=0.07777, over 4263693.68 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:22:18,644 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2092446.0, ans=0.2 2023-06-25 23:24:00,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2092746.0, ans=0.0 2023-06-25 23:24:01,787 INFO [train.py:996] (1/4) Epoch 12, batch 13350, loss[loss=0.2143, simple_loss=0.279, pruned_loss=0.07479, over 16289.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3208, pruned_loss=0.08152, over 4261943.58 frames. 
], batch size: 61, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:24:36,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2092806.0, ans=0.0 2023-06-25 23:24:37,308 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.381e+02 8.931e+02 1.432e+03 2.029e+03 4.000e+03, threshold=2.864e+03, percent-clipped=9.0 2023-06-25 23:25:14,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2092926.0, ans=0.2 2023-06-25 23:25:35,207 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2092986.0, ans=0.0 2023-06-25 23:25:51,932 INFO [train.py:996] (1/4) Epoch 12, batch 13400, loss[loss=0.242, simple_loss=0.3221, pruned_loss=0.08097, over 21857.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.321, pruned_loss=0.08304, over 4271958.35 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:27:00,132 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-25 23:27:14,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093226.0, ans=0.1 2023-06-25 23:27:30,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.75 vs. limit=15.0 2023-06-25 23:27:50,032 INFO [train.py:996] (1/4) Epoch 12, batch 13450, loss[loss=0.2058, simple_loss=0.2651, pruned_loss=0.07321, over 21332.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.322, pruned_loss=0.08542, over 4277229.50 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:28:12,755 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2093406.0, ans=0.125 2023-06-25 23:28:18,593 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+02 9.284e+02 1.216e+03 1.797e+03 3.595e+03, threshold=2.431e+03, percent-clipped=4.0 2023-06-25 23:28:20,874 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2093406.0, ans=0.125 2023-06-25 23:28:20,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2093406.0, ans=0.0 2023-06-25 23:28:39,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2093466.0, ans=0.2 2023-06-25 23:29:32,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2093586.0, ans=0.2 2023-06-25 23:29:47,953 INFO [train.py:996] (1/4) Epoch 12, batch 13500, loss[loss=0.2475, simple_loss=0.3251, pruned_loss=0.08492, over 21731.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3144, pruned_loss=0.08264, over 4264593.38 frames. 
], batch size: 391, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:30:12,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2093706.0, ans=0.125 2023-06-25 23:30:23,694 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2093706.0, ans=0.2 2023-06-25 23:30:29,390 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2093766.0, ans=0.125 2023-06-25 23:30:57,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2093826.0, ans=0.125 2023-06-25 23:31:38,025 INFO [train.py:996] (1/4) Epoch 12, batch 13550, loss[loss=0.1705, simple_loss=0.2429, pruned_loss=0.04899, over 21712.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3184, pruned_loss=0.08167, over 4258302.00 frames. ], batch size: 112, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:31:38,653 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2093946.0, ans=0.2 2023-06-25 23:31:50,905 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2093946.0, ans=0.04949747468305833 2023-06-25 23:32:07,683 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.160e+02 9.861e+02 1.411e+03 2.332e+03 4.219e+03, threshold=2.822e+03, percent-clipped=19.0 2023-06-25 23:33:01,494 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2094126.0, ans=0.125 2023-06-25 23:33:24,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-25 23:33:29,554 INFO [train.py:996] (1/4) Epoch 12, batch 13600, loss[loss=0.2356, simple_loss=0.3158, pruned_loss=0.07771, over 21864.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3195, pruned_loss=0.08221, over 4264158.97 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:33:43,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2094246.0, ans=0.0 2023-06-25 23:33:52,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2094306.0, ans=0.2 2023-06-25 23:33:58,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2094306.0, ans=0.125 2023-06-25 23:34:07,124 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-25 23:34:22,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2094366.0, ans=0.0 2023-06-25 23:34:22,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2094366.0, ans=0.1 2023-06-25 23:34:34,979 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-25 23:34:37,120 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.08 vs. 
limit=22.5 2023-06-25 23:34:59,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-25 23:35:20,544 INFO [train.py:996] (1/4) Epoch 12, batch 13650, loss[loss=0.1831, simple_loss=0.2587, pruned_loss=0.05379, over 21310.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3123, pruned_loss=0.07786, over 4268079.10 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:35:50,274 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 6.699e+02 9.977e+02 1.659e+03 4.040e+03, threshold=1.995e+03, percent-clipped=8.0 2023-06-25 23:36:19,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2094666.0, ans=0.1 2023-06-25 23:36:21,243 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2094726.0, ans=10.0 2023-06-25 23:36:54,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2094786.0, ans=0.0 2023-06-25 23:37:03,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2094846.0, ans=0.2 2023-06-25 23:37:04,052 INFO [train.py:996] (1/4) Epoch 12, batch 13700, loss[loss=0.2348, simple_loss=0.3056, pruned_loss=0.08202, over 21816.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3062, pruned_loss=0.07754, over 4274332.45 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:37:31,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2094906.0, ans=0.0 2023-06-25 23:37:34,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.15 vs. limit=15.0 2023-06-25 23:37:35,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2094906.0, ans=0.125 2023-06-25 23:37:44,183 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2094906.0, ans=0.125 2023-06-25 23:37:54,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2094966.0, ans=0.2 2023-06-25 23:38:26,638 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-25 23:39:04,367 INFO [train.py:996] (1/4) Epoch 12, batch 13750, loss[loss=0.2224, simple_loss=0.2945, pruned_loss=0.07519, over 21622.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3043, pruned_loss=0.07701, over 4271079.33 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:39:17,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2095146.0, ans=0.0 2023-06-25 23:39:28,131 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.21 vs. 
limit=15.0 2023-06-25 23:39:33,880 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.606e+02 1.009e+03 1.585e+03 2.865e+03 5.412e+03, threshold=3.169e+03, percent-clipped=34.0 2023-06-25 23:40:49,594 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.16 vs. limit=10.0 2023-06-25 23:40:54,277 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-25 23:40:54,894 INFO [train.py:996] (1/4) Epoch 12, batch 13800, loss[loss=0.2719, simple_loss=0.364, pruned_loss=0.08986, over 21601.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3088, pruned_loss=0.07533, over 4267250.72 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:41:04,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2095446.0, ans=0.0 2023-06-25 23:41:52,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2095566.0, ans=0.125 2023-06-25 23:42:03,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2095566.0, ans=0.125 2023-06-25 23:42:04,553 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 23:42:07,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2095626.0, ans=0.125 2023-06-25 23:42:23,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2095626.0, ans=0.0 2023-06-25 23:42:32,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2095686.0, ans=0.025 2023-06-25 23:42:42,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-25 23:42:46,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095686.0, ans=0.1 2023-06-25 23:42:52,756 INFO [train.py:996] (1/4) Epoch 12, batch 13850, loss[loss=0.2732, simple_loss=0.3604, pruned_loss=0.09301, over 21855.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3183, pruned_loss=0.07688, over 4264145.14 frames. ], batch size: 371, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:43:21,908 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=22.5 2023-06-25 23:43:28,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.539e+02 1.083e+03 1.524e+03 2.080e+03 5.261e+03, threshold=3.047e+03, percent-clipped=9.0 2023-06-25 23:43:41,858 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2095866.0, ans=0.2 2023-06-25 23:44:11,163 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. 
limit=15.0 2023-06-25 23:44:32,935 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2095986.0, ans=0.2 2023-06-25 23:44:41,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2095986.0, ans=0.0 2023-06-25 23:44:49,839 INFO [train.py:996] (1/4) Epoch 12, batch 13900, loss[loss=0.2772, simple_loss=0.3352, pruned_loss=0.1097, over 21790.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3208, pruned_loss=0.0794, over 4269212.36 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:44:53,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2096046.0, ans=0.09899494936611666 2023-06-25 23:45:03,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096046.0, ans=0.1 2023-06-25 23:45:11,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-25 23:46:05,966 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-25 23:46:38,592 INFO [train.py:996] (1/4) Epoch 12, batch 13950, loss[loss=0.2736, simple_loss=0.3471, pruned_loss=0.1001, over 21706.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3206, pruned_loss=0.08106, over 4278750.29 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:46:50,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2096346.0, ans=0.0 2023-06-25 23:46:55,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2096346.0, ans=0.125 2023-06-25 23:47:08,277 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.165e+02 9.219e+02 1.176e+03 2.067e+03 4.872e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-25 23:47:11,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 23:48:25,194 INFO [train.py:996] (1/4) Epoch 12, batch 14000, loss[loss=0.1789, simple_loss=0.2549, pruned_loss=0.05149, over 21266.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3173, pruned_loss=0.07977, over 4268964.41 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:48:34,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2096646.0, ans=0.125 2023-06-25 23:48:51,192 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-25 23:49:05,230 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2096766.0, ans=0.0 2023-06-25 23:49:11,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2096766.0, ans=0.0 2023-06-25 23:50:12,359 INFO [train.py:996] (1/4) Epoch 12, batch 14050, loss[loss=0.2023, simple_loss=0.2692, pruned_loss=0.0677, over 21551.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3116, pruned_loss=0.07638, over 4268854.63 frames. 
], batch size: 213, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:50:41,380 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 8.595e+02 1.187e+03 1.606e+03 3.647e+03, threshold=2.374e+03, percent-clipped=9.0 2023-06-25 23:51:25,381 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-25 23:51:35,067 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097186.0, ans=0.1 2023-06-25 23:51:53,043 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2097186.0, ans=0.125 2023-06-25 23:52:05,840 INFO [train.py:996] (1/4) Epoch 12, batch 14100, loss[loss=0.2525, simple_loss=0.3142, pruned_loss=0.09541, over 21382.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3043, pruned_loss=0.07569, over 4268397.79 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:52:16,592 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-25 23:52:21,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097306.0, ans=0.1 2023-06-25 23:52:26,711 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. limit=12.0 2023-06-25 23:52:34,220 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. limit=15.0 2023-06-25 23:52:42,291 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2097366.0, ans=0.125 2023-06-25 23:52:57,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2097426.0, ans=0.125 2023-06-25 23:53:11,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2097426.0, ans=0.05 2023-06-25 23:53:40,200 INFO [train.py:996] (1/4) Epoch 12, batch 14150, loss[loss=0.2565, simple_loss=0.3707, pruned_loss=0.07118, over 19815.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3078, pruned_loss=0.07694, over 4256272.75 frames. 
], batch size: 702, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:53:51,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2097546.0, ans=0.125 2023-06-25 23:54:17,219 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.654e+02 9.098e+02 1.301e+03 1.898e+03 3.994e+03, threshold=2.602e+03, percent-clipped=15.0 2023-06-25 23:54:39,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2097666.0, ans=0.0 2023-06-25 23:54:39,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2097666.0, ans=0.1 2023-06-25 23:55:21,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2097786.0, ans=0.125 2023-06-25 23:55:25,915 INFO [train.py:996] (1/4) Epoch 12, batch 14200, loss[loss=0.2417, simple_loss=0.3003, pruned_loss=0.09155, over 21494.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3067, pruned_loss=0.07549, over 4242207.47 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:56:14,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-06-25 23:56:14,853 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2097966.0, ans=0.125 2023-06-25 23:57:10,465 INFO [train.py:996] (1/4) Epoch 12, batch 14250, loss[loss=0.2347, simple_loss=0.2933, pruned_loss=0.08809, over 21353.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3011, pruned_loss=0.07572, over 4244899.17 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:57:53,451 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.950e+02 7.589e+02 1.025e+03 1.608e+03 3.154e+03, threshold=2.050e+03, percent-clipped=3.0 2023-06-25 23:58:10,844 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-25 23:59:06,539 INFO [train.py:996] (1/4) Epoch 12, batch 14300, loss[loss=0.3112, simple_loss=0.4009, pruned_loss=0.1108, over 21791.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3035, pruned_loss=0.0749, over 4242982.89 frames. 
], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:59:08,817 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2098446.0, ans=0.125 2023-06-25 23:59:57,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2098566.0, ans=0.0 2023-06-26 00:00:12,455 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098626.0, ans=0.1 2023-06-26 00:00:21,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2098626.0, ans=0.035 2023-06-26 00:00:34,725 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:00:56,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2098746.0, ans=0.09899494936611666 2023-06-26 00:00:57,835 INFO [train.py:996] (1/4) Epoch 12, batch 14350, loss[loss=0.31, simple_loss=0.4048, pruned_loss=0.1075, over 21510.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3101, pruned_loss=0.07518, over 4235143.39 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:01:35,775 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.907e+02 9.524e+02 1.577e+03 2.595e+03 6.111e+03, threshold=3.154e+03, percent-clipped=35.0 2023-06-26 00:02:53,131 INFO [train.py:996] (1/4) Epoch 12, batch 14400, loss[loss=0.2115, simple_loss=0.2807, pruned_loss=0.07117, over 21841.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3076, pruned_loss=0.0762, over 4248734.38 frames. ], batch size: 333, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:02:55,357 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2099046.0, ans=0.0 2023-06-26 00:03:12,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2099046.0, ans=0.0 2023-06-26 00:03:24,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-26 00:03:54,818 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.76 vs. limit=6.0 2023-06-26 00:03:59,763 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2099226.0, ans=0.125 2023-06-26 00:04:06,960 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2099226.0, ans=0.125 2023-06-26 00:04:38,714 INFO [train.py:996] (1/4) Epoch 12, batch 14450, loss[loss=0.2553, simple_loss=0.3182, pruned_loss=0.09619, over 21742.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3014, pruned_loss=0.07592, over 4247681.81 frames. 
], batch size: 112, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:04:57,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2099346.0, ans=0.0 2023-06-26 00:05:10,386 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 8.096e+02 1.121e+03 1.806e+03 4.464e+03, threshold=2.243e+03, percent-clipped=9.0 2023-06-26 00:06:30,320 INFO [train.py:996] (1/4) Epoch 12, batch 14500, loss[loss=0.2278, simple_loss=0.3077, pruned_loss=0.07393, over 21218.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2968, pruned_loss=0.07551, over 4247964.77 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:06:30,865 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2099646.0, ans=0.125 2023-06-26 00:06:42,661 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2099646.0, ans=0.0 2023-06-26 00:06:57,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099706.0, ans=0.1 2023-06-26 00:07:11,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2099766.0, ans=0.0 2023-06-26 00:07:15,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=15.0 2023-06-26 00:08:26,787 INFO [train.py:996] (1/4) Epoch 12, batch 14550, loss[loss=0.2878, simple_loss=0.3525, pruned_loss=0.1116, over 21309.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3025, pruned_loss=0.07794, over 4255026.14 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:08:52,867 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.442e+02 9.003e+02 1.476e+03 2.430e+03 4.476e+03, threshold=2.953e+03, percent-clipped=26.0 2023-06-26 00:09:03,917 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:09:07,539 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2100066.0, ans=0.0 2023-06-26 00:10:11,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2100186.0, ans=0.1 2023-06-26 00:10:17,541 INFO [train.py:996] (1/4) Epoch 12, batch 14600, loss[loss=0.2538, simple_loss=0.3443, pruned_loss=0.08164, over 21868.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3093, pruned_loss=0.08047, over 4263984.50 frames. 
], batch size: 371, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:10:55,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2100366.0, ans=10.0 2023-06-26 00:11:02,140 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:11:16,616 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:11:52,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2100486.0, ans=0.125 2023-06-26 00:12:06,919 INFO [train.py:996] (1/4) Epoch 12, batch 14650, loss[loss=0.2213, simple_loss=0.2961, pruned_loss=0.07324, over 21786.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3125, pruned_loss=0.07934, over 4275727.04 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:12:16,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2100546.0, ans=0.2 2023-06-26 00:12:19,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-26 00:12:32,491 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.972e+02 9.038e+02 1.259e+03 1.802e+03 4.365e+03, threshold=2.519e+03, percent-clipped=6.0 2023-06-26 00:13:15,654 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2100726.0, ans=0.2 2023-06-26 00:13:57,519 INFO [train.py:996] (1/4) Epoch 12, batch 14700, loss[loss=0.1747, simple_loss=0.2452, pruned_loss=0.05204, over 21797.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3069, pruned_loss=0.07376, over 4266419.62 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:13:59,756 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2100846.0, ans=0.0 2023-06-26 00:14:20,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2100906.0, ans=0.07 2023-06-26 00:14:35,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2100966.0, ans=0.2 2023-06-26 00:14:50,152 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2100966.0, ans=0.0 2023-06-26 00:14:52,314 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2100966.0, ans=0.95 2023-06-26 00:14:52,328 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2100966.0, ans=0.2 2023-06-26 00:15:04,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2101026.0, ans=0.125 2023-06-26 00:15:19,398 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. 
limit=15.0 2023-06-26 00:15:22,377 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2101026.0, ans=0.125 2023-06-26 00:15:24,059 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2101086.0, ans=0.125 2023-06-26 00:15:31,701 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2101086.0, ans=0.0 2023-06-26 00:15:33,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2101086.0, ans=0.2 2023-06-26 00:15:44,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2101086.0, ans=0.125 2023-06-26 00:15:49,413 INFO [train.py:996] (1/4) Epoch 12, batch 14750, loss[loss=0.2835, simple_loss=0.3512, pruned_loss=0.1079, over 21610.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3114, pruned_loss=0.07541, over 4260855.36 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:16:09,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2101206.0, ans=0.2 2023-06-26 00:16:21,763 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.018e+02 8.627e+02 1.543e+03 2.178e+03 4.695e+03, threshold=3.085e+03, percent-clipped=15.0 2023-06-26 00:17:24,015 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2101386.0, ans=0.2 2023-06-26 00:17:40,006 INFO [train.py:996] (1/4) Epoch 12, batch 14800, loss[loss=0.2297, simple_loss=0.3032, pruned_loss=0.07808, over 21734.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3219, pruned_loss=0.08073, over 4258276.74 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 32.0 2023-06-26 00:17:57,751 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101506.0, ans=0.1 2023-06-26 00:18:10,433 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2101506.0, ans=0.125 2023-06-26 00:18:51,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2101566.0, ans=0.0 2023-06-26 00:19:14,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2101686.0, ans=0.0 2023-06-26 00:19:19,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-26 00:19:19,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-26 00:19:29,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2101686.0, ans=0.125 2023-06-26 00:19:32,065 INFO [train.py:996] (1/4) Epoch 12, batch 14850, loss[loss=0.2141, simple_loss=0.2815, pruned_loss=0.07333, over 21786.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3149, pruned_loss=0.07998, over 4257712.39 frames. 
], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:20:00,765 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2101806.0, ans=0.125 2023-06-26 00:20:16,291 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.961e+02 9.941e+02 1.371e+03 2.279e+03 6.206e+03, threshold=2.743e+03, percent-clipped=9.0 2023-06-26 00:20:35,165 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2101866.0, ans=0.0 2023-06-26 00:20:49,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2101926.0, ans=0.2 2023-06-26 00:21:02,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2101926.0, ans=15.0 2023-06-26 00:21:32,293 INFO [train.py:996] (1/4) Epoch 12, batch 14900, loss[loss=0.2525, simple_loss=0.3286, pruned_loss=0.08817, over 21336.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.319, pruned_loss=0.08276, over 4255353.91 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:21:32,791 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2102046.0, ans=0.2 2023-06-26 00:21:49,329 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:22:10,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2102106.0, ans=0.2 2023-06-26 00:23:29,187 INFO [train.py:996] (1/4) Epoch 12, batch 14950, loss[loss=0.232, simple_loss=0.3095, pruned_loss=0.07722, over 21742.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3191, pruned_loss=0.08215, over 4263301.14 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:24:02,301 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.432e+02 8.420e+02 1.194e+03 1.609e+03 3.792e+03, threshold=2.388e+03, percent-clipped=5.0 2023-06-26 00:24:08,435 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-26 00:24:11,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2102466.0, ans=0.2 2023-06-26 00:24:24,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2102466.0, ans=0.2 2023-06-26 00:24:28,778 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2102526.0, ans=0.0 2023-06-26 00:25:04,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2102586.0, ans=0.09899494936611666 2023-06-26 00:25:18,753 INFO [train.py:996] (1/4) Epoch 12, batch 15000, loss[loss=0.232, simple_loss=0.2969, pruned_loss=0.08351, over 21657.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3221, pruned_loss=0.0844, over 4258546.40 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:25:18,757 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 00:25:42,265 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2582, simple_loss=0.348, pruned_loss=0.08425, over 1796401.00 frames. 
2023-06-26 00:25:42,266 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-26 00:26:18,488 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2102766.0, ans=0.1 2023-06-26 00:27:29,957 INFO [train.py:996] (1/4) Epoch 12, batch 15050, loss[loss=0.268, simple_loss=0.3608, pruned_loss=0.08764, over 21674.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3214, pruned_loss=0.08462, over 4252865.88 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:27:56,946 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.042e+02 9.488e+02 1.374e+03 1.981e+03 5.080e+03, threshold=2.749e+03, percent-clipped=16.0 2023-06-26 00:28:32,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103126.0, ans=0.1 2023-06-26 00:29:08,896 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2103186.0, ans=0.125 2023-06-26 00:29:12,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2103186.0, ans=0.0 2023-06-26 00:29:16,854 INFO [train.py:996] (1/4) Epoch 12, batch 15100, loss[loss=0.2815, simple_loss=0.3521, pruned_loss=0.1054, over 21700.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3252, pruned_loss=0.08465, over 4257594.85 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:29:34,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103306.0, ans=0.1 2023-06-26 00:29:37,016 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-26 00:29:57,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103366.0, ans=0.1 2023-06-26 00:30:39,685 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2103426.0, ans=0.125 2023-06-26 00:30:45,077 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2103426.0, ans=0.0 2023-06-26 00:31:07,973 INFO [train.py:996] (1/4) Epoch 12, batch 15150, loss[loss=0.229, simple_loss=0.2946, pruned_loss=0.08168, over 21555.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3214, pruned_loss=0.08494, over 4254078.67 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:31:10,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2103546.0, ans=0.125 2023-06-26 00:31:16,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=12.0 2023-06-26 00:31:44,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 7.677e+02 1.025e+03 1.452e+03 2.770e+03, threshold=2.050e+03, percent-clipped=1.0 2023-06-26 00:32:13,771 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2103666.0, ans=0.1 2023-06-26 00:32:45,137 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-26 00:32:57,474 INFO [train.py:996] (1/4) Epoch 12, batch 15200, loss[loss=0.2161, simple_loss=0.2853, pruned_loss=0.07343, over 21278.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3114, pruned_loss=0.08098, over 4261422.60 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:32:57,927 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2103846.0, ans=0.5 2023-06-26 00:33:58,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103966.0, ans=0.1 2023-06-26 00:34:11,612 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-26 00:34:22,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2104026.0, ans=0.125 2023-06-26 00:34:45,760 INFO [train.py:996] (1/4) Epoch 12, batch 15250, loss[loss=0.2632, simple_loss=0.3527, pruned_loss=0.08689, over 19762.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3061, pruned_loss=0.0793, over 4264322.56 frames. ], batch size: 704, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:35:33,142 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 8.853e+02 1.364e+03 1.967e+03 5.293e+03, threshold=2.727e+03, percent-clipped=20.0 2023-06-26 00:35:39,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2104266.0, ans=0.125 2023-06-26 00:36:04,017 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2104326.0, ans=0.1 2023-06-26 00:36:05,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2104326.0, ans=0.125 2023-06-26 00:36:35,482 INFO [train.py:996] (1/4) Epoch 12, batch 15300, loss[loss=0.2816, simple_loss=0.345, pruned_loss=0.1091, over 21227.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3083, pruned_loss=0.0817, over 4266642.08 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:36:56,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2104506.0, ans=0.125 2023-06-26 00:37:41,130 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2104566.0, ans=0.05 2023-06-26 00:38:10,514 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2104686.0, ans=0.125 2023-06-26 00:38:22,863 INFO [train.py:996] (1/4) Epoch 12, batch 15350, loss[loss=0.2964, simple_loss=0.3627, pruned_loss=0.1151, over 21277.00 frames. 
], tot_loss[loss=0.2417, simple_loss=0.3156, pruned_loss=0.08394, over 4263342.78 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:38:23,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2104746.0, ans=0.0 2023-06-26 00:39:10,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.965e+02 8.644e+02 1.167e+03 1.665e+03 4.882e+03, threshold=2.334e+03, percent-clipped=9.0 2023-06-26 00:39:30,868 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-26 00:40:00,126 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2104986.0, ans=0.125 2023-06-26 00:40:09,319 INFO [train.py:996] (1/4) Epoch 12, batch 15400, loss[loss=0.2232, simple_loss=0.3009, pruned_loss=0.07269, over 21504.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3148, pruned_loss=0.08233, over 4265982.90 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:41:12,495 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2105166.0, ans=0.2 2023-06-26 00:41:23,244 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2105226.0, ans=0.07 2023-06-26 00:41:45,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2105286.0, ans=0.125 2023-06-26 00:41:58,344 INFO [train.py:996] (1/4) Epoch 12, batch 15450, loss[loss=0.2149, simple_loss=0.2851, pruned_loss=0.07238, over 21554.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3115, pruned_loss=0.08131, over 4269859.33 frames. ], batch size: 548, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:42:01,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2105346.0, ans=0.2 2023-06-26 00:42:26,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2105406.0, ans=0.125 2023-06-26 00:42:40,798 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.775e+02 1.085e+03 1.677e+03 3.243e+03, threshold=2.170e+03, percent-clipped=5.0 2023-06-26 00:43:03,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2105466.0, ans=0.5 2023-06-26 00:43:17,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2105526.0, ans=0.025 2023-06-26 00:43:42,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2105586.0, ans=0.1 2023-06-26 00:43:47,772 INFO [train.py:996] (1/4) Epoch 12, batch 15500, loss[loss=0.2277, simple_loss=0.3125, pruned_loss=0.07149, over 20666.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3146, pruned_loss=0.08061, over 4270130.79 frames. 
], batch size: 607, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:45:13,134 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2105826.0, ans=0.125 2023-06-26 00:45:42,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2105946.0, ans=0.125 2023-06-26 00:45:43,217 INFO [train.py:996] (1/4) Epoch 12, batch 15550, loss[loss=0.2356, simple_loss=0.3187, pruned_loss=0.07625, over 21689.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3131, pruned_loss=0.07824, over 4270933.57 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:45:59,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2105946.0, ans=0.125 2023-06-26 00:46:14,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2106006.0, ans=0.09899494936611666 2023-06-26 00:46:34,319 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.052e+03 1.309e+03 1.870e+03 4.327e+03, threshold=2.618e+03, percent-clipped=17.0 2023-06-26 00:46:45,288 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2106066.0, ans=0.04949747468305833 2023-06-26 00:47:41,695 INFO [train.py:996] (1/4) Epoch 12, batch 15600, loss[loss=0.2851, simple_loss=0.345, pruned_loss=0.1126, over 21398.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3068, pruned_loss=0.07705, over 4266254.16 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:48:42,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2106426.0, ans=0.125 2023-06-26 00:48:43,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2106426.0, ans=0.05 2023-06-26 00:49:20,615 INFO [train.py:996] (1/4) Epoch 12, batch 15650, loss[loss=0.2097, simple_loss=0.2774, pruned_loss=0.07105, over 21487.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3053, pruned_loss=0.07607, over 4272192.98 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:49:30,524 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2106546.0, ans=0.07 2023-06-26 00:49:44,426 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. 
limit=6.0 2023-06-26 00:50:07,692 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.319e+02 9.009e+02 1.257e+03 1.963e+03 4.655e+03, threshold=2.515e+03, percent-clipped=11.0 2023-06-26 00:50:11,463 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2106666.0, ans=0.0 2023-06-26 00:50:16,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2106666.0, ans=0.0 2023-06-26 00:50:25,057 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2106666.0, ans=0.0 2023-06-26 00:50:43,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2106786.0, ans=0.125 2023-06-26 00:51:12,653 INFO [train.py:996] (1/4) Epoch 12, batch 15700, loss[loss=0.2643, simple_loss=0.3905, pruned_loss=0.06903, over 19802.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3037, pruned_loss=0.0753, over 4264673.87 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:51:21,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2106846.0, ans=0.125 2023-06-26 00:51:41,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2106906.0, ans=0.2 2023-06-26 00:51:41,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2106906.0, ans=0.125 2023-06-26 00:51:46,975 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2106906.0, ans=0.125 2023-06-26 00:52:01,818 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2106966.0, ans=0.05 2023-06-26 00:52:04,423 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-26 00:52:37,105 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2107086.0, ans=0.125 2023-06-26 00:52:55,631 INFO [train.py:996] (1/4) Epoch 12, batch 15750, loss[loss=0.2, simple_loss=0.2723, pruned_loss=0.06385, over 22002.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2995, pruned_loss=0.07493, over 4255242.09 frames. 
], batch size: 103, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:53:01,239 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2107146.0, ans=0.1 2023-06-26 00:53:35,899 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2107206.0, ans=0.0 2023-06-26 00:53:38,553 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.389e+02 8.722e+02 1.383e+03 2.030e+03 4.451e+03, threshold=2.767e+03, percent-clipped=16.0 2023-06-26 00:53:43,034 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2107266.0, ans=0.2 2023-06-26 00:53:52,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2107266.0, ans=0.5 2023-06-26 00:54:16,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2107386.0, ans=0.125 2023-06-26 00:54:41,615 INFO [train.py:996] (1/4) Epoch 12, batch 15800, loss[loss=0.2184, simple_loss=0.2819, pruned_loss=0.07745, over 21797.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2943, pruned_loss=0.07471, over 4251397.27 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:55:24,572 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2107506.0, ans=6.0 2023-06-26 00:56:01,638 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2107686.0, ans=0.0 2023-06-26 00:56:30,884 INFO [train.py:996] (1/4) Epoch 12, batch 15850, loss[loss=0.2134, simple_loss=0.293, pruned_loss=0.06689, over 21704.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.297, pruned_loss=0.07737, over 4260165.81 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:57:10,191 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.897e+02 9.764e+02 1.470e+03 2.269e+03 4.632e+03, threshold=2.939e+03, percent-clipped=9.0 2023-06-26 00:57:10,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2107866.0, ans=0.0 2023-06-26 00:57:16,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2107866.0, ans=0.125 2023-06-26 00:57:22,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2107866.0, ans=0.1 2023-06-26 00:58:14,235 INFO [train.py:996] (1/4) Epoch 12, batch 15900, loss[loss=0.2185, simple_loss=0.2811, pruned_loss=0.07794, over 21317.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2944, pruned_loss=0.07761, over 4249667.02 frames. 
], batch size: 549, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:58:58,850 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108106.0, ans=0.1 2023-06-26 00:59:06,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2108166.0, ans=0.1 2023-06-26 00:59:35,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2108286.0, ans=0.125 2023-06-26 00:59:56,226 INFO [train.py:996] (1/4) Epoch 12, batch 15950, loss[loss=0.2791, simple_loss=0.3507, pruned_loss=0.1037, over 21595.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2975, pruned_loss=0.07513, over 4255879.33 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:00:08,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2108346.0, ans=0.125 2023-06-26 01:00:37,289 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-26 01:00:41,311 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.587e+02 1.024e+03 1.341e+03 2.731e+03, threshold=2.049e+03, percent-clipped=0.0 2023-06-26 01:01:05,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2108526.0, ans=0.125 2023-06-26 01:01:10,610 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2108526.0, ans=0.0 2023-06-26 01:01:25,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2108586.0, ans=0.125 2023-06-26 01:01:25,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2108586.0, ans=0.2 2023-06-26 01:01:28,186 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2108586.0, ans=0.04949747468305833 2023-06-26 01:01:32,785 INFO [train.py:996] (1/4) Epoch 12, batch 16000, loss[loss=0.195, simple_loss=0.2904, pruned_loss=0.04981, over 21758.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2996, pruned_loss=0.07346, over 4269540.04 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:01:56,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2108646.0, ans=0.125 2023-06-26 01:02:37,546 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2108766.0, ans=0.125 2023-06-26 01:02:41,617 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-26 01:02:47,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2108826.0, ans=0.1 2023-06-26 01:03:00,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2108826.0, ans=0.125 2023-06-26 01:03:26,089 INFO [train.py:996] (1/4) Epoch 12, batch 16050, loss[loss=0.2431, simple_loss=0.3426, pruned_loss=0.07179, over 21782.00 frames. 
], tot_loss[loss=0.2228, simple_loss=0.3026, pruned_loss=0.07152, over 4269089.99 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:04:12,371 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:04:12,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2109006.0, ans=0.0 2023-06-26 01:04:18,280 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.237e+02 1.456e+03 2.962e+03 6.704e+03, threshold=2.913e+03, percent-clipped=34.0 2023-06-26 01:04:22,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2109066.0, ans=0.2 2023-06-26 01:04:34,397 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 01:04:56,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2109186.0, ans=0.125 2023-06-26 01:05:16,016 INFO [train.py:996] (1/4) Epoch 12, batch 16100, loss[loss=0.2464, simple_loss=0.3055, pruned_loss=0.09364, over 21269.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3078, pruned_loss=0.07422, over 4270143.82 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:05:57,060 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2109306.0, ans=0.125 2023-06-26 01:06:09,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2109366.0, ans=0.025 2023-06-26 01:06:55,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2109486.0, ans=0.2 2023-06-26 01:07:03,540 INFO [train.py:996] (1/4) Epoch 12, batch 16150, loss[loss=0.2922, simple_loss=0.3456, pruned_loss=0.1193, over 21742.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3061, pruned_loss=0.07663, over 4280436.53 frames. ], batch size: 473, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:07:57,415 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2109606.0, ans=0.0 2023-06-26 01:07:58,379 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.957e+02 1.137e+03 1.849e+03 5.347e+03, threshold=2.275e+03, percent-clipped=8.0 2023-06-26 01:08:26,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2109726.0, ans=0.0 2023-06-26 01:08:56,302 INFO [train.py:996] (1/4) Epoch 12, batch 16200, loss[loss=0.2949, simple_loss=0.3461, pruned_loss=0.1218, over 22036.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3107, pruned_loss=0.07788, over 4287020.29 frames. ], batch size: 416, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:10:52,258 INFO [train.py:996] (1/4) Epoch 12, batch 16250, loss[loss=0.2435, simple_loss=0.3078, pruned_loss=0.08959, over 21428.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3115, pruned_loss=0.07819, over 4286304.51 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:10:53,883 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=12.0 2023-06-26 01:11:31,331 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.426e+02 1.023e+03 1.432e+03 2.136e+03 5.202e+03, threshold=2.864e+03, percent-clipped=19.0 2023-06-26 01:11:46,235 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-26 01:11:58,866 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110326.0, ans=0.1 2023-06-26 01:12:09,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-26 01:12:14,031 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2110386.0, ans=0.0 2023-06-26 01:12:33,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2110446.0, ans=0.125 2023-06-26 01:12:34,507 INFO [train.py:996] (1/4) Epoch 12, batch 16300, loss[loss=0.2019, simple_loss=0.2983, pruned_loss=0.05273, over 21601.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3051, pruned_loss=0.07439, over 4282327.46 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:12:38,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2110446.0, ans=0.05 2023-06-26 01:12:40,372 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2110446.0, ans=0.125 2023-06-26 01:12:56,790 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2110446.0, ans=0.0 2023-06-26 01:13:13,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2110506.0, ans=0.125 2023-06-26 01:13:39,705 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2110626.0, ans=0.0 2023-06-26 01:14:17,412 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2110686.0, ans=0.2 2023-06-26 01:14:25,781 INFO [train.py:996] (1/4) Epoch 12, batch 16350, loss[loss=0.3008, simple_loss=0.3617, pruned_loss=0.12, over 21289.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3038, pruned_loss=0.07478, over 4269773.58 frames. 
], batch size: 507, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:15:07,543 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.769e+02 8.768e+02 1.143e+03 1.640e+03 3.455e+03, threshold=2.286e+03, percent-clipped=4.0 2023-06-26 01:15:11,738 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2110866.0, ans=0.0 2023-06-26 01:15:18,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2110866.0, ans=0.0 2023-06-26 01:15:21,787 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2110866.0, ans=0.0 2023-06-26 01:15:38,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2110926.0, ans=0.125 2023-06-26 01:16:21,195 INFO [train.py:996] (1/4) Epoch 12, batch 16400, loss[loss=0.2208, simple_loss=0.3012, pruned_loss=0.07021, over 21830.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3088, pruned_loss=0.07746, over 4272091.09 frames. ], batch size: 391, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:16:54,195 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2111166.0, ans=0.125 2023-06-26 01:16:59,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2111166.0, ans=0.125 2023-06-26 01:17:23,479 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111226.0, ans=0.1 2023-06-26 01:18:00,974 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-26 01:18:04,813 INFO [train.py:996] (1/4) Epoch 12, batch 16450, loss[loss=0.2472, simple_loss=0.3168, pruned_loss=0.08882, over 21152.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3084, pruned_loss=0.07842, over 4280314.25 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:18:07,481 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2111346.0, ans=0.125 2023-06-26 01:18:25,116 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2111406.0, ans=0.0 2023-06-26 01:18:39,417 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.567e+02 7.458e+02 1.043e+03 1.518e+03 3.613e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-26 01:18:45,044 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2111466.0, ans=0.125 2023-06-26 01:18:48,723 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2111466.0, ans=0.125 2023-06-26 01:18:52,178 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:18:57,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2111526.0, ans=0.2 2023-06-26 01:19:41,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2111586.0, ans=0.1 2023-06-26 01:19:54,999 INFO [train.py:996] (1/4) Epoch 12, batch 16500, loss[loss=0.1938, simple_loss=0.2539, pruned_loss=0.06679, over 21398.00 frames. 
], tot_loss[loss=0.2312, simple_loss=0.3056, pruned_loss=0.07839, over 4283757.49 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:19:58,293 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-06-26 01:20:14,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2111706.0, ans=0.0 2023-06-26 01:20:49,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2111766.0, ans=0.0 2023-06-26 01:21:26,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2111826.0, ans=0.125 2023-06-26 01:21:46,606 INFO [train.py:996] (1/4) Epoch 12, batch 16550, loss[loss=0.2596, simple_loss=0.3417, pruned_loss=0.08875, over 21602.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3022, pruned_loss=0.07499, over 4274940.16 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:22:28,620 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.316e+02 1.128e+03 1.805e+03 2.872e+03 7.168e+03, threshold=3.610e+03, percent-clipped=40.0 2023-06-26 01:23:20,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=2112186.0, ans=0.05 2023-06-26 01:23:20,931 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2112186.0, ans=0.0 2023-06-26 01:23:39,438 INFO [train.py:996] (1/4) Epoch 12, batch 16600, loss[loss=0.3331, simple_loss=0.4255, pruned_loss=0.1203, over 21643.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3087, pruned_loss=0.0772, over 4276628.31 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:24:31,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2112366.0, ans=0.0 2023-06-26 01:25:02,139 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.06 vs. limit=15.0 2023-06-26 01:25:29,304 INFO [train.py:996] (1/4) Epoch 12, batch 16650, loss[loss=0.2748, simple_loss=0.3583, pruned_loss=0.09561, over 21697.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3212, pruned_loss=0.08027, over 4274948.43 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:25:42,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2112546.0, ans=10.0 2023-06-26 01:25:44,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=12.0 2023-06-26 01:25:49,758 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2112546.0, ans=0.2 2023-06-26 01:25:53,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2112606.0, ans=0.125 2023-06-26 01:26:16,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. 
limit=22.5 2023-06-26 01:26:29,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.918e+02 9.386e+02 1.441e+03 2.110e+03 3.541e+03, threshold=2.881e+03, percent-clipped=0.0 2023-06-26 01:26:34,081 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-26 01:26:51,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2112726.0, ans=0.125 2023-06-26 01:27:27,194 INFO [train.py:996] (1/4) Epoch 12, batch 16700, loss[loss=0.2298, simple_loss=0.3401, pruned_loss=0.05979, over 20773.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3236, pruned_loss=0.08139, over 4278839.52 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:27:29,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2112846.0, ans=0.95 2023-06-26 01:28:12,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2112906.0, ans=0.0 2023-06-26 01:28:59,437 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:29:14,234 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2113086.0, ans=0.0 2023-06-26 01:29:14,864 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 01:29:38,446 INFO [train.py:996] (1/4) Epoch 12, batch 16750, loss[loss=0.2861, simple_loss=0.3601, pruned_loss=0.1061, over 21468.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3261, pruned_loss=0.08392, over 4268835.38 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:29:53,344 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.73 vs. limit=22.5 2023-06-26 01:29:59,497 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113206.0, ans=0.125 2023-06-26 01:30:18,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2113266.0, ans=0.025 2023-06-26 01:30:21,353 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.824e+02 1.154e+03 1.795e+03 2.444e+03 4.443e+03, threshold=3.590e+03, percent-clipped=18.0 2023-06-26 01:31:29,988 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-26 01:31:30,353 INFO [train.py:996] (1/4) Epoch 12, batch 16800, loss[loss=0.2728, simple_loss=0.3546, pruned_loss=0.09547, over 21802.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3307, pruned_loss=0.08408, over 4271796.84 frames. 
], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:32:04,742 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2113506.0, ans=0.125 2023-06-26 01:32:06,625 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2113506.0, ans=0.0 2023-06-26 01:32:30,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2113626.0, ans=0.125 2023-06-26 01:33:06,581 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-26 01:33:18,337 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-26 01:33:18,835 INFO [train.py:996] (1/4) Epoch 12, batch 16850, loss[loss=0.2134, simple_loss=0.2843, pruned_loss=0.07123, over 21384.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3258, pruned_loss=0.08338, over 4272066.61 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:33:21,299 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2113746.0, ans=0.125 2023-06-26 01:33:49,166 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=10.0 2023-06-26 01:33:54,131 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2113806.0, ans=0.0 2023-06-26 01:34:00,636 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.998e+02 1.125e+03 1.896e+03 2.663e+03 4.313e+03, threshold=3.792e+03, percent-clipped=10.0 2023-06-26 01:34:04,721 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2113866.0, ans=0.125 2023-06-26 01:34:06,563 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2113866.0, ans=0.125 2023-06-26 01:34:28,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2113926.0, ans=0.0 2023-06-26 01:34:56,658 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-26 01:35:07,207 INFO [train.py:996] (1/4) Epoch 12, batch 16900, loss[loss=0.2017, simple_loss=0.2792, pruned_loss=0.0621, over 21611.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3209, pruned_loss=0.08254, over 4267381.46 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:35:08,284 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=10.0 2023-06-26 01:35:28,985 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2114106.0, ans=0.125 2023-06-26 01:35:39,469 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2114106.0, ans=0.1 2023-06-26 01:35:47,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2114166.0, ans=0.2 2023-06-26 01:36:04,983 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2114226.0, ans=0.125 2023-06-26 01:36:17,088 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:36:32,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2114226.0, ans=0.125 2023-06-26 01:36:51,863 INFO [train.py:996] (1/4) Epoch 12, batch 16950, loss[loss=0.2447, simple_loss=0.3134, pruned_loss=0.08799, over 15791.00 frames. ], tot_loss[loss=0.238, simple_loss=0.314, pruned_loss=0.08105, over 4267506.40 frames. ], batch size: 60, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:36:53,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.72 vs. limit=22.5 2023-06-26 01:37:25,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2114406.0, ans=0.0 2023-06-26 01:37:30,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2114466.0, ans=0.125 2023-06-26 01:37:30,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-26 01:37:33,326 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 8.373e+02 1.013e+03 1.291e+03 3.071e+03, threshold=2.026e+03, percent-clipped=0.0 2023-06-26 01:37:56,617 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:38:00,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2114526.0, ans=0.0 2023-06-26 01:38:41,242 INFO [train.py:996] (1/4) Epoch 12, batch 17000, loss[loss=0.2086, simple_loss=0.2762, pruned_loss=0.07049, over 21677.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3108, pruned_loss=0.08148, over 4276192.24 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:39:37,470 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.01 vs. limit=15.0 2023-06-26 01:40:24,697 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2114946.0, ans=0.125 2023-06-26 01:40:25,810 INFO [train.py:996] (1/4) Epoch 12, batch 17050, loss[loss=0.2647, simple_loss=0.3476, pruned_loss=0.0909, over 21846.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3187, pruned_loss=0.08432, over 4284415.24 frames. 
], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:40:35,002 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2114946.0, ans=0.1 2023-06-26 01:40:50,785 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-26 01:40:54,197 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-26 01:41:05,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.789e+02 8.713e+02 1.354e+03 1.958e+03 3.911e+03, threshold=2.708e+03, percent-clipped=23.0 2023-06-26 01:41:52,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2115126.0, ans=10.0 2023-06-26 01:42:03,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2115186.0, ans=0.0 2023-06-26 01:42:13,594 INFO [train.py:996] (1/4) Epoch 12, batch 17100, loss[loss=0.2246, simple_loss=0.2964, pruned_loss=0.07642, over 21845.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3184, pruned_loss=0.08485, over 4287288.08 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:42:33,604 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.94 vs. limit=15.0 2023-06-26 01:42:36,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2115306.0, ans=0.125 2023-06-26 01:43:23,160 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:43:33,327 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2115426.0, ans=0.035 2023-06-26 01:43:55,870 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2115486.0, ans=0.1 2023-06-26 01:44:02,129 INFO [train.py:996] (1/4) Epoch 12, batch 17150, loss[loss=0.1975, simple_loss=0.2715, pruned_loss=0.06176, over 21399.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3144, pruned_loss=0.08448, over 4291265.62 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:44:43,622 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.448e+02 7.663e+02 1.094e+03 1.326e+03 2.492e+03, threshold=2.188e+03, percent-clipped=0.0 2023-06-26 01:45:47,678 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2115786.0, ans=0.04949747468305833 2023-06-26 01:45:49,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2115786.0, ans=0.0 2023-06-26 01:45:52,532 INFO [train.py:996] (1/4) Epoch 12, batch 17200, loss[loss=0.2522, simple_loss=0.3236, pruned_loss=0.09046, over 20708.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3141, pruned_loss=0.08377, over 4291109.93 frames. 
], batch size: 607, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 01:46:09,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2115906.0, ans=0.125 2023-06-26 01:46:14,798 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-26 01:47:43,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-26 01:47:45,185 INFO [train.py:996] (1/4) Epoch 12, batch 17250, loss[loss=0.2272, simple_loss=0.315, pruned_loss=0.06967, over 21584.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3169, pruned_loss=0.08521, over 4289190.38 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:48:40,276 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-26 01:48:42,514 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+02 8.687e+02 1.216e+03 1.753e+03 5.268e+03, threshold=2.433e+03, percent-clipped=15.0 2023-06-26 01:48:50,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2116266.0, ans=0.2 2023-06-26 01:49:27,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2116386.0, ans=0.125 2023-06-26 01:49:35,547 INFO [train.py:996] (1/4) Epoch 12, batch 17300, loss[loss=0.2544, simple_loss=0.3225, pruned_loss=0.09316, over 21376.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.324, pruned_loss=0.0876, over 4282981.70 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:49:39,405 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:50:52,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2116626.0, ans=0.125 2023-06-26 01:50:55,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-26 01:51:11,329 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2116686.0, ans=0.2 2023-06-26 01:51:43,243 INFO [train.py:996] (1/4) Epoch 12, batch 17350, loss[loss=0.2688, simple_loss=0.3424, pruned_loss=0.0976, over 20680.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.325, pruned_loss=0.08701, over 4278918.55 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:51:54,633 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
limit=22.5 2023-06-26 01:52:01,014 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2116746.0, ans=0.125 2023-06-26 01:52:33,996 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.805e+02 1.057e+03 1.442e+03 1.846e+03 4.357e+03, threshold=2.883e+03, percent-clipped=11.0 2023-06-26 01:53:00,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2116926.0, ans=0.1 2023-06-26 01:53:38,422 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-06-26 01:53:38,842 INFO [train.py:996] (1/4) Epoch 12, batch 17400, loss[loss=0.245, simple_loss=0.3488, pruned_loss=0.07058, over 21211.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3223, pruned_loss=0.08308, over 4278794.98 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:54:16,518 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2117166.0, ans=0.125 2023-06-26 01:54:24,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=2117166.0, ans=0.05 2023-06-26 01:54:43,173 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2117226.0, ans=0.0 2023-06-26 01:54:46,921 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-26 01:54:58,364 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-26 01:55:01,479 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-06-26 01:55:17,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2117286.0, ans=0.2 2023-06-26 01:55:25,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2117346.0, ans=0.0 2023-06-26 01:55:26,403 INFO [train.py:996] (1/4) Epoch 12, batch 17450, loss[loss=0.217, simple_loss=0.3081, pruned_loss=0.06299, over 21730.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3197, pruned_loss=0.08107, over 4268177.09 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:56:00,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2117406.0, ans=0.0 2023-06-26 01:56:11,457 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.427e+02 9.633e+02 1.716e+03 2.627e+03 5.192e+03, threshold=3.432e+03, percent-clipped=16.0 2023-06-26 01:56:46,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2117526.0, ans=0.125 2023-06-26 01:56:49,849 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-26 01:57:16,249 INFO [train.py:996] (1/4) Epoch 12, batch 17500, loss[loss=0.2343, simple_loss=0.3077, pruned_loss=0.08049, over 21631.00 frames. 
], tot_loss[loss=0.2355, simple_loss=0.3142, pruned_loss=0.07838, over 4273409.31 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:57:49,941 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.85 vs. limit=22.5 2023-06-26 01:57:57,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2117766.0, ans=0.2 2023-06-26 01:58:34,942 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2117826.0, ans=0.0 2023-06-26 01:59:02,059 INFO [train.py:996] (1/4) Epoch 12, batch 17550, loss[loss=0.2431, simple_loss=0.325, pruned_loss=0.08055, over 21251.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3134, pruned_loss=0.07656, over 4278966.54 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:59:10,347 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.36 vs. limit=15.0 2023-06-26 01:59:35,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2118006.0, ans=0.125 2023-06-26 01:59:44,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.396e+02 1.240e+03 1.763e+03 3.484e+03, threshold=2.480e+03, percent-clipped=3.0 2023-06-26 02:00:44,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2118186.0, ans=0.2 2023-06-26 02:00:47,379 INFO [train.py:996] (1/4) Epoch 12, batch 17600, loss[loss=0.2893, simple_loss=0.3567, pruned_loss=0.111, over 21412.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3175, pruned_loss=0.07757, over 4270637.30 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:00:54,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2118246.0, ans=0.125 2023-06-26 02:01:00,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-26 02:01:07,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2118306.0, ans=0.0 2023-06-26 02:01:41,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2118366.0, ans=0.0 2023-06-26 02:01:44,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2118366.0, ans=0.0 2023-06-26 02:01:46,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2118366.0, ans=0.2 2023-06-26 02:02:16,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2118426.0, ans=0.0 2023-06-26 02:02:30,178 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2118486.0, ans=0.125 2023-06-26 02:02:36,285 INFO [train.py:996] (1/4) Epoch 12, batch 17650, loss[loss=0.1963, simple_loss=0.2842, pruned_loss=0.05422, over 21590.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3141, pruned_loss=0.07719, over 4277751.88 frames. 
], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:03:21,820 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2118666.0, ans=0.125 2023-06-26 02:03:27,856 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 8.671e+02 1.388e+03 2.220e+03 4.878e+03, threshold=2.775e+03, percent-clipped=22.0 2023-06-26 02:03:50,487 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118726.0, ans=0.1 2023-06-26 02:03:51,109 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-26 02:04:00,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2118726.0, ans=0.125 2023-06-26 02:04:08,169 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-26 02:04:09,385 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2118786.0, ans=0.125 2023-06-26 02:04:09,452 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2118786.0, ans=0.125 2023-06-26 02:04:16,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-26 02:04:31,222 INFO [train.py:996] (1/4) Epoch 12, batch 17700, loss[loss=0.2692, simple_loss=0.3517, pruned_loss=0.09336, over 21957.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3076, pruned_loss=0.07493, over 4270461.85 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:05:39,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2119026.0, ans=0.0 2023-06-26 02:05:47,740 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-26 02:05:57,008 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2119086.0, ans=0.125 2023-06-26 02:06:10,321 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2119086.0, ans=0.0 2023-06-26 02:06:20,341 INFO [train.py:996] (1/4) Epoch 12, batch 17750, loss[loss=0.2459, simple_loss=0.3307, pruned_loss=0.08055, over 21802.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3158, pruned_loss=0.07877, over 4276302.69 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:06:41,319 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. 
limit=15.0 2023-06-26 02:07:18,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.321e+02 9.910e+02 1.512e+03 2.052e+03 5.083e+03, threshold=3.025e+03, percent-clipped=13.0 2023-06-26 02:07:38,888 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2119326.0, ans=0.125 2023-06-26 02:07:40,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119326.0, ans=0.1 2023-06-26 02:07:59,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2119386.0, ans=0.0 2023-06-26 02:08:11,437 INFO [train.py:996] (1/4) Epoch 12, batch 17800, loss[loss=0.2171, simple_loss=0.2866, pruned_loss=0.07385, over 21129.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3158, pruned_loss=0.0786, over 4274524.33 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:08:20,994 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-26 02:09:12,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2119566.0, ans=0.125 2023-06-26 02:09:20,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2119566.0, ans=0.125 2023-06-26 02:09:26,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119626.0, ans=0.1 2023-06-26 02:10:04,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119686.0, ans=0.1 2023-06-26 02:10:07,325 INFO [train.py:996] (1/4) Epoch 12, batch 17850, loss[loss=0.2402, simple_loss=0.314, pruned_loss=0.08318, over 21267.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3166, pruned_loss=0.07882, over 4272940.66 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:10:46,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.79 vs. limit=10.0 2023-06-26 02:10:57,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2119866.0, ans=0.0 2023-06-26 02:10:58,031 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 1.072e+03 1.677e+03 2.667e+03 5.853e+03, threshold=3.353e+03, percent-clipped=21.0 2023-06-26 02:12:03,569 INFO [train.py:996] (1/4) Epoch 12, batch 17900, loss[loss=0.2713, simple_loss=0.3644, pruned_loss=0.08908, over 21744.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3198, pruned_loss=0.08, over 4264979.78 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:13:05,060 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-26 02:13:14,997 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. 
limit=15.0 2023-06-26 02:13:26,211 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2120226.0, ans=0.5 2023-06-26 02:13:59,318 INFO [train.py:996] (1/4) Epoch 12, batch 17950, loss[loss=0.2025, simple_loss=0.2965, pruned_loss=0.0543, over 21656.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3198, pruned_loss=0.07692, over 4267251.15 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:14:20,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2120406.0, ans=0.125 2023-06-26 02:14:44,282 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.194e+02 1.009e+03 1.518e+03 2.027e+03 4.283e+03, threshold=3.036e+03, percent-clipped=1.0 2023-06-26 02:14:44,828 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2120466.0, ans=0.0 2023-06-26 02:14:59,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2120466.0, ans=0.2 2023-06-26 02:15:00,505 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2120526.0, ans=0.0 2023-06-26 02:15:40,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 02:15:45,158 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2120586.0, ans=0.125 2023-06-26 02:15:49,525 INFO [train.py:996] (1/4) Epoch 12, batch 18000, loss[loss=0.2021, simple_loss=0.2626, pruned_loss=0.07081, over 20719.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3127, pruned_loss=0.07502, over 4270364.26 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:15:49,526 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 02:16:07,681 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.258, simple_loss=0.3529, pruned_loss=0.08158, over 1796401.00 frames. 2023-06-26 02:16:07,682 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-26 02:16:36,880 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:16:57,379 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2120766.0, ans=0.125 2023-06-26 02:17:16,690 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2120826.0, ans=0.125 2023-06-26 02:17:24,613 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-26 02:17:37,769 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2120886.0, ans=0.1 2023-06-26 02:17:50,058 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-26 02:17:50,148 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.77 vs. 
limit=22.5 2023-06-26 02:17:55,810 INFO [train.py:996] (1/4) Epoch 12, batch 18050, loss[loss=0.2291, simple_loss=0.3035, pruned_loss=0.07736, over 21558.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3063, pruned_loss=0.07399, over 4270288.61 frames. ], batch size: 415, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:18:31,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2121006.0, ans=0.125 2023-06-26 02:18:38,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2121006.0, ans=0.015 2023-06-26 02:18:55,876 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.179e+02 8.208e+02 1.156e+03 1.748e+03 3.501e+03, threshold=2.312e+03, percent-clipped=3.0 2023-06-26 02:19:09,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-26 02:19:24,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2121126.0, ans=0.125 2023-06-26 02:19:50,070 INFO [train.py:996] (1/4) Epoch 12, batch 18100, loss[loss=0.2471, simple_loss=0.3413, pruned_loss=0.07645, over 21268.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3097, pruned_loss=0.07623, over 4274079.65 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:19:51,181 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-26 02:20:21,391 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2121306.0, ans=0.125 2023-06-26 02:20:26,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2121306.0, ans=0.125 2023-06-26 02:20:51,757 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2121366.0, ans=0.0 2023-06-26 02:21:20,351 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2121486.0, ans=0.2 2023-06-26 02:21:28,811 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2121486.0, ans=0.125 2023-06-26 02:21:36,994 INFO [train.py:996] (1/4) Epoch 12, batch 18150, loss[loss=0.2413, simple_loss=0.311, pruned_loss=0.08579, over 21678.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3126, pruned_loss=0.07674, over 4274804.51 frames. 
], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:21:37,570 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2121546.0, ans=0.0 2023-06-26 02:22:36,068 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.496e+02 9.455e+02 1.549e+03 2.118e+03 3.915e+03, threshold=3.098e+03, percent-clipped=17.0 2023-06-26 02:22:48,418 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2121726.0, ans=0.125 2023-06-26 02:22:50,612 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2121726.0, ans=0.125 2023-06-26 02:23:01,541 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-26 02:23:24,021 INFO [train.py:996] (1/4) Epoch 12, batch 18200, loss[loss=0.1891, simple_loss=0.265, pruned_loss=0.05657, over 21670.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3065, pruned_loss=0.07689, over 4276464.64 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:24:07,839 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-26 02:24:27,621 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-26 02:25:01,795 INFO [train.py:996] (1/4) Epoch 12, batch 18250, loss[loss=0.1542, simple_loss=0.2349, pruned_loss=0.03673, over 21588.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2996, pruned_loss=0.07486, over 4272847.23 frames. ], batch size: 132, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:25:02,441 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:25:24,296 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-26 02:25:39,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2122206.0, ans=0.2 2023-06-26 02:25:51,194 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2122266.0, ans=0.0 2023-06-26 02:25:53,715 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-26 02:25:54,180 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 8.526e+02 1.132e+03 1.532e+03 3.016e+03, threshold=2.265e+03, percent-clipped=0.0 2023-06-26 02:26:40,841 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2122386.0, ans=0.125 2023-06-26 02:26:48,433 INFO [train.py:996] (1/4) Epoch 12, batch 18300, loss[loss=0.2126, simple_loss=0.2822, pruned_loss=0.07149, over 21293.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3005, pruned_loss=0.07414, over 4271601.17 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:27:08,722 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-26 02:28:07,468 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-26 02:28:15,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2122626.0, ans=0.1 2023-06-26 02:28:27,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2122686.0, ans=15.0 2023-06-26 02:28:34,416 INFO [train.py:996] (1/4) Epoch 12, batch 18350, loss[loss=0.2285, simple_loss=0.2978, pruned_loss=0.07961, over 21721.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3058, pruned_loss=0.07448, over 4279641.35 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:28:38,541 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2122746.0, ans=0.125 2023-06-26 02:28:42,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=12.0 2023-06-26 02:29:14,240 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2122806.0, ans=0.125 2023-06-26 02:29:31,208 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 1.279e+03 1.909e+03 2.961e+03 4.815e+03, threshold=3.819e+03, percent-clipped=39.0 2023-06-26 02:29:33,551 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2122866.0, ans=0.125 2023-06-26 02:29:45,053 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=2122926.0, ans=15.0 2023-06-26 02:30:26,218 INFO [train.py:996] (1/4) Epoch 12, batch 18400, loss[loss=0.2111, simple_loss=0.2915, pruned_loss=0.06539, over 21589.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3016, pruned_loss=0.07323, over 4270428.93 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:31:52,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-26 02:31:53,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2123226.0, ans=0.125 2023-06-26 02:32:13,324 INFO [train.py:996] (1/4) Epoch 12, batch 18450, loss[loss=0.2004, simple_loss=0.2824, pruned_loss=0.05915, over 21635.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2979, pruned_loss=0.06977, over 4267590.92 frames. 
], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:32:13,952 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2123346.0, ans=10.0 2023-06-26 02:32:43,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2123406.0, ans=0.125 2023-06-26 02:32:45,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2123406.0, ans=0.0 2023-06-26 02:33:07,630 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.751e+02 8.419e+02 1.219e+03 1.812e+03 4.554e+03, threshold=2.437e+03, percent-clipped=1.0 2023-06-26 02:34:00,078 INFO [train.py:996] (1/4) Epoch 12, batch 18500, loss[loss=0.1919, simple_loss=0.2574, pruned_loss=0.06318, over 21438.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2942, pruned_loss=0.06896, over 4270232.68 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:34:06,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2123646.0, ans=0.025 2023-06-26 02:34:07,739 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2123646.0, ans=0.125 2023-06-26 02:35:05,948 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2123766.0, ans=0.05 2023-06-26 02:35:42,088 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2123886.0, ans=0.125 2023-06-26 02:35:50,216 INFO [train.py:996] (1/4) Epoch 12, batch 18550, loss[loss=0.2269, simple_loss=0.2844, pruned_loss=0.08473, over 21849.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2923, pruned_loss=0.0682, over 4262499.97 frames. ], batch size: 107, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:35:50,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2123946.0, ans=0.125 2023-06-26 02:35:51,501 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-26 02:36:28,201 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2124006.0, ans=0.125 2023-06-26 02:36:59,194 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 1.196e+03 2.047e+03 2.769e+03 5.158e+03, threshold=4.094e+03, percent-clipped=37.0 2023-06-26 02:37:06,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2124126.0, ans=0.05 2023-06-26 02:37:25,908 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2124186.0, ans=0.0 2023-06-26 02:37:33,636 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=22.5 2023-06-26 02:37:46,699 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2124246.0, ans=0.1 2023-06-26 02:37:47,745 INFO [train.py:996] (1/4) Epoch 12, batch 18600, loss[loss=0.2444, simple_loss=0.3309, pruned_loss=0.0789, over 21609.00 frames. 
], tot_loss[loss=0.2142, simple_loss=0.29, pruned_loss=0.06913, over 4266104.27 frames. ], batch size: 442, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:37:58,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2124246.0, ans=0.2 2023-06-26 02:38:37,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2124366.0, ans=0.125 2023-06-26 02:38:38,433 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-26 02:39:31,000 INFO [train.py:996] (1/4) Epoch 12, batch 18650, loss[loss=0.1967, simple_loss=0.2663, pruned_loss=0.06356, over 21439.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2886, pruned_loss=0.069, over 4240580.47 frames. ], batch size: 212, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:40:30,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 8.452e+02 1.273e+03 1.829e+03 4.021e+03, threshold=2.546e+03, percent-clipped=0.0 2023-06-26 02:41:12,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2124786.0, ans=0.0 2023-06-26 02:41:20,345 INFO [train.py:996] (1/4) Epoch 12, batch 18700, loss[loss=0.2208, simple_loss=0.2737, pruned_loss=0.0839, over 21465.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2862, pruned_loss=0.07029, over 4250531.61 frames. ], batch size: 195, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:42:20,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2124966.0, ans=0.0 2023-06-26 02:42:45,972 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2125026.0, ans=0.0 2023-06-26 02:43:03,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2125086.0, ans=0.125 2023-06-26 02:43:09,806 INFO [train.py:996] (1/4) Epoch 12, batch 18750, loss[loss=0.2166, simple_loss=0.2814, pruned_loss=0.07585, over 21832.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2881, pruned_loss=0.07284, over 4261395.81 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:44:08,701 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 9.819e+02 1.428e+03 2.632e+03 5.661e+03, threshold=2.856e+03, percent-clipped=25.0 2023-06-26 02:44:10,833 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2125266.0, ans=0.125 2023-06-26 02:44:18,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2125326.0, ans=0.125 2023-06-26 02:44:41,125 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2125386.0, ans=0.2 2023-06-26 02:44:43,651 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=22.5 2023-06-26 02:44:57,217 INFO [train.py:996] (1/4) Epoch 12, batch 18800, loss[loss=0.2374, simple_loss=0.3229, pruned_loss=0.07594, over 21750.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2971, pruned_loss=0.07478, over 4264599.64 frames. 
], batch size: 298, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:46:26,514 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:46:41,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=2125686.0, ans=0.2 2023-06-26 02:46:44,217 INFO [train.py:996] (1/4) Epoch 12, batch 18850, loss[loss=0.2062, simple_loss=0.2797, pruned_loss=0.06637, over 20797.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.295, pruned_loss=0.07064, over 4253875.31 frames. ], batch size: 609, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:46:52,446 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-26 02:47:03,576 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2125746.0, ans=0.125 2023-06-26 02:47:45,907 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 8.171e+02 1.119e+03 1.966e+03 5.674e+03, threshold=2.238e+03, percent-clipped=7.0 2023-06-26 02:48:31,739 INFO [train.py:996] (1/4) Epoch 12, batch 18900, loss[loss=0.2015, simple_loss=0.2619, pruned_loss=0.07052, over 21466.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2902, pruned_loss=0.07019, over 4246850.05 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:48:52,182 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-26 02:50:07,801 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2126286.0, ans=0.0 2023-06-26 02:50:09,369 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126286.0, ans=0.1 2023-06-26 02:50:19,378 INFO [train.py:996] (1/4) Epoch 12, batch 18950, loss[loss=0.1853, simple_loss=0.2417, pruned_loss=0.06441, over 20239.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2892, pruned_loss=0.07168, over 4255258.43 frames. 
], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:50:31,932 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:50:51,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2126406.0, ans=0.125 2023-06-26 02:51:02,272 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2126406.0, ans=0.125 2023-06-26 02:51:16,728 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2126466.0, ans=0.0 2023-06-26 02:51:21,185 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.969e+02 7.601e+02 1.025e+03 1.566e+03 4.291e+03, threshold=2.050e+03, percent-clipped=8.0 2023-06-26 02:51:23,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2126466.0, ans=0.125 2023-06-26 02:51:44,245 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2126526.0, ans=10.0 2023-06-26 02:52:07,276 INFO [train.py:996] (1/4) Epoch 12, batch 19000, loss[loss=0.2338, simple_loss=0.3169, pruned_loss=0.07537, over 21305.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2982, pruned_loss=0.07389, over 4256917.43 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:52:22,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2126646.0, ans=0.0 2023-06-26 02:52:26,422 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2126646.0, ans=0.125 2023-06-26 02:53:55,648 INFO [train.py:996] (1/4) Epoch 12, batch 19050, loss[loss=0.2485, simple_loss=0.3062, pruned_loss=0.09538, over 21305.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3031, pruned_loss=0.07711, over 4263276.16 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:54:29,871 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=8.144e-03 2023-06-26 02:55:00,806 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 8.291e+02 1.142e+03 1.622e+03 3.399e+03, threshold=2.283e+03, percent-clipped=17.0 2023-06-26 02:55:11,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2127126.0, ans=0.125 2023-06-26 02:55:38,498 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2127186.0, ans=0.1 2023-06-26 02:55:40,685 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-26 02:55:46,508 INFO [train.py:996] (1/4) Epoch 12, batch 19100, loss[loss=0.2063, simple_loss=0.2684, pruned_loss=0.07207, over 21610.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3024, pruned_loss=0.07882, over 4264326.44 frames. 
], batch size: 231, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:56:27,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2127306.0, ans=0.0 2023-06-26 02:56:32,763 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:57:18,690 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:57:20,323 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2127486.0, ans=0.05 2023-06-26 02:57:49,141 INFO [train.py:996] (1/4) Epoch 12, batch 19150, loss[loss=0.2083, simple_loss=0.2944, pruned_loss=0.06113, over 21229.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3054, pruned_loss=0.07943, over 4263848.03 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:58:03,318 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2127546.0, ans=0.125 2023-06-26 02:58:42,195 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-26 02:58:50,838 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.298e+02 9.303e+02 1.285e+03 2.071e+03 6.086e+03, threshold=2.570e+03, percent-clipped=20.0 2023-06-26 02:58:59,441 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2127726.0, ans=0.125 2023-06-26 02:59:02,947 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2127726.0, ans=0.0 2023-06-26 02:59:44,206 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2127786.0, ans=0.0 2023-06-26 02:59:48,666 INFO [train.py:996] (1/4) Epoch 12, batch 19200, loss[loss=0.3187, simple_loss=0.4082, pruned_loss=0.1146, over 21642.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3154, pruned_loss=0.08024, over 4262354.57 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 03:01:36,779 INFO [train.py:996] (1/4) Epoch 12, batch 19250, loss[loss=0.1736, simple_loss=0.2684, pruned_loss=0.03941, over 21386.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3166, pruned_loss=0.07582, over 4267391.12 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 03:01:59,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2128206.0, ans=0.1 2023-06-26 03:02:28,886 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 8.401e+02 1.172e+03 1.980e+03 3.719e+03, threshold=2.345e+03, percent-clipped=11.0 2023-06-26 03:03:16,162 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2128386.0, ans=0.0 2023-06-26 03:03:18,857 INFO [train.py:996] (1/4) Epoch 12, batch 19300, loss[loss=0.1716, simple_loss=0.255, pruned_loss=0.04414, over 21491.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3132, pruned_loss=0.07429, over 4265304.89 frames. 
], batch size: 195, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:04:22,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2128626.0, ans=0.0 2023-06-26 03:04:56,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2128686.0, ans=0.125 2023-06-26 03:05:09,208 INFO [train.py:996] (1/4) Epoch 12, batch 19350, loss[loss=0.1789, simple_loss=0.2534, pruned_loss=0.0522, over 21163.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3091, pruned_loss=0.07089, over 4268742.33 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:05:12,877 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2128746.0, ans=0.2 2023-06-26 03:05:22,953 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2128746.0, ans=0.0 2023-06-26 03:05:37,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2128806.0, ans=0.125 2023-06-26 03:05:58,083 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128866.0, ans=0.1 2023-06-26 03:06:01,932 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-26 03:06:02,467 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.957e+02 9.080e+02 1.347e+03 2.318e+03 4.849e+03, threshold=2.694e+03, percent-clipped=24.0 2023-06-26 03:06:21,936 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2128926.0, ans=0.125 2023-06-26 03:06:51,614 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-26 03:06:57,169 INFO [train.py:996] (1/4) Epoch 12, batch 19400, loss[loss=0.2533, simple_loss=0.3181, pruned_loss=0.09423, over 21927.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3053, pruned_loss=0.0699, over 4277658.20 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:07:07,930 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2129046.0, ans=0.125 2023-06-26 03:07:25,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2129106.0, ans=0.1 2023-06-26 03:08:05,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2129226.0, ans=0.1 2023-06-26 03:08:45,425 INFO [train.py:996] (1/4) Epoch 12, batch 19450, loss[loss=0.1957, simple_loss=0.2562, pruned_loss=0.06756, over 21576.00 frames. ], tot_loss[loss=0.223, simple_loss=0.303, pruned_loss=0.07147, over 4285725.33 frames. 
], batch size: 247, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:08:59,957 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2129346.0, ans=0.125 2023-06-26 03:09:28,320 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:09:38,485 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.528e+02 8.932e+02 1.244e+03 1.603e+03 3.427e+03, threshold=2.488e+03, percent-clipped=5.0 2023-06-26 03:09:39,876 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.26 vs. limit=15.0 2023-06-26 03:09:51,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2129526.0, ans=0.07 2023-06-26 03:10:12,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2129586.0, ans=0.125 2023-06-26 03:10:32,566 INFO [train.py:996] (1/4) Epoch 12, batch 19500, loss[loss=0.1704, simple_loss=0.2108, pruned_loss=0.06496, over 16489.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2988, pruned_loss=0.07245, over 4263624.19 frames. ], batch size: 62, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:10:38,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2129646.0, ans=0.125 2023-06-26 03:10:43,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2129646.0, ans=0.035 2023-06-26 03:12:21,293 INFO [train.py:996] (1/4) Epoch 12, batch 19550, loss[loss=0.2117, simple_loss=0.3101, pruned_loss=0.05668, over 20861.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2943, pruned_loss=0.071, over 4264047.72 frames. ], batch size: 609, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:13:15,146 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.254e+02 9.637e+02 1.286e+03 1.805e+03 3.756e+03, threshold=2.572e+03, percent-clipped=14.0 2023-06-26 03:14:09,967 INFO [train.py:996] (1/4) Epoch 12, batch 19600, loss[loss=0.2656, simple_loss=0.3345, pruned_loss=0.09832, over 21902.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2958, pruned_loss=0.07162, over 4273053.90 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:15:25,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2130426.0, ans=0.125 2023-06-26 03:16:00,204 INFO [train.py:996] (1/4) Epoch 12, batch 19650, loss[loss=0.2216, simple_loss=0.2885, pruned_loss=0.0774, over 21636.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3003, pruned_loss=0.07545, over 4281915.82 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:16:44,351 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. 
limit=6.0 2023-06-26 03:17:06,817 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.323e+02 8.252e+02 1.350e+03 1.732e+03 4.354e+03, threshold=2.700e+03, percent-clipped=5.0 2023-06-26 03:17:59,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2130846.0, ans=0.125 2023-06-26 03:18:00,770 INFO [train.py:996] (1/4) Epoch 12, batch 19700, loss[loss=0.2293, simple_loss=0.3208, pruned_loss=0.0689, over 21719.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3038, pruned_loss=0.07605, over 4273554.10 frames. ], batch size: 415, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:19:14,993 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2131026.0, ans=0.0 2023-06-26 03:19:15,041 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2131026.0, ans=0.125 2023-06-26 03:19:37,555 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:19:47,026 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2131086.0, ans=0.125 2023-06-26 03:19:47,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2131086.0, ans=10.0 2023-06-26 03:19:50,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2131146.0, ans=0.125 2023-06-26 03:19:50,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2131146.0, ans=0.125 2023-06-26 03:19:51,717 INFO [train.py:996] (1/4) Epoch 12, batch 19750, loss[loss=0.2489, simple_loss=0.3375, pruned_loss=0.08017, over 21489.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3139, pruned_loss=0.07764, over 4269125.06 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:19:54,021 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2131146.0, ans=0.0 2023-06-26 03:20:09,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2131146.0, ans=0.125 2023-06-26 03:20:10,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2131146.0, ans=0.0 2023-06-26 03:20:59,655 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.834e+02 9.206e+02 1.478e+03 2.437e+03 4.883e+03, threshold=2.956e+03, percent-clipped=21.0 2023-06-26 03:21:16,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.79 vs. limit=8.0 2023-06-26 03:21:27,550 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-26 03:21:41,942 INFO [train.py:996] (1/4) Epoch 12, batch 19800, loss[loss=0.2455, simple_loss=0.3213, pruned_loss=0.08486, over 21688.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3152, pruned_loss=0.07894, over 4269798.97 frames. 
], batch size: 389, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:22:32,132 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2131566.0, ans=0.125 2023-06-26 03:23:20,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131686.0, ans=0.1 2023-06-26 03:23:33,445 INFO [train.py:996] (1/4) Epoch 12, batch 19850, loss[loss=0.1878, simple_loss=0.2792, pruned_loss=0.04822, over 21748.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3056, pruned_loss=0.07364, over 4274524.54 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:24:27,606 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2131866.0, ans=0.0 2023-06-26 03:24:28,305 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=15.0 2023-06-26 03:24:41,356 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.499e+02 1.416e+03 2.010e+03 4.711e+03, threshold=2.833e+03, percent-clipped=4.0 2023-06-26 03:25:29,032 INFO [train.py:996] (1/4) Epoch 12, batch 19900, loss[loss=0.2589, simple_loss=0.3733, pruned_loss=0.07219, over 19811.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3078, pruned_loss=0.07165, over 4269614.56 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:25:36,096 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2132046.0, ans=0.125 2023-06-26 03:25:42,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132046.0, ans=0.1 2023-06-26 03:25:47,037 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-26 03:26:04,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2132106.0, ans=0.125 2023-06-26 03:26:25,989 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:26:40,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2132226.0, ans=0.0 2023-06-26 03:26:47,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2132226.0, ans=0.0 2023-06-26 03:26:47,894 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-26 03:27:24,346 INFO [train.py:996] (1/4) Epoch 12, batch 19950, loss[loss=0.2263, simple_loss=0.2878, pruned_loss=0.08237, over 21774.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3019, pruned_loss=0.0718, over 4265064.42 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:27:47,151 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2132406.0, ans=0.07 2023-06-26 03:28:25,140 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.68 vs. 
limit=8.0 2023-06-26 03:28:29,306 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 8.702e+02 1.204e+03 1.763e+03 4.092e+03, threshold=2.408e+03, percent-clipped=5.0 2023-06-26 03:28:47,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2132586.0, ans=0.125 2023-06-26 03:29:17,281 INFO [train.py:996] (1/4) Epoch 12, batch 20000, loss[loss=0.2248, simple_loss=0.3226, pruned_loss=0.06354, over 21741.00 frames. ], tot_loss[loss=0.223, simple_loss=0.302, pruned_loss=0.072, over 4263738.82 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:29:19,709 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2132646.0, ans=0.125 2023-06-26 03:29:23,128 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2132646.0, ans=0.125 2023-06-26 03:29:49,670 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132706.0, ans=0.1 2023-06-26 03:29:57,883 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2132766.0, ans=0.2 2023-06-26 03:30:01,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2132766.0, ans=0.09899494936611666 2023-06-26 03:30:20,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2132826.0, ans=0.125 2023-06-26 03:30:28,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2132826.0, ans=0.0 2023-06-26 03:30:32,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2132826.0, ans=0.125 2023-06-26 03:30:46,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2132886.0, ans=0.1 2023-06-26 03:30:57,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2132886.0, ans=0.2 2023-06-26 03:31:06,159 INFO [train.py:996] (1/4) Epoch 12, batch 20050, loss[loss=0.2118, simple_loss=0.2905, pruned_loss=0.06661, over 21892.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3048, pruned_loss=0.07465, over 4274155.15 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:32:07,825 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 9.408e+02 1.346e+03 1.718e+03 3.911e+03, threshold=2.692e+03, percent-clipped=11.0 2023-06-26 03:32:13,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2133126.0, ans=0.125 2023-06-26 03:32:21,343 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-26 03:32:42,272 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-26 03:32:55,549 INFO [train.py:996] (1/4) Epoch 12, batch 20100, loss[loss=0.2116, simple_loss=0.2926, pruned_loss=0.06533, over 21206.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3066, pruned_loss=0.07675, over 4287043.61 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:33:15,834 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2133246.0, ans=0.125 2023-06-26 03:34:17,255 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2133426.0, ans=0.125 2023-06-26 03:34:17,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2133426.0, ans=0.2 2023-06-26 03:34:46,178 INFO [train.py:996] (1/4) Epoch 12, batch 20150, loss[loss=0.2657, simple_loss=0.3519, pruned_loss=0.08974, over 21823.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3139, pruned_loss=0.08057, over 4280784.41 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:35:04,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2133546.0, ans=0.1 2023-06-26 03:35:15,642 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2133606.0, ans=0.2 2023-06-26 03:35:46,175 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.75 vs. limit=22.5 2023-06-26 03:36:04,702 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.317e+02 8.895e+02 1.198e+03 1.726e+03 5.010e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-26 03:36:52,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2133846.0, ans=0.125 2023-06-26 03:36:53,599 INFO [train.py:996] (1/4) Epoch 12, batch 20200, loss[loss=0.3037, simple_loss=0.3887, pruned_loss=0.1093, over 21538.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3198, pruned_loss=0.08298, over 4277120.09 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:36:56,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2133846.0, ans=0.0 2023-06-26 03:36:56,160 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2133846.0, ans=0.125 2023-06-26 03:37:52,003 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:38:09,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2134026.0, ans=0.125 2023-06-26 03:38:46,332 INFO [train.py:996] (1/4) Epoch 12, batch 20250, loss[loss=0.2326, simple_loss=0.3069, pruned_loss=0.07913, over 21454.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3203, pruned_loss=0.08154, over 4272602.41 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:39:29,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=12.0 2023-06-26 03:39:32,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2134266.0, ans=0.125 2023-06-26 03:39:49,453 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.636e+02 8.204e+02 1.224e+03 2.058e+03 5.091e+03, threshold=2.449e+03, percent-clipped=18.0 2023-06-26 03:40:38,155 INFO [train.py:996] (1/4) Epoch 12, batch 20300, loss[loss=0.2392, simple_loss=0.3302, pruned_loss=0.07406, over 21590.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3176, pruned_loss=0.07876, over 4262517.38 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:40:52,940 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:41:40,215 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-26 03:42:27,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2134746.0, ans=0.125 2023-06-26 03:42:28,296 INFO [train.py:996] (1/4) Epoch 12, batch 20350, loss[loss=0.2361, simple_loss=0.3069, pruned_loss=0.08264, over 21764.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3184, pruned_loss=0.0795, over 4267228.20 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:43:06,970 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2134806.0, ans=0.0 2023-06-26 03:43:31,519 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 1.038e+03 1.419e+03 2.103e+03 3.160e+03, threshold=2.839e+03, percent-clipped=11.0 2023-06-26 03:43:33,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2134926.0, ans=0.0 2023-06-26 03:43:41,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2134926.0, ans=0.125 2023-06-26 03:44:05,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2134986.0, ans=0.0 2023-06-26 03:44:17,371 INFO [train.py:996] (1/4) Epoch 12, batch 20400, loss[loss=0.2671, simple_loss=0.3297, pruned_loss=0.1023, over 21491.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3219, pruned_loss=0.0822, over 4264682.54 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:44:31,472 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2135046.0, ans=0.0 2023-06-26 03:44:43,344 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:44:47,575 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2135106.0, ans=15.0 2023-06-26 03:45:29,893 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2135226.0, ans=0.125 2023-06-26 03:46:02,275 INFO [train.py:996] (1/4) Epoch 12, batch 20450, loss[loss=0.2196, simple_loss=0.2967, pruned_loss=0.07127, over 21876.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3223, pruned_loss=0.08453, over 4258954.95 frames. 
], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:46:02,904 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2135346.0, ans=0.125 2023-06-26 03:46:16,071 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.81 vs. limit=10.0 2023-06-26 03:46:20,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135406.0, ans=0.125 2023-06-26 03:46:57,366 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-26 03:47:05,914 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.613e+02 8.799e+02 1.280e+03 2.012e+03 4.043e+03, threshold=2.560e+03, percent-clipped=9.0 2023-06-26 03:47:30,118 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2135586.0, ans=0.5 2023-06-26 03:47:52,260 INFO [train.py:996] (1/4) Epoch 12, batch 20500, loss[loss=0.2393, simple_loss=0.3057, pruned_loss=0.08645, over 21811.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3182, pruned_loss=0.08426, over 4257213.86 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:49:41,386 INFO [train.py:996] (1/4) Epoch 12, batch 20550, loss[loss=0.1991, simple_loss=0.2821, pruned_loss=0.05806, over 21413.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3113, pruned_loss=0.08218, over 4246247.28 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:49:41,954 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2135946.0, ans=0.09899494936611666 2023-06-26 03:49:44,942 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-26 03:49:47,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2135946.0, ans=0.125 2023-06-26 03:50:06,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2136006.0, ans=0.125 2023-06-26 03:50:12,213 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2136006.0, ans=0.0 2023-06-26 03:50:48,920 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.549e+02 9.305e+02 1.211e+03 1.898e+03 4.893e+03, threshold=2.421e+03, percent-clipped=7.0 2023-06-26 03:50:52,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2136126.0, ans=0.125 2023-06-26 03:51:00,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2136126.0, ans=0.2 2023-06-26 03:51:32,371 INFO [train.py:996] (1/4) Epoch 12, batch 20600, loss[loss=0.2539, simple_loss=0.3224, pruned_loss=0.09271, over 21587.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3125, pruned_loss=0.08016, over 4248487.02 frames. 
], batch size: 471, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:51:32,592 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2136246.0, ans=0.125 2023-06-26 03:51:34,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2136246.0, ans=0.0 2023-06-26 03:51:36,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2136246.0, ans=0.125 2023-06-26 03:52:11,671 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:53:19,384 INFO [train.py:996] (1/4) Epoch 12, batch 20650, loss[loss=0.1928, simple_loss=0.2619, pruned_loss=0.06183, over 21668.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3078, pruned_loss=0.07976, over 4237580.41 frames. ], batch size: 332, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:53:43,436 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2136606.0, ans=0.2 2023-06-26 03:54:03,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2136666.0, ans=0.125 2023-06-26 03:54:23,262 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.638e+02 7.911e+02 1.093e+03 1.392e+03 2.795e+03, threshold=2.187e+03, percent-clipped=3.0 2023-06-26 03:54:40,959 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2136726.0, ans=0.0 2023-06-26 03:55:06,922 INFO [train.py:996] (1/4) Epoch 12, batch 20700, loss[loss=0.1832, simple_loss=0.252, pruned_loss=0.05724, over 21474.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3003, pruned_loss=0.07662, over 4236022.31 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:56:21,462 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2137026.0, ans=0.125 2023-06-26 03:56:48,801 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:56:57,179 INFO [train.py:996] (1/4) Epoch 12, batch 20750, loss[loss=0.2207, simple_loss=0.3097, pruned_loss=0.06579, over 21738.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3013, pruned_loss=0.07522, over 4240800.89 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:57:06,641 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2137146.0, ans=0.125 2023-06-26 03:57:41,794 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-26 03:58:14,460 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.764e+02 1.037e+03 1.665e+03 2.328e+03 7.151e+03, threshold=3.329e+03, percent-clipped=27.0 2023-06-26 03:58:35,514 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=22.5 2023-06-26 03:58:51,628 INFO [train.py:996] (1/4) Epoch 12, batch 20800, loss[loss=0.2197, simple_loss=0.2836, pruned_loss=0.07787, over 21652.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3048, pruned_loss=0.07621, over 4245484.85 frames. 
], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:58:52,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2137446.0, ans=0.035 2023-06-26 03:59:21,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2137506.0, ans=0.2 2023-06-26 03:59:41,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-26 04:00:34,754 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2137686.0, ans=0.125 2023-06-26 04:00:39,521 INFO [train.py:996] (1/4) Epoch 12, batch 20850, loss[loss=0.248, simple_loss=0.3047, pruned_loss=0.09566, over 21619.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2963, pruned_loss=0.07364, over 4241996.72 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:00:41,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2137746.0, ans=0.1 2023-06-26 04:01:13,998 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-26 04:01:49,783 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 7.660e+02 1.067e+03 1.526e+03 3.659e+03, threshold=2.133e+03, percent-clipped=1.0 2023-06-26 04:01:57,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2137926.0, ans=0.125 2023-06-26 04:02:27,174 INFO [train.py:996] (1/4) Epoch 12, batch 20900, loss[loss=0.2342, simple_loss=0.312, pruned_loss=0.07821, over 21272.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2987, pruned_loss=0.07553, over 4255276.82 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:02:55,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2138106.0, ans=0.0 2023-06-26 04:02:56,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2138106.0, ans=10.0 2023-06-26 04:03:18,627 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2138166.0, ans=0.0 2023-06-26 04:03:32,556 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2138226.0, ans=0.125 2023-06-26 04:03:46,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2138226.0, ans=0.0 2023-06-26 04:03:51,420 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2138286.0, ans=0.0 2023-06-26 04:04:06,631 INFO [train.py:996] (1/4) Epoch 12, batch 20950, loss[loss=0.2217, simple_loss=0.2871, pruned_loss=0.07818, over 21898.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2951, pruned_loss=0.07255, over 4253598.11 frames. 
], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:05:00,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2138466.0, ans=0.125 2023-06-26 04:05:18,160 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.308e+02 8.469e+02 1.535e+03 2.193e+03 7.053e+03, threshold=3.069e+03, percent-clipped=28.0 2023-06-26 04:05:20,947 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-26 04:05:32,438 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2138526.0, ans=0.0 2023-06-26 04:05:32,547 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2138526.0, ans=0.125 2023-06-26 04:05:38,991 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2138586.0, ans=0.125 2023-06-26 04:05:52,941 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2138646.0, ans=0.125 2023-06-26 04:05:53,930 INFO [train.py:996] (1/4) Epoch 12, batch 21000, loss[loss=0.235, simple_loss=0.3023, pruned_loss=0.08383, over 21900.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2947, pruned_loss=0.07253, over 4257788.10 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:05:53,931 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 04:06:16,505 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2617, simple_loss=0.359, pruned_loss=0.08218, over 1796401.00 frames. 2023-06-26 04:06:16,505 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-26 04:06:31,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2138646.0, ans=0.0 2023-06-26 04:06:44,477 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2138706.0, ans=0.0 2023-06-26 04:07:52,146 INFO [train.py:996] (1/4) Epoch 12, batch 21050, loss[loss=0.1968, simple_loss=0.2659, pruned_loss=0.06385, over 21155.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.293, pruned_loss=0.07362, over 4265063.48 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:08:50,081 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2139066.0, ans=0.125 2023-06-26 04:08:56,409 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.279e+02 7.328e+02 1.060e+03 1.380e+03 3.297e+03, threshold=2.119e+03, percent-clipped=1.0 2023-06-26 04:09:07,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2139126.0, ans=0.125 2023-06-26 04:09:37,014 INFO [train.py:996] (1/4) Epoch 12, batch 21100, loss[loss=0.2049, simple_loss=0.2754, pruned_loss=0.06723, over 21558.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2887, pruned_loss=0.07251, over 4255389.36 frames. 
], batch size: 414, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:09:55,868 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2139246.0, ans=0.125 2023-06-26 04:11:22,948 INFO [train.py:996] (1/4) Epoch 12, batch 21150, loss[loss=0.2466, simple_loss=0.3049, pruned_loss=0.09417, over 21724.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2847, pruned_loss=0.07294, over 4263430.17 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:11:44,200 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2139606.0, ans=0.125 2023-06-26 04:11:53,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139606.0, ans=0.1 2023-06-26 04:11:56,717 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-26 04:11:59,600 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2139606.0, ans=0.2 2023-06-26 04:12:28,507 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 8.733e+02 1.101e+03 1.484e+03 2.918e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-26 04:12:44,764 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2139726.0, ans=0.0 2023-06-26 04:12:59,395 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139786.0, ans=0.1 2023-06-26 04:13:08,703 INFO [train.py:996] (1/4) Epoch 12, batch 21200, loss[loss=0.219, simple_loss=0.2837, pruned_loss=0.07712, over 21907.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2823, pruned_loss=0.07204, over 4252008.85 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:13:20,839 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2139846.0, ans=0.5 2023-06-26 04:13:37,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2139906.0, ans=0.125 2023-06-26 04:13:40,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2139906.0, ans=0.125 2023-06-26 04:14:22,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140026.0, ans=0.1 2023-06-26 04:14:26,103 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140026.0, ans=0.1 2023-06-26 04:14:51,536 INFO [train.py:996] (1/4) Epoch 12, batch 21250, loss[loss=0.2483, simple_loss=0.3203, pruned_loss=0.08814, over 21738.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2798, pruned_loss=0.07147, over 4255834.61 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:15:16,963 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2140146.0, ans=0.1 2023-06-26 04:15:48,637 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. 
limit=6.0 2023-06-26 04:15:56,785 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2140266.0, ans=0.125 2023-06-26 04:15:56,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2140266.0, ans=0.125 2023-06-26 04:15:58,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2140326.0, ans=0.125 2023-06-26 04:16:07,976 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.686e+02 9.766e+02 1.414e+03 1.888e+03 3.901e+03, threshold=2.827e+03, percent-clipped=19.0 2023-06-26 04:16:10,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140326.0, ans=0.1 2023-06-26 04:16:39,737 INFO [train.py:996] (1/4) Epoch 12, batch 21300, loss[loss=0.2236, simple_loss=0.2995, pruned_loss=0.07386, over 21575.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2865, pruned_loss=0.07358, over 4265078.92 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:16:41,174 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.16 vs. limit=15.0 2023-06-26 04:17:34,072 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2140566.0, ans=0.0 2023-06-26 04:18:24,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2140686.0, ans=0.125 2023-06-26 04:18:37,773 INFO [train.py:996] (1/4) Epoch 12, batch 21350, loss[loss=0.2832, simple_loss=0.3632, pruned_loss=0.1016, over 21509.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2922, pruned_loss=0.07433, over 4276295.89 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:18:59,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2140746.0, ans=0.0 2023-06-26 04:19:52,130 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.340e+02 8.083e+02 1.206e+03 1.998e+03 5.884e+03, threshold=2.412e+03, percent-clipped=11.0 2023-06-26 04:20:15,259 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2140986.0, ans=0.2 2023-06-26 04:20:34,818 INFO [train.py:996] (1/4) Epoch 12, batch 21400, loss[loss=0.2358, simple_loss=0.3146, pruned_loss=0.07851, over 21315.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2946, pruned_loss=0.07316, over 4273102.04 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:20:35,249 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2141046.0, ans=0.125 2023-06-26 04:22:02,781 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2141286.0, ans=0.0 2023-06-26 04:22:23,092 INFO [train.py:996] (1/4) Epoch 12, batch 21450, loss[loss=0.2583, simple_loss=0.3228, pruned_loss=0.09684, over 21609.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3002, pruned_loss=0.07615, over 4280015.92 frames. 
], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:22:47,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2141406.0, ans=0.05 2023-06-26 04:22:57,852 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2141406.0, ans=0.125 2023-06-26 04:23:19,216 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2141466.0, ans=10.0 2023-06-26 04:23:28,957 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.921e+02 8.614e+02 1.088e+03 1.616e+03 2.799e+03, threshold=2.175e+03, percent-clipped=3.0 2023-06-26 04:24:11,509 INFO [train.py:996] (1/4) Epoch 12, batch 21500, loss[loss=0.2333, simple_loss=0.2944, pruned_loss=0.08614, over 21791.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2981, pruned_loss=0.07759, over 4278276.04 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:24:53,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-26 04:25:58,253 INFO [train.py:996] (1/4) Epoch 12, batch 21550, loss[loss=0.2472, simple_loss=0.3473, pruned_loss=0.07355, over 19857.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2913, pruned_loss=0.07512, over 4268517.31 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:27:02,004 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2142066.0, ans=0.0 2023-06-26 04:27:07,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2142126.0, ans=0.0 2023-06-26 04:27:09,921 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 8.216e+02 1.112e+03 1.423e+03 3.148e+03, threshold=2.223e+03, percent-clipped=7.0 2023-06-26 04:27:23,523 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-26 04:27:50,971 INFO [train.py:996] (1/4) Epoch 12, batch 21600, loss[loss=0.2159, simple_loss=0.3352, pruned_loss=0.04834, over 19670.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2872, pruned_loss=0.07312, over 4263130.38 frames. ], batch size: 703, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:28:27,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2142306.0, ans=0.125 2023-06-26 04:29:00,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-26 04:29:21,689 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2142426.0, ans=0.125 2023-06-26 04:29:43,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2023-06-26 04:29:43,324 INFO [train.py:996] (1/4) Epoch 12, batch 21650, loss[loss=0.2403, simple_loss=0.3371, pruned_loss=0.0718, over 21736.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2918, pruned_loss=0.07089, over 4265676.24 frames. 
], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:29:58,717 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2142546.0, ans=0.0 2023-06-26 04:30:08,955 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2142606.0, ans=0.0 2023-06-26 04:30:31,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-26 04:30:56,287 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 7.832e+02 1.347e+03 2.277e+03 5.515e+03, threshold=2.694e+03, percent-clipped=27.0 2023-06-26 04:31:30,515 INFO [train.py:996] (1/4) Epoch 12, batch 21700, loss[loss=0.2335, simple_loss=0.2934, pruned_loss=0.0868, over 21542.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2943, pruned_loss=0.06957, over 4267268.31 frames. ], batch size: 442, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:31:33,042 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2142846.0, ans=0.125 2023-06-26 04:31:45,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2142846.0, ans=0.035 2023-06-26 04:31:55,903 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2142906.0, ans=0.0 2023-06-26 04:32:21,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2142966.0, ans=0.125 2023-06-26 04:32:34,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2143026.0, ans=0.04949747468305833 2023-06-26 04:33:05,979 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2143086.0, ans=0.125 2023-06-26 04:33:07,320 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2143086.0, ans=0.0 2023-06-26 04:33:07,450 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:33:20,560 INFO [train.py:996] (1/4) Epoch 12, batch 21750, loss[loss=0.2182, simple_loss=0.2799, pruned_loss=0.07829, over 21542.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2898, pruned_loss=0.07032, over 4275400.90 frames. ], batch size: 391, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:33:29,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-26 04:33:55,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2143206.0, ans=0.05 2023-06-26 04:34:30,623 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.472e+02 7.650e+02 1.064e+03 1.573e+03 4.038e+03, threshold=2.129e+03, percent-clipped=2.0 2023-06-26 04:35:12,911 INFO [train.py:996] (1/4) Epoch 12, batch 21800, loss[loss=0.2119, simple_loss=0.2902, pruned_loss=0.06677, over 21675.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2863, pruned_loss=0.07099, over 4277889.99 frames. 
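Editor's note: the bracketed loss triples are consistent with the reported per-batch loss being a fixed-weight combination of the two transducer losses; for batch 21800 above, 0.5 * 0.2902 + 0.06677 ~= 0.2119, i.e. loss = 0.5 * simple_loss + pruned_loss. A one-line sketch of that combination (the function name and the exact warm-up behaviour of the weights are assumptions):

def combine_transducer_losses(simple_loss: float, pruned_loss: float,
                              simple_loss_scale: float = 0.5,
                              pruned_loss_scale: float = 1.0) -> float:
    # the value printed as "loss=" in the per-batch log entries
    return simple_loss_scale * simple_loss + pruned_loss_scale * pruned_loss

print(combine_transducer_losses(0.2902, 0.06677))   # ~0.2119, matching batch 21800 above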
], batch size: 248, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:35:43,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2143506.0, ans=0.125 2023-06-26 04:36:38,531 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2143626.0, ans=0.125 2023-06-26 04:36:43,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2143626.0, ans=0.0 2023-06-26 04:37:09,713 INFO [train.py:996] (1/4) Epoch 12, batch 21850, loss[loss=0.2162, simple_loss=0.2937, pruned_loss=0.06935, over 21455.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.293, pruned_loss=0.07137, over 4268699.24 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:37:32,655 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2143806.0, ans=0.1 2023-06-26 04:37:45,189 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2143806.0, ans=0.0 2023-06-26 04:38:16,878 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-26 04:38:17,400 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.705e+02 1.349e+03 2.023e+03 4.101e+03, threshold=2.697e+03, percent-clipped=20.0 2023-06-26 04:38:50,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2143986.0, ans=0.125 2023-06-26 04:38:59,637 INFO [train.py:996] (1/4) Epoch 12, batch 21900, loss[loss=0.2093, simple_loss=0.2769, pruned_loss=0.07083, over 21818.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2934, pruned_loss=0.07238, over 4279641.21 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:39:00,707 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-06-26 04:39:36,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2144106.0, ans=0.0 2023-06-26 04:39:41,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2144166.0, ans=0.125 2023-06-26 04:39:43,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2144166.0, ans=0.2 2023-06-26 04:39:43,284 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2144166.0, ans=0.125 2023-06-26 04:40:18,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2144226.0, ans=0.125 2023-06-26 04:40:20,308 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2144226.0, ans=0.125 2023-06-26 04:40:23,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2144286.0, ans=0.125 2023-06-26 04:40:40,000 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2144346.0, ans=0.125 2023-06-26 04:40:41,125 INFO [train.py:996] (1/4) Epoch 12, batch 21950, loss[loss=0.1769, simple_loss=0.252, pruned_loss=0.05087, over 21782.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2876, pruned_loss=0.0712, over 4278096.87 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:41:00,590 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144346.0, ans=0.1 2023-06-26 04:41:35,454 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2144466.0, ans=0.2 2023-06-26 04:41:57,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.471e+02 7.504e+02 1.022e+03 1.599e+03 3.109e+03, threshold=2.043e+03, percent-clipped=2.0 2023-06-26 04:42:18,808 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2144586.0, ans=0.125 2023-06-26 04:42:20,845 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2144586.0, ans=0.125 2023-06-26 04:42:33,007 INFO [train.py:996] (1/4) Epoch 12, batch 22000, loss[loss=0.2093, simple_loss=0.2746, pruned_loss=0.07196, over 21800.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2806, pruned_loss=0.06773, over 4275014.92 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 32.0 2023-06-26 04:43:24,664 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2144766.0, ans=0.125 2023-06-26 04:43:45,122 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-26 04:44:28,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-26 04:44:30,596 INFO [train.py:996] (1/4) Epoch 12, batch 22050, loss[loss=0.2292, simple_loss=0.3023, pruned_loss=0.07802, over 21667.00 frames. 
], tot_loss[loss=0.2134, simple_loss=0.2866, pruned_loss=0.07015, over 4269555.04 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:44:47,390 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:45:04,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2145006.0, ans=0.2 2023-06-26 04:45:47,961 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.424e+02 9.779e+02 1.604e+03 2.153e+03 5.995e+03, threshold=3.207e+03, percent-clipped=28.0 2023-06-26 04:45:48,456 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2145126.0, ans=0.125 2023-06-26 04:46:10,144 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2145186.0, ans=0.0 2023-06-26 04:46:21,815 INFO [train.py:996] (1/4) Epoch 12, batch 22100, loss[loss=0.2431, simple_loss=0.3137, pruned_loss=0.08627, over 21216.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2967, pruned_loss=0.07512, over 4250205.73 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:46:24,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2145246.0, ans=0.2 2023-06-26 04:46:36,345 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2145246.0, ans=0.125 2023-06-26 04:46:38,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2145306.0, ans=0.125 2023-06-26 04:47:08,886 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:47:50,656 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2145486.0, ans=0.125 2023-06-26 04:47:53,646 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2145486.0, ans=0.125 2023-06-26 04:48:02,837 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.45 vs. limit=10.0 2023-06-26 04:48:11,837 INFO [train.py:996] (1/4) Epoch 12, batch 22150, loss[loss=0.1915, simple_loss=0.2729, pruned_loss=0.05507, over 21692.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2992, pruned_loss=0.07583, over 4256084.46 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:49:27,211 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.601e+02 7.424e+02 1.002e+03 1.530e+03 2.924e+03, threshold=2.004e+03, percent-clipped=0.0 2023-06-26 04:50:01,390 INFO [train.py:996] (1/4) Epoch 12, batch 22200, loss[loss=0.2636, simple_loss=0.3462, pruned_loss=0.09045, over 20042.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3015, pruned_loss=0.07778, over 4271012.56 frames. ], batch size: 702, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:50:38,552 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2145906.0, ans=0.125 2023-06-26 04:51:52,881 INFO [train.py:996] (1/4) Epoch 12, batch 22250, loss[loss=0.2702, simple_loss=0.3833, pruned_loss=0.07859, over 19791.00 frames. 
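Editor's note: tot_loss[...] is reported "over" roughly 4.25-4.28 million frames throughout this stretch, so it is not a single-batch value but a frame-weighted aggregate over many recent batches. A small sketch of that kind of tracker follows; the decay factor, and whether the recipe uses exponential forgetting at all rather than a fixed window, are assumptions.

class FrameWeightedLoss:
    """Frame-weighted, slowly-forgetting average of recent batch losses."""
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frame_sum = 0.0

    def update(self, batch_loss: float, batch_frames: float) -> float:
        # older batches are gradually forgotten, newer ones weighted by frame count
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frame_sum = self.decay * self.frame_sum + batch_frames
        return self.loss_sum / self.frame_sum   # printed as tot_loss[... over N frames]

tracker = FrameWeightedLoss()
for loss, frames in [(0.2292, 21667.0), (0.1915, 21692.0), (0.2636, 20042.0)]:
    print(tracker.update(loss, frames))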
], tot_loss[loss=0.2343, simple_loss=0.31, pruned_loss=0.07931, over 4269450.87 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:52:38,719 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2146266.0, ans=0.1 2023-06-26 04:52:51,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2146266.0, ans=0.2 2023-06-26 04:52:58,270 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2146266.0, ans=0.0 2023-06-26 04:53:05,317 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2146326.0, ans=0.125 2023-06-26 04:53:10,019 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.929e+02 1.009e+03 1.467e+03 2.182e+03 5.502e+03, threshold=2.934e+03, percent-clipped=31.0 2023-06-26 04:53:10,491 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2146326.0, ans=0.125 2023-06-26 04:53:29,439 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2146386.0, ans=0.0 2023-06-26 04:53:42,559 INFO [train.py:996] (1/4) Epoch 12, batch 22300, loss[loss=0.2345, simple_loss=0.3089, pruned_loss=0.08002, over 21897.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.313, pruned_loss=0.08199, over 4272796.78 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:54:04,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2146506.0, ans=0.125 2023-06-26 04:54:22,767 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2146506.0, ans=0.125 2023-06-26 04:54:22,777 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2146506.0, ans=0.2 2023-06-26 04:54:59,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2146626.0, ans=0.0 2023-06-26 04:55:06,673 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-26 04:55:30,145 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2146686.0, ans=0.2 2023-06-26 04:55:32,925 INFO [train.py:996] (1/4) Epoch 12, batch 22350, loss[loss=0.1968, simple_loss=0.2752, pruned_loss=0.0592, over 21864.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3104, pruned_loss=0.08191, over 4281398.19 frames. 
], batch size: 333, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:55:36,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2146746.0, ans=0.125 2023-06-26 04:55:51,020 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2146746.0, ans=0.125 2023-06-26 04:56:03,564 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2146806.0, ans=0.125 2023-06-26 04:56:57,051 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.431e+02 9.155e+02 1.132e+03 1.676e+03 3.302e+03, threshold=2.265e+03, percent-clipped=3.0 2023-06-26 04:57:23,249 INFO [train.py:996] (1/4) Epoch 12, batch 22400, loss[loss=0.2132, simple_loss=0.2854, pruned_loss=0.07049, over 21765.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3076, pruned_loss=0.07793, over 4283417.81 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:57:46,256 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2147046.0, ans=0.125 2023-06-26 04:57:46,445 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2147046.0, ans=0.0 2023-06-26 04:57:52,430 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-26 04:58:19,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2147166.0, ans=0.0 2023-06-26 04:59:13,379 INFO [train.py:996] (1/4) Epoch 12, batch 22450, loss[loss=0.2095, simple_loss=0.2779, pruned_loss=0.07054, over 21775.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.0776, over 4278774.22 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:00:02,729 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2147406.0, ans=0.0 2023-06-26 05:00:22,212 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-26 05:00:37,265 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.261e+02 8.888e+02 1.140e+03 1.642e+03 4.602e+03, threshold=2.279e+03, percent-clipped=11.0 2023-06-26 05:01:13,412 INFO [train.py:996] (1/4) Epoch 12, batch 22500, loss[loss=0.2377, simple_loss=0.3255, pruned_loss=0.07498, over 21660.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.298, pruned_loss=0.07688, over 4272681.81 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:01:53,525 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147766.0, ans=0.1 2023-06-26 05:02:13,058 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2147766.0, ans=0.5 2023-06-26 05:02:24,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2147826.0, ans=0.2 2023-06-26 05:02:31,891 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.41 vs. 
limit=12.0 2023-06-26 05:02:38,123 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2147886.0, ans=0.125 2023-06-26 05:02:41,736 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2147886.0, ans=0.0 2023-06-26 05:03:04,492 INFO [train.py:996] (1/4) Epoch 12, batch 22550, loss[loss=0.2285, simple_loss=0.3034, pruned_loss=0.0768, over 21870.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3042, pruned_loss=0.07858, over 4269493.81 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:04:18,089 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.902e+02 9.532e+02 1.362e+03 1.911e+03 4.517e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-26 05:05:00,747 INFO [train.py:996] (1/4) Epoch 12, batch 22600, loss[loss=0.1869, simple_loss=0.2649, pruned_loss=0.05449, over 21623.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3059, pruned_loss=0.07831, over 4270442.14 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:05:43,099 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2148366.0, ans=0.035 2023-06-26 05:05:48,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2148366.0, ans=0.95 2023-06-26 05:06:06,652 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2148426.0, ans=0.0 2023-06-26 05:06:45,108 INFO [train.py:996] (1/4) Epoch 12, batch 22650, loss[loss=0.2484, simple_loss=0.2936, pruned_loss=0.1016, over 21355.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.302, pruned_loss=0.07789, over 4275452.02 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:06:54,137 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2148546.0, ans=0.125 2023-06-26 05:06:57,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2148546.0, ans=0.125 2023-06-26 05:07:52,292 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.795e+02 9.798e+02 1.388e+03 1.950e+03 5.687e+03, threshold=2.777e+03, percent-clipped=13.0 2023-06-26 05:08:01,863 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-26 05:08:31,884 INFO [train.py:996] (1/4) Epoch 12, batch 22700, loss[loss=0.2273, simple_loss=0.2842, pruned_loss=0.08522, over 21565.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2959, pruned_loss=0.07743, over 4281923.73 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:08:36,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2148846.0, ans=0.5 2023-06-26 05:08:39,809 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2148846.0, ans=0.2 2023-06-26 05:09:13,268 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. 
limit=15.0 2023-06-26 05:09:16,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2148966.0, ans=0.1 2023-06-26 05:09:18,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2148966.0, ans=0.125 2023-06-26 05:10:24,748 INFO [train.py:996] (1/4) Epoch 12, batch 22750, loss[loss=0.3232, simple_loss=0.3738, pruned_loss=0.1363, over 21306.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2972, pruned_loss=0.07961, over 4267504.01 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:10:51,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2149206.0, ans=0.0 2023-06-26 05:10:56,880 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2149206.0, ans=0.125 2023-06-26 05:11:11,057 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.72 vs. limit=15.0 2023-06-26 05:11:24,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2149326.0, ans=0.0 2023-06-26 05:11:44,547 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.150e+02 7.737e+02 1.012e+03 1.483e+03 2.915e+03, threshold=2.025e+03, percent-clipped=0.0 2023-06-26 05:11:59,545 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2149386.0, ans=0.125 2023-06-26 05:12:14,853 INFO [train.py:996] (1/4) Epoch 12, batch 22800, loss[loss=0.2758, simple_loss=0.3449, pruned_loss=0.1034, over 21879.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3017, pruned_loss=0.08193, over 4276585.76 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:12:24,264 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2149446.0, ans=0.1 2023-06-26 05:12:43,621 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2149506.0, ans=0.125 2023-06-26 05:13:12,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2149626.0, ans=0.0 2023-06-26 05:13:54,238 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-26 05:14:03,896 INFO [train.py:996] (1/4) Epoch 12, batch 22850, loss[loss=0.2394, simple_loss=0.3034, pruned_loss=0.08767, over 21652.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2982, pruned_loss=0.08118, over 4278095.23 frames. 
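Editor's note: the "[scaling.py:962] Whitening: name=..., num_groups=..., num_channels=..., metric=X vs. limit=Y" entries compare a whiteness statistic of a module's activations against a (scheduled) limit, presumably used to decide how strongly to push the activations toward being decorrelated. The exact statistic is not reproduced here; the sketch below uses one plausible proxy, mean(eigenvalue^2) / mean(eigenvalue)^2 of the per-group feature covariance, which equals 1.0 for perfectly white features and grows as the covariance becomes more anisotropic.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int) -> float:
    """Anisotropy proxy per channel group (an assumption, not scaling.py's formula)."""
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    worst = 0.0
    for g in range(num_groups):
        feats = x[:, g, :]
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.t() @ feats / num_frames
        eigs = torch.linalg.eigvalsh(cov)                 # real eigenvalues, ascending
        metric = (eigs.pow(2).mean() / eigs.mean().pow(2)).item()
        worst = max(worst, metric)
    return worst

feats = torch.randn(4000, 256)                            # roughly white features
print(whitening_metric(feats, num_groups=1))              # close to 1.0, far below e.g. limit=15.0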
], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:14:07,897 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2149746.0, ans=0.125 2023-06-26 05:14:14,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2149746.0, ans=0.125 2023-06-26 05:14:18,071 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2149746.0, ans=0.125 2023-06-26 05:14:43,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2149866.0, ans=0.025 2023-06-26 05:14:50,943 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2149866.0, ans=0.125 2023-06-26 05:14:54,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-26 05:14:59,971 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-26 05:15:00,815 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2149926.0, ans=0.125 2023-06-26 05:15:06,011 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2149926.0, ans=0.0 2023-06-26 05:15:15,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2149926.0, ans=0.0 2023-06-26 05:15:15,805 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=15.0 2023-06-26 05:15:23,330 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 9.685e+02 1.570e+03 2.544e+03 4.880e+03, threshold=3.139e+03, percent-clipped=35.0 2023-06-26 05:15:31,912 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2149926.0, ans=0.125 2023-06-26 05:15:45,603 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2149986.0, ans=0.125 2023-06-26 05:15:54,418 INFO [train.py:996] (1/4) Epoch 12, batch 22900, loss[loss=0.2376, simple_loss=0.3311, pruned_loss=0.07202, over 21722.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2988, pruned_loss=0.07952, over 4270626.98 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:15:56,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2150046.0, ans=0.2 2023-06-26 05:16:31,130 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-26 05:17:18,855 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2150226.0, ans=0.0 2023-06-26 05:17:24,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-26 05:17:29,702 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.80 vs. 
limit=6.0 2023-06-26 05:17:49,290 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2150286.0, ans=0.125 2023-06-26 05:17:53,773 INFO [train.py:996] (1/4) Epoch 12, batch 22950, loss[loss=0.264, simple_loss=0.3719, pruned_loss=0.07803, over 21650.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3109, pruned_loss=0.07765, over 4267652.03 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:17:55,179 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-26 05:18:51,761 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2150466.0, ans=0.0 2023-06-26 05:19:12,497 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 8.459e+02 1.367e+03 2.050e+03 4.078e+03, threshold=2.734e+03, percent-clipped=4.0 2023-06-26 05:19:42,499 INFO [train.py:996] (1/4) Epoch 12, batch 23000, loss[loss=0.1965, simple_loss=0.277, pruned_loss=0.05802, over 21094.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.313, pruned_loss=0.07616, over 4276194.53 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:21:15,314 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-26 05:21:26,353 INFO [train.py:996] (1/4) Epoch 12, batch 23050, loss[loss=0.2828, simple_loss=0.35, pruned_loss=0.1077, over 21810.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3146, pruned_loss=0.07834, over 4282082.42 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:22:43,108 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=22.5 2023-06-26 05:22:54,534 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 9.048e+02 1.306e+03 1.787e+03 2.952e+03, threshold=2.611e+03, percent-clipped=5.0 2023-06-26 05:23:19,197 INFO [train.py:996] (1/4) Epoch 12, batch 23100, loss[loss=0.2071, simple_loss=0.2725, pruned_loss=0.07086, over 21662.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3107, pruned_loss=0.07925, over 4275034.99 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:23:42,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2151306.0, ans=0.125 2023-06-26 05:24:31,246 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2151426.0, ans=0.0 2023-06-26 05:25:08,273 INFO [train.py:996] (1/4) Epoch 12, batch 23150, loss[loss=0.2255, simple_loss=0.2933, pruned_loss=0.07885, over 21846.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.304, pruned_loss=0.07808, over 4277109.58 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:25:42,049 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=10.0 2023-06-26 05:26:25,824 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.610e+02 9.726e+02 1.363e+03 3.124e+03, threshold=1.945e+03, percent-clipped=3.0 2023-06-26 05:26:42,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2151786.0, ans=0.1 2023-06-26 05:26:53,450 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-26 05:26:55,437 INFO [train.py:996] (1/4) Epoch 12, batch 23200, loss[loss=0.2139, simple_loss=0.2875, pruned_loss=0.07018, over 21906.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3039, pruned_loss=0.07979, over 4289997.56 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:27:54,304 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2151966.0, ans=0.125 2023-06-26 05:28:06,814 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.23 vs. limit=15.0 2023-06-26 05:28:32,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2152086.0, ans=0.0 2023-06-26 05:28:41,961 INFO [train.py:996] (1/4) Epoch 12, batch 23250, loss[loss=0.2044, simple_loss=0.2723, pruned_loss=0.06823, over 21474.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3038, pruned_loss=0.08085, over 4297388.45 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:28:42,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=2152146.0, ans=12.0 2023-06-26 05:29:38,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=2152266.0, ans=0.5 2023-06-26 05:30:12,615 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.710e+02 8.107e+02 1.074e+03 1.814e+03 3.794e+03, threshold=2.148e+03, percent-clipped=19.0 2023-06-26 05:30:26,648 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:30:35,137 INFO [train.py:996] (1/4) Epoch 12, batch 23300, loss[loss=0.3701, simple_loss=0.4544, pruned_loss=0.1429, over 21428.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3113, pruned_loss=0.08283, over 4299377.41 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:30:35,549 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2152446.0, ans=0.125 2023-06-26 05:30:57,264 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. 
limit=15.0 2023-06-26 05:31:48,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2152626.0, ans=0.125 2023-06-26 05:32:10,527 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152686.0, ans=0.1 2023-06-26 05:32:12,294 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2152686.0, ans=0.2 2023-06-26 05:32:12,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2152686.0, ans=0.1 2023-06-26 05:32:31,082 INFO [train.py:996] (1/4) Epoch 12, batch 23350, loss[loss=0.282, simple_loss=0.3685, pruned_loss=0.09774, over 21482.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3152, pruned_loss=0.08171, over 4297741.28 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:33:39,520 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-26 05:33:52,920 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2152926.0, ans=0.125 2023-06-26 05:33:54,025 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+02 9.116e+02 1.311e+03 1.870e+03 4.347e+03, threshold=2.623e+03, percent-clipped=16.0 2023-06-26 05:33:59,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2152986.0, ans=0.0 2023-06-26 05:34:21,137 INFO [train.py:996] (1/4) Epoch 12, batch 23400, loss[loss=0.2046, simple_loss=0.3111, pruned_loss=0.049, over 20749.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3102, pruned_loss=0.07877, over 4294289.89 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:34:35,077 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-26 05:35:31,074 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2153226.0, ans=0.0 2023-06-26 05:36:01,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153286.0, ans=0.1 2023-06-26 05:36:11,445 INFO [train.py:996] (1/4) Epoch 12, batch 23450, loss[loss=0.273, simple_loss=0.3372, pruned_loss=0.1044, over 21485.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3105, pruned_loss=0.08066, over 4298435.77 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:36:36,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2153346.0, ans=0.125 2023-06-26 05:37:29,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2153526.0, ans=0.125 2023-06-26 05:37:31,633 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.599e+02 8.250e+02 1.125e+03 1.414e+03 2.942e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 05:38:03,804 INFO [train.py:996] (1/4) Epoch 12, batch 23500, loss[loss=0.2804, simple_loss=0.334, pruned_loss=0.1134, over 21606.00 frames. ], tot_loss[loss=0.237, simple_loss=0.31, pruned_loss=0.08198, over 4297206.35 frames. 
], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:38:20,317 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-26 05:38:21,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153646.0, ans=0.1 2023-06-26 05:38:28,669 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-26 05:39:12,984 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2153826.0, ans=0.0 2023-06-26 05:39:22,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2153826.0, ans=0.125 2023-06-26 05:39:52,430 INFO [train.py:996] (1/4) Epoch 12, batch 23550, loss[loss=0.2261, simple_loss=0.3588, pruned_loss=0.04674, over 19784.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.305, pruned_loss=0.08183, over 4288784.59 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:41:09,212 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.684e+02 8.231e+02 1.204e+03 1.979e+03 6.408e+03, threshold=2.407e+03, percent-clipped=19.0 2023-06-26 05:41:48,759 INFO [train.py:996] (1/4) Epoch 12, batch 23600, loss[loss=0.2177, simple_loss=0.2925, pruned_loss=0.07145, over 21415.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3052, pruned_loss=0.08093, over 4285538.26 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:41:49,212 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2154246.0, ans=0.0 2023-06-26 05:42:09,749 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-26 05:42:30,513 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2154366.0, ans=0.125 2023-06-26 05:42:40,773 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2154366.0, ans=0.05 2023-06-26 05:43:26,066 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2154486.0, ans=15.0 2023-06-26 05:43:40,729 INFO [train.py:996] (1/4) Epoch 12, batch 23650, loss[loss=0.1855, simple_loss=0.2545, pruned_loss=0.05825, over 16729.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3061, pruned_loss=0.07942, over 4281417.78 frames. ], batch size: 61, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:43:57,460 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2154546.0, ans=0.0 2023-06-26 05:44:18,675 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2154606.0, ans=0.125 2023-06-26 05:45:11,531 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.083e+02 1.007e+03 1.364e+03 1.897e+03 4.250e+03, threshold=2.728e+03, percent-clipped=14.0 2023-06-26 05:45:36,468 INFO [train.py:996] (1/4) Epoch 12, batch 23700, loss[loss=0.2634, simple_loss=0.3375, pruned_loss=0.09462, over 21394.00 frames. 
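Editor's note: the grad_scale field in the per-batch entries (8.0, 16.0 or 32.0 across this stretch, e.g. 32.0 at batch 23600 above) is the loss-scaling factor used for fp16 training; it typically grows while optimizer steps keep succeeding and is cut back when an overflow is detected. Standard PyTorch AMP shows the same behaviour; the tiny model and optimizer below are placeholders rather than the recipe's, and the sketch assumes a CUDA device.

import torch

device = "cuda"
model = torch.nn.Linear(80, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.4e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=16.0)

for step in range(200):
    x = torch.randn(8, 80, device=device)
    with torch.cuda.amp.autocast():
        loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    scaler.scale(loss).backward()      # gradients are computed on loss * scale
    scaler.step(optimizer)             # unscales grads, skips the step on inf/nan
    scaler.update()                    # grows or shrinks the scale over time
    if step % 50 == 0:
        print(step, scaler.get_scale())   # the value logged as grad_scale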
], tot_loss[loss=0.2323, simple_loss=0.3076, pruned_loss=0.07851, over 4281542.57 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:45:44,175 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2154846.0, ans=10.0 2023-06-26 05:46:03,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2154906.0, ans=0.1 2023-06-26 05:46:22,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2154966.0, ans=0.125 2023-06-26 05:47:27,039 INFO [train.py:996] (1/4) Epoch 12, batch 23750, loss[loss=0.2075, simple_loss=0.3058, pruned_loss=0.05466, over 21705.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3101, pruned_loss=0.07817, over 4271427.38 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:47:29,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2155146.0, ans=0.2 2023-06-26 05:47:31,384 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2155146.0, ans=0.0 2023-06-26 05:47:51,214 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155206.0, ans=0.1 2023-06-26 05:48:42,257 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-26 05:48:55,967 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 9.334e+02 1.253e+03 1.718e+03 3.362e+03, threshold=2.506e+03, percent-clipped=5.0 2023-06-26 05:49:11,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155386.0, ans=0.1 2023-06-26 05:49:21,857 INFO [train.py:996] (1/4) Epoch 12, batch 23800, loss[loss=0.2524, simple_loss=0.3453, pruned_loss=0.07975, over 21712.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3075, pruned_loss=0.07629, over 4273415.12 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:49:36,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2155446.0, ans=0.125 2023-06-26 05:50:24,903 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-26 05:51:16,628 INFO [train.py:996] (1/4) Epoch 12, batch 23850, loss[loss=0.266, simple_loss=0.329, pruned_loss=0.1015, over 21448.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3177, pruned_loss=0.07951, over 4271819.43 frames. 
], batch size: 194, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:51:52,544 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:52:06,533 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2155866.0, ans=0.125 2023-06-26 05:52:31,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155866.0, ans=0.1 2023-06-26 05:52:47,262 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2155926.0, ans=0.0 2023-06-26 05:52:49,994 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 1.136e+03 1.834e+03 2.559e+03 6.160e+03, threshold=3.668e+03, percent-clipped=28.0 2023-06-26 05:52:52,324 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2155986.0, ans=0.125 2023-06-26 05:53:13,713 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2156046.0, ans=0.0 2023-06-26 05:53:14,989 INFO [train.py:996] (1/4) Epoch 12, batch 23900, loss[loss=0.2322, simple_loss=0.3141, pruned_loss=0.07509, over 21585.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3246, pruned_loss=0.0815, over 4264836.48 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:53:38,169 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2156046.0, ans=0.1 2023-06-26 05:53:52,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 05:54:35,227 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 05:55:04,334 INFO [train.py:996] (1/4) Epoch 12, batch 23950, loss[loss=0.2305, simple_loss=0.3041, pruned_loss=0.0785, over 21721.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.318, pruned_loss=0.08146, over 4259990.21 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:56:32,153 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 7.798e+02 1.081e+03 1.467e+03 3.120e+03, threshold=2.162e+03, percent-clipped=0.0 2023-06-26 05:56:43,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2156586.0, ans=0.1 2023-06-26 05:56:52,771 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:57:07,628 INFO [train.py:996] (1/4) Epoch 12, batch 24000, loss[loss=0.189, simple_loss=0.2512, pruned_loss=0.06347, over 20195.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3187, pruned_loss=0.08395, over 4260698.71 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:57:07,629 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 05:57:25,647 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2659, simple_loss=0.36, pruned_loss=0.08593, over 1796401.00 frames. 
2023-06-26 05:57:25,648 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-26 05:57:31,992 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2156646.0, ans=0.035 2023-06-26 05:57:40,894 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2156646.0, ans=0.0 2023-06-26 05:58:04,591 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2156706.0, ans=0.0 2023-06-26 05:58:44,424 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2156826.0, ans=0.125 2023-06-26 05:59:08,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2156886.0, ans=0.125 2023-06-26 05:59:08,368 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2156886.0, ans=0.07 2023-06-26 05:59:08,394 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:59:15,061 INFO [train.py:996] (1/4) Epoch 12, batch 24050, loss[loss=0.2189, simple_loss=0.3075, pruned_loss=0.06515, over 21846.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3215, pruned_loss=0.08492, over 4264565.82 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:00:45,853 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.641e+02 9.752e+02 1.281e+03 1.752e+03 4.034e+03, threshold=2.563e+03, percent-clipped=15.0 2023-06-26 06:00:55,599 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2157186.0, ans=0.2 2023-06-26 06:01:05,430 INFO [train.py:996] (1/4) Epoch 12, batch 24100, loss[loss=0.2642, simple_loss=0.3534, pruned_loss=0.08748, over 21766.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.32, pruned_loss=0.08246, over 4265574.88 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:01:44,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2157306.0, ans=0.0 2023-06-26 06:02:06,824 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2157366.0, ans=0.2 2023-06-26 06:02:21,374 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2157426.0, ans=0.125 2023-06-26 06:02:44,497 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-26 06:02:57,861 INFO [train.py:996] (1/4) Epoch 12, batch 24150, loss[loss=0.2457, simple_loss=0.3043, pruned_loss=0.09359, over 21866.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3194, pruned_loss=0.08392, over 4270766.25 frames. 
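Editor's note: the validation block just above ("Computing validation loss ... validation: loss=0.2659 ... over 1796401.00 frames ... Maximum memory allocated so far is 24453MB") interleaves a periodic held-out evaluation with a report of peak CUDA memory. A bare-bones sketch of that pattern; model, valid_loader and compute_loss are hypothetical placeholders supplied by the caller.

import logging
import torch

logging.basicConfig(level=logging.INFO)

def run_validation(model, valid_loader, device, compute_loss):
    """Evaluate on the dev set, then report peak CUDA memory so far."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch, device)   # hypothetical helper
            tot_loss += float(loss) * num_frames
            tot_frames += num_frames
    model.train()
    logging.info(f"validation: loss={tot_loss / tot_frames:.4g}, over {tot_frames:.2f} frames.")
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {max_mb}MB")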
], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:03:13,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2157546.0, ans=0.125 2023-06-26 06:03:20,138 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2157606.0, ans=0.125 2023-06-26 06:04:07,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2157726.0, ans=0.125 2023-06-26 06:04:23,720 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2157726.0, ans=0.125 2023-06-26 06:04:28,010 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-26 06:04:30,164 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.135e+02 1.050e+03 1.414e+03 1.892e+03 4.392e+03, threshold=2.829e+03, percent-clipped=9.0 2023-06-26 06:04:55,057 INFO [train.py:996] (1/4) Epoch 12, batch 24200, loss[loss=0.2539, simple_loss=0.3481, pruned_loss=0.0799, over 21603.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3205, pruned_loss=0.08437, over 4280010.75 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:05:13,583 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2157906.0, ans=0.1 2023-06-26 06:06:12,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-26 06:06:47,661 INFO [train.py:996] (1/4) Epoch 12, batch 24250, loss[loss=0.2442, simple_loss=0.3386, pruned_loss=0.07493, over 21490.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3172, pruned_loss=0.07894, over 4272439.42 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:06:59,150 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2158146.0, ans=0.95 2023-06-26 06:07:01,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.65 vs. limit=10.0 2023-06-26 06:07:11,511 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2158206.0, ans=0.125 2023-06-26 06:07:22,006 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2158206.0, ans=0.0 2023-06-26 06:07:43,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2158266.0, ans=0.05 2023-06-26 06:08:18,626 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.845e+02 9.840e+02 1.673e+03 2.724e+03 4.672e+03, threshold=3.346e+03, percent-clipped=24.0 2023-06-26 06:08:29,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2158386.0, ans=0.125 2023-06-26 06:08:37,629 INFO [train.py:996] (1/4) Epoch 12, batch 24300, loss[loss=0.1666, simple_loss=0.254, pruned_loss=0.03961, over 21758.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3117, pruned_loss=0.0735, over 4274807.11 frames. 
], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:10:28,352 INFO [train.py:996] (1/4) Epoch 12, batch 24350, loss[loss=0.2224, simple_loss=0.2967, pruned_loss=0.07406, over 21793.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.308, pruned_loss=0.07291, over 4277486.08 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:10:34,227 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2158746.0, ans=0.0 2023-06-26 06:11:00,064 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2158806.0, ans=0.0 2023-06-26 06:11:11,473 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-26 06:11:17,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2158806.0, ans=0.0 2023-06-26 06:11:29,766 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2158866.0, ans=0.05 2023-06-26 06:12:01,480 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 8.199e+02 1.178e+03 1.878e+03 3.422e+03, threshold=2.355e+03, percent-clipped=1.0 2023-06-26 06:12:24,458 INFO [train.py:996] (1/4) Epoch 12, batch 24400, loss[loss=0.182, simple_loss=0.2482, pruned_loss=0.05789, over 21790.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.314, pruned_loss=0.07711, over 4276069.03 frames. ], batch size: 102, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:13:22,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2159166.0, ans=0.1 2023-06-26 06:14:17,276 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2159286.0, ans=0.125 2023-06-26 06:14:22,185 INFO [train.py:996] (1/4) Epoch 12, batch 24450, loss[loss=0.2131, simple_loss=0.3013, pruned_loss=0.06245, over 21688.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3141, pruned_loss=0.07802, over 4280108.76 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:14:59,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2159406.0, ans=10.0 2023-06-26 06:15:26,215 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2159526.0, ans=6.0 2023-06-26 06:15:42,754 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.744e+02 8.987e+02 1.393e+03 2.049e+03 5.528e+03, threshold=2.786e+03, percent-clipped=20.0 2023-06-26 06:16:10,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2159646.0, ans=0.125 2023-06-26 06:16:11,939 INFO [train.py:996] (1/4) Epoch 12, batch 24500, loss[loss=0.2524, simple_loss=0.3196, pruned_loss=0.09261, over 21860.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3139, pruned_loss=0.07773, over 4286257.50 frames. 
], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:16:23,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2159646.0, ans=0.125 2023-06-26 06:16:41,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2159706.0, ans=0.125 2023-06-26 06:17:05,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2159766.0, ans=0.025 2023-06-26 06:17:08,529 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159766.0, ans=0.1 2023-06-26 06:17:17,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2159826.0, ans=0.125 2023-06-26 06:17:19,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2159826.0, ans=0.09899494936611666 2023-06-26 06:17:47,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2159886.0, ans=0.125 2023-06-26 06:17:48,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2159886.0, ans=0.125 2023-06-26 06:18:10,460 INFO [train.py:996] (1/4) Epoch 12, batch 24550, loss[loss=0.2587, simple_loss=0.3345, pruned_loss=0.09143, over 21330.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3173, pruned_loss=0.08092, over 4290452.30 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:18:20,953 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-26 06:18:28,319 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2160006.0, ans=0.2 2023-06-26 06:18:47,338 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.90 vs. limit=6.0 2023-06-26 06:19:07,069 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:19:32,465 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2160126.0, ans=0.0 2023-06-26 06:19:40,396 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.015e+02 1.028e+03 1.361e+03 2.117e+03 3.876e+03, threshold=2.722e+03, percent-clipped=8.0 2023-06-26 06:19:43,015 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-26 06:20:03,168 INFO [train.py:996] (1/4) Epoch 12, batch 24600, loss[loss=0.1838, simple_loss=0.2518, pruned_loss=0.05796, over 21452.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.313, pruned_loss=0.08003, over 4277171.03 frames. 
], batch size: 212, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:20:18,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2160246.0, ans=0.125 2023-06-26 06:20:31,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2160306.0, ans=0.0 2023-06-26 06:20:49,506 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2160366.0, ans=0.1 2023-06-26 06:20:51,443 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2160366.0, ans=0.0 2023-06-26 06:20:58,493 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2160366.0, ans=0.125 2023-06-26 06:20:58,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2160366.0, ans=0.1 2023-06-26 06:21:37,174 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2160486.0, ans=0.2 2023-06-26 06:21:48,045 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2160486.0, ans=0.2 2023-06-26 06:21:49,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2160486.0, ans=0.125 2023-06-26 06:21:53,015 INFO [train.py:996] (1/4) Epoch 12, batch 24650, loss[loss=0.2039, simple_loss=0.264, pruned_loss=0.07188, over 21281.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3045, pruned_loss=0.0788, over 4277106.57 frames. ], batch size: 551, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:22:25,687 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2160606.0, ans=0.125 2023-06-26 06:23:21,784 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.831e+02 9.882e+02 1.634e+03 2.648e+03 5.082e+03, threshold=3.268e+03, percent-clipped=24.0 2023-06-26 06:23:42,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2160786.0, ans=0.2 2023-06-26 06:23:45,430 INFO [train.py:996] (1/4) Epoch 12, batch 24700, loss[loss=0.1984, simple_loss=0.276, pruned_loss=0.06038, over 21627.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3023, pruned_loss=0.07789, over 4277620.40 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:23:49,100 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2160846.0, ans=0.0 2023-06-26 06:24:12,335 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-26 06:24:50,221 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2161026.0, ans=0.125 2023-06-26 06:25:32,110 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2161086.0, ans=0.125 2023-06-26 06:25:34,930 INFO [train.py:996] (1/4) Epoch 12, batch 24750, loss[loss=0.1834, simple_loss=0.2583, pruned_loss=0.05421, over 21772.00 frames. 
], tot_loss[loss=0.2234, simple_loss=0.2956, pruned_loss=0.07562, over 4274308.58 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:26:22,917 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2161266.0, ans=0.125 2023-06-26 06:26:30,477 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-26 06:26:52,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2161326.0, ans=0.125 2023-06-26 06:26:55,424 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.337e+02 7.152e+02 1.055e+03 1.533e+03 3.030e+03, threshold=2.111e+03, percent-clipped=0.0 2023-06-26 06:27:17,447 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2161446.0, ans=0.125 2023-06-26 06:27:24,943 INFO [train.py:996] (1/4) Epoch 12, batch 24800, loss[loss=0.212, simple_loss=0.2753, pruned_loss=0.07435, over 21514.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2899, pruned_loss=0.07488, over 4281476.07 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 06:27:42,666 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2161506.0, ans=0.125 2023-06-26 06:29:09,404 INFO [train.py:996] (1/4) Epoch 12, batch 24850, loss[loss=0.2247, simple_loss=0.2853, pruned_loss=0.08201, over 21476.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2909, pruned_loss=0.07599, over 4283901.33 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:29:27,882 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2161746.0, ans=0.125 2023-06-26 06:30:05,225 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:30:26,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2161926.0, ans=0.2 2023-06-26 06:30:30,001 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2161926.0, ans=0.125 2023-06-26 06:30:45,751 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.468e+02 9.714e+02 1.514e+03 2.212e+03 4.137e+03, threshold=3.028e+03, percent-clipped=27.0 2023-06-26 06:31:02,864 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2161986.0, ans=0.04949747468305833 2023-06-26 06:31:04,554 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161986.0, ans=0.1 2023-06-26 06:31:07,652 INFO [train.py:996] (1/4) Epoch 12, batch 24900, loss[loss=0.2218, simple_loss=0.3046, pruned_loss=0.06949, over 21761.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2934, pruned_loss=0.07668, over 4282725.10 frames. 
], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:31:43,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2162106.0, ans=0.0 2023-06-26 06:31:45,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2162106.0, ans=0.2 2023-06-26 06:31:45,609 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2162106.0, ans=0.125 2023-06-26 06:32:57,856 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2162286.0, ans=0.125 2023-06-26 06:33:08,351 INFO [train.py:996] (1/4) Epoch 12, batch 24950, loss[loss=0.2443, simple_loss=0.3223, pruned_loss=0.08313, over 21411.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3013, pruned_loss=0.08077, over 4281472.50 frames. ], batch size: 549, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:33:29,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2162406.0, ans=0.1 2023-06-26 06:33:31,827 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-26 06:34:30,346 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2162526.0, ans=0.0 2023-06-26 06:34:43,974 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.936e+02 9.070e+02 1.289e+03 1.993e+03 4.042e+03, threshold=2.579e+03, percent-clipped=8.0 2023-06-26 06:34:59,526 INFO [train.py:996] (1/4) Epoch 12, batch 25000, loss[loss=0.2405, simple_loss=0.3136, pruned_loss=0.08367, over 21537.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3071, pruned_loss=0.08148, over 4268675.52 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:35:33,483 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-26 06:36:23,884 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2162826.0, ans=0.125 2023-06-26 06:36:27,121 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=2162886.0, ans=0.2 2023-06-26 06:36:49,072 INFO [train.py:996] (1/4) Epoch 12, batch 25050, loss[loss=0.1934, simple_loss=0.2649, pruned_loss=0.06101, over 21836.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3013, pruned_loss=0.08037, over 4272205.96 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:38:17,928 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2163186.0, ans=0.2 2023-06-26 06:38:26,473 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 8.095e+02 1.236e+03 1.556e+03 3.258e+03, threshold=2.471e+03, percent-clipped=4.0 2023-06-26 06:38:35,693 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2163186.0, ans=0.04949747468305833 2023-06-26 06:38:42,000 INFO [train.py:996] (1/4) Epoch 12, batch 25100, loss[loss=0.2206, simple_loss=0.2786, pruned_loss=0.08132, over 21737.00 frames. 
], tot_loss[loss=0.228, simple_loss=0.2969, pruned_loss=0.07952, over 4282353.36 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:39:10,737 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2163306.0, ans=0.09899494936611666 2023-06-26 06:39:12,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2163306.0, ans=0.125 2023-06-26 06:39:14,380 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2163306.0, ans=0.2 2023-06-26 06:39:16,079 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2163306.0, ans=0.125 2023-06-26 06:39:22,444 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2163366.0, ans=0.0 2023-06-26 06:39:43,825 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2163426.0, ans=0.2 2023-06-26 06:39:52,094 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2163426.0, ans=0.125 2023-06-26 06:39:58,844 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2163426.0, ans=0.125 2023-06-26 06:40:03,356 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-26 06:40:23,733 INFO [train.py:996] (1/4) Epoch 12, batch 25150, loss[loss=0.2412, simple_loss=0.3234, pruned_loss=0.07952, over 21802.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2995, pruned_loss=0.07703, over 4270684.05 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 06:40:36,154 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2163546.0, ans=0.0 2023-06-26 06:41:15,640 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2163666.0, ans=0.125 2023-06-26 06:41:32,251 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.32 vs. limit=15.0 2023-06-26 06:41:35,796 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-26 06:41:48,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.684e+02 1.120e+03 1.527e+03 4.461e+03, threshold=2.241e+03, percent-clipped=8.0 2023-06-26 06:42:02,964 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-26 06:42:06,943 INFO [train.py:996] (1/4) Epoch 12, batch 25200, loss[loss=0.2172, simple_loss=0.3099, pruned_loss=0.06222, over 21666.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2994, pruned_loss=0.07569, over 4273311.91 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:43:56,885 INFO [train.py:996] (1/4) Epoch 12, batch 25250, loss[loss=0.194, simple_loss=0.2647, pruned_loss=0.06159, over 21369.00 frames. 
], tot_loss[loss=0.2222, simple_loss=0.2979, pruned_loss=0.07322, over 4274379.97 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:44:33,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2164206.0, ans=10.0 2023-06-26 06:45:02,360 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2164326.0, ans=0.0 2023-06-26 06:45:09,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2164326.0, ans=0.0 2023-06-26 06:45:14,966 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2164386.0, ans=0.125 2023-06-26 06:45:25,570 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.134e+02 9.127e+02 1.634e+03 5.272e+03, threshold=1.825e+03, percent-clipped=13.0 2023-06-26 06:45:45,405 INFO [train.py:996] (1/4) Epoch 12, batch 25300, loss[loss=0.2449, simple_loss=0.3285, pruned_loss=0.08062, over 21452.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2953, pruned_loss=0.07311, over 4256206.60 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:45:49,273 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2164446.0, ans=0.125 2023-06-26 06:46:34,891 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2164566.0, ans=0.0 2023-06-26 06:47:14,569 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2164686.0, ans=0.125 2023-06-26 06:47:35,409 INFO [train.py:996] (1/4) Epoch 12, batch 25350, loss[loss=0.1938, simple_loss=0.2791, pruned_loss=0.05423, over 21706.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2995, pruned_loss=0.07334, over 4247654.21 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:47:41,229 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2164746.0, ans=0.125 2023-06-26 06:47:44,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2164746.0, ans=0.125 2023-06-26 06:48:17,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2164866.0, ans=0.2 2023-06-26 06:49:04,275 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.341e+02 9.859e+02 1.527e+03 2.457e+03 4.731e+03, threshold=3.054e+03, percent-clipped=38.0 2023-06-26 06:49:17,662 INFO [train.py:996] (1/4) Epoch 12, batch 25400, loss[loss=0.2006, simple_loss=0.262, pruned_loss=0.06965, over 21204.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2941, pruned_loss=0.07149, over 4243246.25 frames. 
], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:49:20,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2165046.0, ans=0.5 2023-06-26 06:49:44,061 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2165106.0, ans=0.0 2023-06-26 06:50:34,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2165286.0, ans=0.125 2023-06-26 06:51:05,459 INFO [train.py:996] (1/4) Epoch 12, batch 25450, loss[loss=0.213, simple_loss=0.3117, pruned_loss=0.05712, over 21597.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2933, pruned_loss=0.07236, over 4233861.07 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:51:13,028 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2165346.0, ans=0.0 2023-06-26 06:51:23,378 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165346.0, ans=0.1 2023-06-26 06:51:32,013 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2165406.0, ans=0.125 2023-06-26 06:51:36,969 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2165406.0, ans=0.0 2023-06-26 06:51:50,632 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2165466.0, ans=0.125 2023-06-26 06:52:05,050 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165526.0, ans=0.1 2023-06-26 06:52:13,994 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2165526.0, ans=0.1 2023-06-26 06:52:43,178 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.438e+02 1.024e+03 1.496e+03 3.770e+03, threshold=2.047e+03, percent-clipped=1.0 2023-06-26 06:52:55,778 INFO [train.py:996] (1/4) Epoch 12, batch 25500, loss[loss=0.1728, simple_loss=0.2866, pruned_loss=0.02954, over 20873.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2944, pruned_loss=0.06944, over 4240455.22 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:53:12,645 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-26 06:53:22,793 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165706.0, ans=0.1 2023-06-26 06:53:38,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2165766.0, ans=0.125 2023-06-26 06:54:32,437 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.15 vs. 
limit=10.0 2023-06-26 06:54:33,504 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2165886.0, ans=0.125 2023-06-26 06:54:50,344 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2165886.0, ans=0.1 2023-06-26 06:54:53,048 INFO [train.py:996] (1/4) Epoch 12, batch 25550, loss[loss=0.2157, simple_loss=0.3147, pruned_loss=0.0583, over 21661.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3027, pruned_loss=0.07066, over 4245290.39 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:54:56,875 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2165946.0, ans=0.0 2023-06-26 06:55:05,086 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-26 06:55:29,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-26 06:56:05,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=12.0 2023-06-26 06:56:31,795 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.571e+02 9.605e+02 1.487e+03 2.203e+03 4.525e+03, threshold=2.973e+03, percent-clipped=31.0 2023-06-26 06:56:43,673 INFO [train.py:996] (1/4) Epoch 12, batch 25600, loss[loss=0.273, simple_loss=0.3483, pruned_loss=0.09881, over 21415.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3063, pruned_loss=0.07178, over 4258393.74 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:56:52,253 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-26 06:57:31,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2166366.0, ans=0.125 2023-06-26 06:57:46,519 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2166426.0, ans=0.1 2023-06-26 06:57:49,648 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2166426.0, ans=0.0 2023-06-26 06:58:07,686 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2166426.0, ans=0.125 2023-06-26 06:58:18,018 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2166486.0, ans=0.2 2023-06-26 06:58:27,651 INFO [train.py:996] (1/4) Epoch 12, batch 25650, loss[loss=0.2258, simple_loss=0.2883, pruned_loss=0.08171, over 21871.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3066, pruned_loss=0.07436, over 4258876.95 frames. 
], batch size: 373, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:58:31,560 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2166546.0, ans=0.0 2023-06-26 06:58:46,827 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2166606.0, ans=0.2 2023-06-26 06:59:51,095 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2166726.0, ans=0.0 2023-06-26 07:00:05,369 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.624e+02 9.508e+02 1.344e+03 1.911e+03 3.919e+03, threshold=2.688e+03, percent-clipped=6.0 2023-06-26 07:00:17,527 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-26 07:00:18,035 INFO [train.py:996] (1/4) Epoch 12, batch 25700, loss[loss=0.2178, simple_loss=0.309, pruned_loss=0.06334, over 21732.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3051, pruned_loss=0.07564, over 4251078.33 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:01:13,922 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-26 07:01:21,816 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2166966.0, ans=0.125 2023-06-26 07:01:25,862 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167026.0, ans=0.1 2023-06-26 07:02:06,700 INFO [train.py:996] (1/4) Epoch 12, batch 25750, loss[loss=0.2133, simple_loss=0.2757, pruned_loss=0.07542, over 21168.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3074, pruned_loss=0.07776, over 4258869.99 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:02:40,726 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2167206.0, ans=0.125 2023-06-26 07:02:56,667 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2167206.0, ans=0.2 2023-06-26 07:03:20,248 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-26 07:03:44,354 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2167386.0, ans=0.125 2023-06-26 07:03:44,399 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2167386.0, ans=0.04949747468305833 2023-06-26 07:03:47,664 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.983e+02 9.621e+02 1.292e+03 1.996e+03 6.312e+03, threshold=2.583e+03, percent-clipped=12.0 2023-06-26 07:04:02,188 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2167386.0, ans=0.125 2023-06-26 07:04:02,349 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2167386.0, ans=0.125 2023-06-26 07:04:10,200 INFO [train.py:996] (1/4) Epoch 12, batch 25800, loss[loss=0.2464, simple_loss=0.321, pruned_loss=0.0859, over 21603.00 frames. 
], tot_loss[loss=0.2419, simple_loss=0.3192, pruned_loss=0.0823, over 4258354.57 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:04:26,461 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2167446.0, ans=0.0 2023-06-26 07:04:29,688 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2167446.0, ans=0.125 2023-06-26 07:04:32,595 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-26 07:05:10,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167566.0, ans=0.1 2023-06-26 07:05:39,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2167686.0, ans=0.125 2023-06-26 07:06:05,654 INFO [train.py:996] (1/4) Epoch 12, batch 25850, loss[loss=0.2502, simple_loss=0.3139, pruned_loss=0.09319, over 21716.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3218, pruned_loss=0.08276, over 4264038.94 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:06:27,233 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2167806.0, ans=0.125 2023-06-26 07:06:32,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2167806.0, ans=0.125 2023-06-26 07:06:39,037 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2167806.0, ans=0.2 2023-06-26 07:07:48,449 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 9.971e+02 1.375e+03 1.928e+03 5.111e+03, threshold=2.750e+03, percent-clipped=7.0 2023-06-26 07:07:52,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2167986.0, ans=0.125 2023-06-26 07:08:04,019 INFO [train.py:996] (1/4) Epoch 12, batch 25900, loss[loss=0.3014, simple_loss=0.3874, pruned_loss=0.1077, over 21856.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3251, pruned_loss=0.08419, over 4270915.35 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:08:33,030 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2168106.0, ans=0.125 2023-06-26 07:08:34,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2168106.0, ans=0.025 2023-06-26 07:09:26,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2168226.0, ans=0.0 2023-06-26 07:09:43,924 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-26 07:09:45,016 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2168286.0, ans=0.2 2023-06-26 07:09:55,092 INFO [train.py:996] (1/4) Epoch 12, batch 25950, loss[loss=0.2297, simple_loss=0.3205, pruned_loss=0.06947, over 21785.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3299, pruned_loss=0.08608, over 4264804.99 frames. 
], batch size: 282, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:10:08,047 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168346.0, ans=0.1 2023-06-26 07:10:09,716 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168346.0, ans=0.1 2023-06-26 07:10:45,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168466.0, ans=0.1 2023-06-26 07:11:35,008 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.821e+02 8.955e+02 1.246e+03 1.921e+03 4.535e+03, threshold=2.491e+03, percent-clipped=11.0 2023-06-26 07:11:39,450 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2168586.0, ans=0.04949747468305833 2023-06-26 07:11:45,202 INFO [train.py:996] (1/4) Epoch 12, batch 26000, loss[loss=0.2074, simple_loss=0.2923, pruned_loss=0.06131, over 21212.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3309, pruned_loss=0.08514, over 4265551.23 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:12:00,928 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-26 07:13:34,703 INFO [train.py:996] (1/4) Epoch 12, batch 26050, loss[loss=0.2742, simple_loss=0.3414, pruned_loss=0.1035, over 21816.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3308, pruned_loss=0.08551, over 4260680.00 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:13:38,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2168946.0, ans=0.125 2023-06-26 07:13:47,093 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2168946.0, ans=0.125 2023-06-26 07:14:00,396 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2169006.0, ans=0.05 2023-06-26 07:14:26,136 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2169066.0, ans=0.0 2023-06-26 07:14:46,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2169126.0, ans=0.015 2023-06-26 07:14:50,735 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2169126.0, ans=0.125 2023-06-26 07:15:04,379 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-26 07:15:11,931 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 1.039e+03 1.480e+03 2.039e+03 5.924e+03, threshold=2.960e+03, percent-clipped=13.0 2023-06-26 07:15:22,455 INFO [train.py:996] (1/4) Epoch 12, batch 26100, loss[loss=0.2766, simple_loss=0.3382, pruned_loss=0.1075, over 21847.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3249, pruned_loss=0.08532, over 4268624.98 frames. 
], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:16:48,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169426.0, ans=0.1 2023-06-26 07:16:49,679 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.17 vs. limit=22.5 2023-06-26 07:16:53,063 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-26 07:17:13,275 INFO [train.py:996] (1/4) Epoch 12, batch 26150, loss[loss=0.2422, simple_loss=0.3022, pruned_loss=0.09114, over 20095.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3216, pruned_loss=0.0858, over 4280635.29 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:17:14,790 INFO [scaling.py:962] (1/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=8.0 2023-06-26 07:17:17,235 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169546.0, ans=0.1 2023-06-26 07:17:32,722 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2169606.0, ans=0.125 2023-06-26 07:17:54,538 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2169606.0, ans=0.125 2023-06-26 07:18:29,859 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2169726.0, ans=0.0 2023-06-26 07:18:31,407 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169726.0, ans=0.1 2023-06-26 07:18:52,968 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 8.636e+02 1.260e+03 1.715e+03 2.544e+03, threshold=2.520e+03, percent-clipped=0.0 2023-06-26 07:19:03,279 INFO [train.py:996] (1/4) Epoch 12, batch 26200, loss[loss=0.2242, simple_loss=0.2997, pruned_loss=0.07433, over 20028.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3215, pruned_loss=0.08381, over 4284281.58 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:19:20,962 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2169906.0, ans=0.0 2023-06-26 07:19:29,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2169906.0, ans=0.125 2023-06-26 07:20:45,442 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-26 07:20:46,466 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170086.0, ans=0.1 2023-06-26 07:20:52,040 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-26 07:20:52,535 INFO [train.py:996] (1/4) Epoch 12, batch 26250, loss[loss=0.2242, simple_loss=0.3054, pruned_loss=0.07152, over 21845.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3255, pruned_loss=0.0822, over 4285820.80 frames. 
], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:20:55,558 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2170146.0, ans=0.035 2023-06-26 07:20:58,698 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2170146.0, ans=0.0 2023-06-26 07:21:15,978 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-26 07:22:05,286 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2170266.0, ans=0.125 2023-06-26 07:22:06,807 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2170326.0, ans=0.125 2023-06-26 07:22:31,724 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-26 07:22:31,923 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.498e+02 1.014e+03 1.360e+03 1.996e+03 4.754e+03, threshold=2.720e+03, percent-clipped=15.0 2023-06-26 07:22:42,511 INFO [train.py:996] (1/4) Epoch 12, batch 26300, loss[loss=0.242, simple_loss=0.3051, pruned_loss=0.08943, over 21847.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3222, pruned_loss=0.0826, over 4292777.95 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:22:42,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2170446.0, ans=0.0 2023-06-26 07:22:57,501 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2170446.0, ans=0.125 2023-06-26 07:22:58,156 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-26 07:23:10,485 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2170446.0, ans=0.0 2023-06-26 07:23:39,961 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2170566.0, ans=0.125 2023-06-26 07:23:50,243 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:24:05,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2170626.0, ans=0.125 2023-06-26 07:24:18,166 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170686.0, ans=0.1 2023-06-26 07:24:19,821 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2170686.0, ans=0.0 2023-06-26 07:24:26,795 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170686.0, ans=0.1 2023-06-26 07:24:40,553 INFO [train.py:996] (1/4) Epoch 12, batch 26350, loss[loss=0.2594, simple_loss=0.3328, pruned_loss=0.09301, over 21776.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3192, pruned_loss=0.08265, over 4289849.53 frames. 
], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:25:16,550 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170806.0, ans=0.1 2023-06-26 07:25:18,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170806.0, ans=0.1 2023-06-26 07:26:10,256 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.697e+02 8.952e+02 1.093e+03 1.459e+03 3.186e+03, threshold=2.186e+03, percent-clipped=0.0 2023-06-26 07:26:16,075 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2170986.0, ans=0.025 2023-06-26 07:26:25,961 INFO [train.py:996] (1/4) Epoch 12, batch 26400, loss[loss=0.2472, simple_loss=0.2921, pruned_loss=0.1011, over 21265.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3143, pruned_loss=0.0828, over 4288394.52 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:26:37,080 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2171046.0, ans=0.125 2023-06-26 07:26:45,306 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2171046.0, ans=0.125 2023-06-26 07:27:51,164 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2171226.0, ans=0.125 2023-06-26 07:28:28,528 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-26 07:28:30,907 INFO [train.py:996] (1/4) Epoch 12, batch 26450, loss[loss=0.2758, simple_loss=0.3679, pruned_loss=0.09192, over 21850.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3131, pruned_loss=0.08205, over 4279641.57 frames. ], batch size: 317, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:28:50,513 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-26 07:28:54,762 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2171406.0, ans=0.125 2023-06-26 07:29:09,220 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2171466.0, ans=0.1 2023-06-26 07:29:21,170 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2171466.0, ans=0.125 2023-06-26 07:29:54,323 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-26 07:30:11,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.179e+02 1.055e+03 2.037e+03 2.907e+03 6.247e+03, threshold=4.074e+03, percent-clipped=46.0 2023-06-26 07:30:20,930 INFO [train.py:996] (1/4) Epoch 12, batch 26500, loss[loss=0.2305, simple_loss=0.3188, pruned_loss=0.07115, over 21775.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3135, pruned_loss=0.0807, over 4274530.47 frames. 
], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:31:02,316 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2171766.0, ans=0.0 2023-06-26 07:31:43,324 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-26 07:31:51,426 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2171826.0, ans=0.125 2023-06-26 07:32:13,336 INFO [train.py:996] (1/4) Epoch 12, batch 26550, loss[loss=0.2081, simple_loss=0.3009, pruned_loss=0.05758, over 21723.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3118, pruned_loss=0.07793, over 4274493.51 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:32:30,997 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2172006.0, ans=0.2 2023-06-26 07:32:32,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2172006.0, ans=0.0 2023-06-26 07:33:12,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2172066.0, ans=0.125 2023-06-26 07:33:48,929 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.879e+02 9.613e+02 1.344e+03 2.252e+03 5.344e+03, threshold=2.687e+03, percent-clipped=4.0 2023-06-26 07:33:57,181 INFO [train.py:996] (1/4) Epoch 12, batch 26600, loss[loss=0.2387, simple_loss=0.3124, pruned_loss=0.08248, over 21485.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3097, pruned_loss=0.0753, over 4273920.74 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:34:13,236 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2172306.0, ans=0.125 2023-06-26 07:35:45,165 INFO [train.py:996] (1/4) Epoch 12, batch 26650, loss[loss=0.1705, simple_loss=0.261, pruned_loss=0.04, over 21803.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3028, pruned_loss=0.07331, over 4276524.84 frames. ], batch size: 352, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:36:33,385 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.02 vs. limit=15.0 2023-06-26 07:36:37,300 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2172666.0, ans=0.0 2023-06-26 07:36:45,692 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2172666.0, ans=0.125 2023-06-26 07:37:03,248 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2172726.0, ans=0.125 2023-06-26 07:37:08,963 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-26 07:37:20,215 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.879e+02 7.426e+02 9.644e+02 1.380e+03 2.955e+03, threshold=1.929e+03, percent-clipped=1.0 2023-06-26 07:37:27,396 INFO [train.py:996] (1/4) Epoch 12, batch 26700, loss[loss=0.2039, simple_loss=0.2844, pruned_loss=0.06176, over 21816.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2963, pruned_loss=0.07086, over 4271427.41 frames. 
], batch size: 118, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:37:28,257 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2172846.0, ans=0.0 2023-06-26 07:37:33,797 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2172846.0, ans=0.2 2023-06-26 07:37:41,408 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2172846.0, ans=0.125 2023-06-26 07:38:00,696 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2172906.0, ans=0.0 2023-06-26 07:38:35,658 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2172966.0, ans=0.1 2023-06-26 07:39:01,672 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=22.5 2023-06-26 07:39:17,617 INFO [train.py:996] (1/4) Epoch 12, batch 26750, loss[loss=0.2737, simple_loss=0.348, pruned_loss=0.09972, over 21936.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2974, pruned_loss=0.0705, over 4277990.80 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:39:39,660 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2173146.0, ans=0.0 2023-06-26 07:39:40,177 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-26 07:39:49,242 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2173146.0, ans=0.2 2023-06-26 07:40:07,837 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2173266.0, ans=0.0 2023-06-26 07:40:26,261 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2173326.0, ans=0.0 2023-06-26 07:40:36,085 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2173326.0, ans=0.0 2023-06-26 07:41:00,875 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.608e+02 7.962e+02 9.923e+02 1.450e+03 3.695e+03, threshold=1.985e+03, percent-clipped=8.0 2023-06-26 07:41:18,476 INFO [train.py:996] (1/4) Epoch 12, batch 26800, loss[loss=0.2675, simple_loss=0.3464, pruned_loss=0.09431, over 21461.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.305, pruned_loss=0.07496, over 4280397.93 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:41:25,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2173446.0, ans=0.1 2023-06-26 07:42:01,084 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2173566.0, ans=0.125 2023-06-26 07:42:16,437 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2173626.0, ans=0.0 2023-06-26 07:42:18,611 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.98 vs. 
limit=10.0 2023-06-26 07:42:29,275 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-26 07:43:10,561 INFO [train.py:996] (1/4) Epoch 12, batch 26850, loss[loss=0.253, simple_loss=0.2968, pruned_loss=0.1046, over 21441.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3068, pruned_loss=0.07835, over 4283791.65 frames. ], batch size: 510, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:43:36,528 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-26 07:44:00,087 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2173926.0, ans=0.125 2023-06-26 07:44:40,222 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 7.931e+02 1.142e+03 1.522e+03 3.683e+03, threshold=2.283e+03, percent-clipped=9.0 2023-06-26 07:44:43,986 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2173986.0, ans=0.125 2023-06-26 07:44:51,802 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2174046.0, ans=0.125 2023-06-26 07:44:52,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2174046.0, ans=0.1 2023-06-26 07:44:53,083 INFO [train.py:996] (1/4) Epoch 12, batch 26900, loss[loss=0.2484, simple_loss=0.3064, pruned_loss=0.09524, over 21891.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3001, pruned_loss=0.07799, over 4267678.25 frames. ], batch size: 125, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:45:09,909 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2174106.0, ans=0.125 2023-06-26 07:45:14,933 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2174106.0, ans=0.125 2023-06-26 07:45:34,600 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-26 07:46:04,745 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2174286.0, ans=0.1 2023-06-26 07:46:41,259 INFO [train.py:996] (1/4) Epoch 12, batch 26950, loss[loss=0.2469, simple_loss=0.343, pruned_loss=0.07543, over 21596.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2989, pruned_loss=0.07747, over 4272244.11 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:46:50,397 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2174346.0, ans=0.0 2023-06-26 07:46:54,440 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=15.0 2023-06-26 07:46:55,435 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2174346.0, ans=0.2 2023-06-26 07:46:58,923 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2174406.0, ans=0.0 2023-06-26 07:46:58,957 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:47:13,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2174406.0, ans=0.125 2023-06-26 07:48:17,128 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.465e+02 7.709e+02 1.449e+03 2.208e+03 6.166e+03, threshold=2.897e+03, percent-clipped=23.0 2023-06-26 07:48:22,397 INFO [train.py:996] (1/4) Epoch 12, batch 27000, loss[loss=0.1921, simple_loss=0.2869, pruned_loss=0.04865, over 21702.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2998, pruned_loss=0.07564, over 4268240.83 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:48:22,398 INFO [train.py:1019] (1/4) Computing validation loss 2023-06-26 07:48:40,345 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2401, simple_loss=0.3367, pruned_loss=0.07176, over 1796401.00 frames. 2023-06-26 07:48:40,346 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB 2023-06-26 07:48:41,129 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2174646.0, ans=0.125 2023-06-26 07:49:42,074 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=12.0 2023-06-26 07:50:12,851 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-26 07:50:19,404 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2174886.0, ans=0.0 2023-06-26 07:50:26,735 INFO [train.py:996] (1/4) Epoch 12, batch 27050, loss[loss=0.2124, simple_loss=0.3025, pruned_loss=0.06114, over 21592.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3016, pruned_loss=0.07229, over 4268280.52 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:50:35,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2174946.0, ans=0.0 2023-06-26 07:50:40,386 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2174946.0, ans=0.125 2023-06-26 07:50:44,333 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-26 07:50:50,370 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2175006.0, ans=0.0 2023-06-26 07:50:54,203 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-06-26 07:52:06,704 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.688e+02 1.022e+03 1.429e+03 2.477e+03 5.037e+03, threshold=2.858e+03, percent-clipped=17.0 2023-06-26 07:52:12,151 INFO [train.py:996] (1/4) Epoch 12, batch 27100, loss[loss=0.2107, simple_loss=0.3318, pruned_loss=0.04482, over 19752.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3031, pruned_loss=0.07382, over 4267199.03 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:53:20,313 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2175426.0, ans=0.125 2023-06-26 07:53:21,730 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2175426.0, ans=0.0 2023-06-26 07:53:36,231 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2175426.0, ans=0.0 2023-06-26 07:54:02,040 INFO [train.py:996] (1/4) Epoch 12, batch 27150, loss[loss=0.2699, simple_loss=0.3646, pruned_loss=0.08759, over 21758.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3144, pruned_loss=0.07662, over 4269297.53 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:54:22,509 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-26 07:54:30,311 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2175606.0, ans=0.125 2023-06-26 07:55:14,464 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2175726.0, ans=0.125 2023-06-26 07:55:35,409 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2175786.0, ans=0.125 2023-06-26 07:55:40,273 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.180e+02 9.529e+02 1.614e+03 2.368e+03 4.440e+03, threshold=3.227e+03, percent-clipped=12.0 2023-06-26 07:55:45,345 INFO [train.py:996] (1/4) Epoch 12, batch 27200, loss[loss=0.2532, simple_loss=0.3385, pruned_loss=0.08393, over 21739.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3231, pruned_loss=0.07908, over 4269696.19 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:55:57,830 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2175846.0, ans=0.125 2023-06-26 07:56:10,796 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2175906.0, ans=0.95 2023-06-26 07:56:38,076 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2175906.0, ans=0.05 2023-06-26 07:57:33,900 INFO [train.py:996] (1/4) Epoch 12, batch 27250, loss[loss=0.2958, simple_loss=0.3595, pruned_loss=0.1161, over 21595.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.325, pruned_loss=0.08282, over 4272648.95 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:58:31,734 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. 
limit=15.0 2023-06-26 07:58:40,197 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2176266.0, ans=0.0 2023-06-26 07:59:24,077 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.268e+02 8.548e+02 1.157e+03 1.722e+03 4.031e+03, threshold=2.315e+03, percent-clipped=3.0 2023-06-26 07:59:32,608 INFO [train.py:996] (1/4) Epoch 12, batch 27300, loss[loss=0.24, simple_loss=0.3332, pruned_loss=0.0734, over 21927.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3275, pruned_loss=0.08426, over 4271710.99 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:59:49,073 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-26 08:00:22,647 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-26 08:01:08,628 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2176686.0, ans=0.0 2023-06-26 08:01:21,500 INFO [train.py:996] (1/4) Epoch 12, batch 27350, loss[loss=0.2171, simple_loss=0.3055, pruned_loss=0.06431, over 21691.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3295, pruned_loss=0.08462, over 4264014.12 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:02:07,528 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:02:08,079 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-26 08:02:43,330 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2176986.0, ans=0.125 2023-06-26 08:02:58,676 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2176986.0, ans=0.035 2023-06-26 08:02:59,809 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.844e+02 9.303e+02 1.156e+03 1.696e+03 4.537e+03, threshold=2.312e+03, percent-clipped=11.0 2023-06-26 08:03:08,453 INFO [train.py:996] (1/4) Epoch 12, batch 27400, loss[loss=0.1954, simple_loss=0.2683, pruned_loss=0.06127, over 21671.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3246, pruned_loss=0.0839, over 4273194.89 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:03:16,003 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2177046.0, ans=0.125 2023-06-26 08:03:18,082 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2177046.0, ans=0.125 2023-06-26 08:03:40,176 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2177106.0, ans=0.2 2023-06-26 08:04:07,411 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.48 vs. 
limit=12.0 2023-06-26 08:04:12,142 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2177226.0, ans=0.125 2023-06-26 08:04:56,354 INFO [train.py:996] (1/4) Epoch 12, batch 27450, loss[loss=0.218, simple_loss=0.3079, pruned_loss=0.06402, over 21538.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3178, pruned_loss=0.08159, over 4270012.50 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:04:58,915 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2177346.0, ans=0.125 2023-06-26 08:05:06,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2177346.0, ans=0.125 2023-06-26 08:05:20,440 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2177406.0, ans=0.125 2023-06-26 08:05:35,902 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2177466.0, ans=0.1 2023-06-26 08:05:41,429 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-26 08:06:14,192 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2177586.0, ans=0.125 2023-06-26 08:06:35,010 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.529e+02 8.908e+02 1.196e+03 1.781e+03 4.590e+03, threshold=2.391e+03, percent-clipped=13.0 2023-06-26 08:06:38,338 INFO [train.py:996] (1/4) Epoch 12, batch 27500, loss[loss=0.2514, simple_loss=0.3179, pruned_loss=0.09243, over 21557.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3155, pruned_loss=0.08145, over 4274887.73 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:06:57,325 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2177706.0, ans=0.0 2023-06-26 08:07:06,210 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2177706.0, ans=0.125 2023-06-26 08:07:06,341 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2177706.0, ans=0.95 2023-06-26 08:07:41,307 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2177826.0, ans=0.0 2023-06-26 08:08:28,011 INFO [train.py:996] (1/4) Epoch 12, batch 27550, loss[loss=0.1966, simple_loss=0.2756, pruned_loss=0.05882, over 21678.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3103, pruned_loss=0.07899, over 4283331.27 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:09:19,605 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2178066.0, ans=0.2 2023-06-26 08:09:23,439 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=22.5 2023-06-26 08:09:46,878 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2178126.0, ans=0.1 2023-06-26 08:10:06,102 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2178186.0, ans=0.0 2023-06-26 08:10:12,157 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.726e+02 8.167e+02 1.305e+03 2.054e+03 4.575e+03, threshold=2.609e+03, percent-clipped=17.0 2023-06-26 08:10:15,788 INFO [train.py:996] (1/4) Epoch 12, batch 27600, loss[loss=0.191, simple_loss=0.2536, pruned_loss=0.06419, over 21619.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3031, pruned_loss=0.07734, over 4268132.13 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:10:52,115 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2178366.0, ans=0.1 2023-06-26 08:11:27,589 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2178426.0, ans=0.125 2023-06-26 08:11:36,179 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2178426.0, ans=0.0 2023-06-26 08:11:55,063 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2178486.0, ans=0.2 2023-06-26 08:12:00,908 INFO [train.py:996] (1/4) Epoch 12, batch 27650, loss[loss=0.2099, simple_loss=0.2927, pruned_loss=0.06357, over 21862.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2993, pruned_loss=0.07722, over 4270104.55 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:12:13,305 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2178546.0, ans=0.0 2023-06-26 08:12:53,287 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2178666.0, ans=0.125 2023-06-26 08:12:54,577 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2178666.0, ans=0.0 2023-06-26 08:12:54,702 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2178666.0, ans=0.2 2023-06-26 08:13:17,620 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2178726.0, ans=0.125 2023-06-26 08:13:20,119 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-26 08:13:45,508 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2178786.0, ans=0.2 2023-06-26 08:13:46,682 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.067e+02 8.445e+02 1.173e+03 2.476e+03 5.396e+03, threshold=2.346e+03, percent-clipped=23.0 2023-06-26 08:13:49,053 INFO [train.py:996] (1/4) Epoch 12, batch 27700, loss[loss=0.3317, simple_loss=0.3995, pruned_loss=0.132, over 21525.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3003, pruned_loss=0.07666, over 4278494.81 frames. 
], batch size: 508, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:13:57,946 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2178846.0, ans=0.07 2023-06-26 08:13:59,484 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2178846.0, ans=0.2 2023-06-26 08:14:02,151 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.22 vs. limit=22.5 2023-06-26 08:15:25,881 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2179086.0, ans=0.0 2023-06-26 08:15:33,985 INFO [train.py:996] (1/4) Epoch 12, batch 27750, loss[loss=0.1945, simple_loss=0.2838, pruned_loss=0.05261, over 21650.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3031, pruned_loss=0.07559, over 4280857.83 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:15:59,951 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2179206.0, ans=0.2 2023-06-26 08:16:44,032 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2179326.0, ans=0.0 2023-06-26 08:17:17,684 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.501e+02 1.486e+03 2.223e+03 3.680e+03, threshold=2.972e+03, percent-clipped=21.0 2023-06-26 08:17:19,392 INFO [train.py:996] (1/4) Epoch 12, batch 27800, loss[loss=0.1886, simple_loss=0.2419, pruned_loss=0.06767, over 20425.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3014, pruned_loss=0.07551, over 4285148.11 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:17:57,339 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-26 08:18:18,890 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2179626.0, ans=0.2 2023-06-26 08:18:26,180 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2179626.0, ans=0.2 2023-06-26 08:18:38,604 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2179626.0, ans=0.125 2023-06-26 08:18:54,752 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179686.0, ans=0.125 2023-06-26 08:18:59,216 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-26 08:19:03,780 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2179686.0, ans=0.1 2023-06-26 08:19:10,329 INFO [train.py:996] (1/4) Epoch 12, batch 27850, loss[loss=0.2254, simple_loss=0.2887, pruned_loss=0.08107, over 21586.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3008, pruned_loss=0.07676, over 4285449.45 frames. ], batch size: 212, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:20:06,944 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. 
limit=15.0 2023-06-26 08:20:18,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2179926.0, ans=0.125 2023-06-26 08:20:38,801 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:20:53,761 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 9.151e+02 1.351e+03 2.039e+03 4.854e+03, threshold=2.703e+03, percent-clipped=13.0 2023-06-26 08:20:55,762 INFO [train.py:996] (1/4) Epoch 12, batch 27900, loss[loss=0.2241, simple_loss=0.306, pruned_loss=0.07113, over 21730.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3098, pruned_loss=0.07786, over 4282320.88 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:21:04,898 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2180046.0, ans=0.125 2023-06-26 08:21:15,708 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.92 vs. limit=15.0 2023-06-26 08:21:20,623 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2180106.0, ans=0.035 2023-06-26 08:21:20,637 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2180106.0, ans=0.125 2023-06-26 08:21:51,389 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2180166.0, ans=0.125 2023-06-26 08:22:19,297 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2180226.0, ans=0.125 2023-06-26 08:22:43,345 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-26 08:22:47,202 INFO [train.py:996] (1/4) Epoch 12, batch 27950, loss[loss=0.2291, simple_loss=0.3182, pruned_loss=0.07002, over 21703.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3105, pruned_loss=0.07526, over 4281156.20 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:22:57,271 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-26 08:23:33,934 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2180406.0, ans=0.2 2023-06-26 08:23:37,417 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-26 08:24:33,136 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 7.857e+02 1.156e+03 1.726e+03 4.008e+03, threshold=2.312e+03, percent-clipped=7.0 2023-06-26 08:24:34,666 INFO [train.py:996] (1/4) Epoch 12, batch 28000, loss[loss=0.2297, simple_loss=0.2982, pruned_loss=0.08064, over 21358.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3075, pruned_loss=0.07256, over 4287089.67 frames. 
], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:25:10,040 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2180706.0, ans=0.125 2023-06-26 08:25:59,775 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.73 vs. limit=15.0 2023-06-26 08:26:02,338 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2180826.0, ans=0.125 2023-06-26 08:26:05,663 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2180886.0, ans=0.05 2023-06-26 08:26:21,223 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2180946.0, ans=0.0 2023-06-26 08:26:22,273 INFO [train.py:996] (1/4) Epoch 12, batch 28050, loss[loss=0.2061, simple_loss=0.267, pruned_loss=0.07263, over 21822.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3045, pruned_loss=0.07409, over 4294221.41 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:26:26,615 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2180946.0, ans=0.2 2023-06-26 08:27:38,051 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2181126.0, ans=0.125 2023-06-26 08:27:40,564 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=15.0 2023-06-26 08:28:11,568 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.128e+02 1.251e+03 1.568e+03 2.445e+03 4.428e+03, threshold=3.136e+03, percent-clipped=24.0 2023-06-26 08:28:13,344 INFO [train.py:996] (1/4) Epoch 12, batch 28100, loss[loss=0.2151, simple_loss=0.2649, pruned_loss=0.0826, over 19982.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3044, pruned_loss=0.07452, over 4288534.64 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:28:21,069 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2181246.0, ans=0.125 2023-06-26 08:28:31,419 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2181246.0, ans=0.2 2023-06-26 08:28:48,840 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2181306.0, ans=0.125 2023-06-26 08:29:05,181 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2181306.0, ans=0.125 2023-06-26 08:30:00,949 INFO [train.py:996] (1/4) Epoch 12, batch 28150, loss[loss=0.2452, simple_loss=0.3, pruned_loss=0.09524, over 21834.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3001, pruned_loss=0.07456, over 4287154.50 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:30:02,070 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.16 vs. 
limit=15.0 2023-06-26 08:30:10,269 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2181546.0, ans=0.0 2023-06-26 08:30:27,078 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2181546.0, ans=0.0 2023-06-26 08:30:38,787 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-26 08:31:09,363 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2181666.0, ans=0.125 2023-06-26 08:31:50,394 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.207e+02 9.538e+02 1.303e+03 2.099e+03 3.954e+03, threshold=2.605e+03, percent-clipped=5.0 2023-06-26 08:31:52,088 INFO [train.py:996] (1/4) Epoch 12, batch 28200, loss[loss=0.2201, simple_loss=0.2959, pruned_loss=0.07213, over 22012.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2979, pruned_loss=0.07601, over 4292545.93 frames. ], batch size: 103, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:32:39,890 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:33:05,107 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2182026.0, ans=0.125 2023-06-26 08:33:07,347 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2182026.0, ans=0.2 2023-06-26 08:33:10,700 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2182026.0, ans=0.0 2023-06-26 08:33:53,926 INFO [train.py:996] (1/4) Epoch 12, batch 28250, loss[loss=0.215, simple_loss=0.2774, pruned_loss=0.07631, over 21227.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3019, pruned_loss=0.07896, over 4290700.14 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:34:34,280 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182266.0, ans=0.1 2023-06-26 08:34:50,470 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2182326.0, ans=0.125 2023-06-26 08:34:57,760 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2182326.0, ans=0.1 2023-06-26 08:35:03,735 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-26 08:35:14,803 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-26 08:35:42,656 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.099e+02 9.506e+02 1.674e+03 2.975e+03 5.994e+03, threshold=3.348e+03, percent-clipped=30.0 2023-06-26 08:35:44,312 INFO [train.py:996] (1/4) Epoch 12, batch 28300, loss[loss=0.1874, simple_loss=0.2707, pruned_loss=0.05206, over 21452.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2998, pruned_loss=0.0768, over 4285382.81 frames. 
], batch size: 194, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:36:08,124 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2182506.0, ans=0.125 2023-06-26 08:36:31,258 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-26 08:36:44,800 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2182626.0, ans=0.2 2023-06-26 08:37:41,292 INFO [train.py:996] (1/4) Epoch 12, batch 28350, loss[loss=0.2229, simple_loss=0.2908, pruned_loss=0.07752, over 21563.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2959, pruned_loss=0.07097, over 4282160.51 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:39:06,036 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2182986.0, ans=0.125 2023-06-26 08:39:27,863 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 9.118e+02 1.283e+03 2.054e+03 3.751e+03, threshold=2.567e+03, percent-clipped=3.0 2023-06-26 08:39:29,425 INFO [train.py:996] (1/4) Epoch 12, batch 28400, loss[loss=0.2102, simple_loss=0.2854, pruned_loss=0.0675, over 21267.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2921, pruned_loss=0.07052, over 4274813.82 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 08:41:22,198 INFO [train.py:996] (1/4) Epoch 12, batch 28450, loss[loss=0.2453, simple_loss=0.3269, pruned_loss=0.08187, over 21782.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2965, pruned_loss=0.07385, over 4267382.06 frames. ], batch size: 112, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:41:26,698 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-26 08:41:29,515 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2183346.0, ans=0.125 2023-06-26 08:42:57,184 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2183586.0, ans=0.125 2023-06-26 08:42:58,805 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2183586.0, ans=0.0 2023-06-26 08:43:12,224 INFO [train.py:996] (1/4) Epoch 12, batch 28500, loss[loss=0.2343, simple_loss=0.3046, pruned_loss=0.08194, over 21597.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3, pruned_loss=0.07673, over 4280876.34 frames. 
], batch size: 263, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:43:13,813 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.783e+02 7.825e+02 1.127e+03 1.628e+03 3.596e+03, threshold=2.254e+03, percent-clipped=4.0 2023-06-26 08:43:14,519 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:43:38,530 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2183706.0, ans=0.125 2023-06-26 08:43:49,163 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2183706.0, ans=0.125 2023-06-26 08:43:55,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2183766.0, ans=0.1 2023-06-26 08:43:57,967 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2183766.0, ans=0.1 2023-06-26 08:44:21,019 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2183826.0, ans=0.125 2023-06-26 08:44:49,236 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-26 08:44:55,202 INFO [train.py:996] (1/4) Epoch 12, batch 28550, loss[loss=0.2584, simple_loss=0.3548, pruned_loss=0.08107, over 21837.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3085, pruned_loss=0.07953, over 4285411.78 frames. ], batch size: 316, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:44:57,490 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2183946.0, ans=0.04949747468305833 2023-06-26 08:45:31,339 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2184006.0, ans=0.125 2023-06-26 08:45:35,335 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2184006.0, ans=15.0 2023-06-26 08:45:39,597 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-26 08:46:09,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2184126.0, ans=0.1 2023-06-26 08:46:28,911 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2184186.0, ans=0.5 2023-06-26 08:46:45,253 INFO [train.py:996] (1/4) Epoch 12, batch 28600, loss[loss=0.2287, simple_loss=0.3126, pruned_loss=0.07244, over 21405.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3153, pruned_loss=0.08193, over 4286739.93 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:46:47,105 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.084e+02 9.095e+02 1.267e+03 2.017e+03 3.986e+03, threshold=2.534e+03, percent-clipped=14.0 2023-06-26 08:47:14,412 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:47:58,596 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.22 vs. 
limit=22.5 2023-06-26 08:48:11,334 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2184426.0, ans=0.04949747468305833 2023-06-26 08:48:30,452 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-26 08:48:39,278 INFO [train.py:996] (1/4) Epoch 12, batch 28650, loss[loss=0.1953, simple_loss=0.2634, pruned_loss=0.06366, over 21655.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3099, pruned_loss=0.08114, over 4278198.48 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:49:14,260 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-26 08:49:15,348 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2184606.0, ans=0.5 2023-06-26 08:50:36,123 INFO [train.py:996] (1/4) Epoch 12, batch 28700, loss[loss=0.2028, simple_loss=0.2408, pruned_loss=0.08238, over 20074.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3072, pruned_loss=0.08145, over 4278441.82 frames. ], batch size: 704, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:50:37,768 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.894e+02 1.316e+03 2.106e+03 4.413e+03, threshold=2.633e+03, percent-clipped=13.0 2023-06-26 08:50:51,800 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.62 vs. limit=12.0 2023-06-26 08:51:04,480 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2184906.0, ans=0.0 2023-06-26 08:51:07,543 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2184906.0, ans=0.125 2023-06-26 08:52:24,677 INFO [train.py:996] (1/4) Epoch 12, batch 28750, loss[loss=0.2678, simple_loss=0.3425, pruned_loss=0.09654, over 22045.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.307, pruned_loss=0.08182, over 4280291.95 frames. ], batch size: 119, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:52:32,791 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-26 08:53:17,499 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2185266.0, ans=0.125 2023-06-26 08:54:13,713 INFO [train.py:996] (1/4) Epoch 12, batch 28800, loss[loss=0.2784, simple_loss=0.3512, pruned_loss=0.1028, over 21234.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3111, pruned_loss=0.08204, over 4282563.22 frames. 
], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:54:15,167 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.541e+02 8.442e+02 1.126e+03 1.600e+03 2.728e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 08:54:17,851 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2185446.0, ans=0.0 2023-06-26 08:54:28,025 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:54:50,140 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2185506.0, ans=0.125 2023-06-26 08:54:52,255 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-26 08:55:51,463 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-26 08:56:00,437 INFO [train.py:996] (1/4) Epoch 12, batch 28850, loss[loss=0.2379, simple_loss=0.302, pruned_loss=0.08686, over 21379.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.312, pruned_loss=0.08358, over 4283008.69 frames. ], batch size: 159, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:56:38,280 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-06-26 08:56:50,687 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-26 08:57:26,507 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2185926.0, ans=0.07 2023-06-26 08:57:55,861 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2186046.0, ans=0.125 2023-06-26 08:57:57,028 INFO [train.py:996] (1/4) Epoch 12, batch 28900, loss[loss=0.2539, simple_loss=0.3171, pruned_loss=0.09534, over 21489.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3172, pruned_loss=0.08584, over 4280256.95 frames. ], batch size: 211, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:58:06,911 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.674e+02 9.203e+02 1.221e+03 1.627e+03 4.762e+03, threshold=2.441e+03, percent-clipped=10.0 2023-06-26 08:58:11,591 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-26 08:58:16,513 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:58:17,267 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.87 vs. limit=15.0 2023-06-26 08:59:41,982 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2186286.0, ans=0.1 2023-06-26 08:59:46,096 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-26 08:59:49,855 INFO [train.py:996] (1/4) Epoch 12, batch 28950, loss[loss=0.2809, simple_loss=0.3775, pruned_loss=0.09218, over 21487.00 frames. 
], tot_loss[loss=0.2433, simple_loss=0.3164, pruned_loss=0.08509, over 4284666.86 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:59:56,101 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-26 09:00:06,009 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2186406.0, ans=0.125 2023-06-26 09:00:40,859 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-26 09:00:52,304 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:00:57,414 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2186526.0, ans=0.125 2023-06-26 09:01:20,185 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-26 09:01:39,458 INFO [train.py:996] (1/4) Epoch 12, batch 29000, loss[loss=0.3176, simple_loss=0.3779, pruned_loss=0.1286, over 21349.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3202, pruned_loss=0.08346, over 4284571.82 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:01:42,724 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.937e+02 9.875e+02 1.439e+03 2.075e+03 4.824e+03, threshold=2.879e+03, percent-clipped=20.0 2023-06-26 09:01:45,203 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2186646.0, ans=0.2 2023-06-26 09:03:05,033 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2186826.0, ans=0.2 2023-06-26 09:03:20,007 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2186886.0, ans=0.125 2023-06-26 09:03:27,714 INFO [train.py:996] (1/4) Epoch 12, batch 29050, loss[loss=0.2598, simple_loss=0.3208, pruned_loss=0.09936, over 21604.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3186, pruned_loss=0.08413, over 4284483.11 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:03:33,359 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2186946.0, ans=0.0 2023-06-26 09:04:08,043 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-26 09:05:17,242 INFO [train.py:996] (1/4) Epoch 12, batch 29100, loss[loss=0.2133, simple_loss=0.2776, pruned_loss=0.07451, over 21532.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3098, pruned_loss=0.08135, over 4286723.47 frames. 
], batch size: 391, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:05:17,630 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187246.0, ans=0.1 2023-06-26 09:05:20,344 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.010e+02 8.838e+02 1.250e+03 1.736e+03 3.044e+03, threshold=2.501e+03, percent-clipped=1.0 2023-06-26 09:05:39,114 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2187306.0, ans=0.125 2023-06-26 09:05:42,417 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2187306.0, ans=0.2 2023-06-26 09:06:29,571 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2187426.0, ans=0.0 2023-06-26 09:07:04,381 INFO [train.py:996] (1/4) Epoch 12, batch 29150, loss[loss=0.2672, simple_loss=0.3381, pruned_loss=0.09813, over 21345.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3104, pruned_loss=0.08065, over 4277688.43 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:07:09,779 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187546.0, ans=0.1 2023-06-26 09:07:25,024 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2187606.0, ans=0.0 2023-06-26 09:07:26,567 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2187606.0, ans=0.0 2023-06-26 09:07:36,158 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:07:45,523 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2187606.0, ans=0.0 2023-06-26 09:07:47,406 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2187666.0, ans=0.125 2023-06-26 09:08:31,176 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-26 09:08:31,243 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 09:08:33,956 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2187786.0, ans=0.125 2023-06-26 09:08:51,329 INFO [train.py:996] (1/4) Epoch 12, batch 29200, loss[loss=0.2362, simple_loss=0.2892, pruned_loss=0.09158, over 21571.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3057, pruned_loss=0.08, over 4270484.78 frames. 
], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:08:54,717 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.344e+02 9.543e+02 1.197e+03 1.931e+03 4.156e+03, threshold=2.395e+03, percent-clipped=13.0 2023-06-26 09:10:00,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2187966.0, ans=0.0 2023-06-26 09:10:09,516 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2188026.0, ans=0.125 2023-06-26 09:10:14,860 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2188026.0, ans=0.0 2023-06-26 09:10:39,258 INFO [train.py:996] (1/4) Epoch 12, batch 29250, loss[loss=0.2206, simple_loss=0.299, pruned_loss=0.07108, over 21346.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3046, pruned_loss=0.07786, over 4273447.97 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:10:40,127 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2188146.0, ans=0.04949747468305833 2023-06-26 09:10:40,141 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2188146.0, ans=0.0 2023-06-26 09:10:43,622 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2188146.0, ans=0.2 2023-06-26 09:11:04,691 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2188206.0, ans=0.2 2023-06-26 09:11:34,668 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2188266.0, ans=0.125 2023-06-26 09:11:57,971 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2188326.0, ans=0.1 2023-06-26 09:12:24,403 INFO [train.py:996] (1/4) Epoch 12, batch 29300, loss[loss=0.2014, simple_loss=0.2493, pruned_loss=0.07676, over 20053.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3048, pruned_loss=0.07682, over 4267332.81 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:12:27,564 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.498e+02 8.933e+02 1.350e+03 1.847e+03 4.429e+03, threshold=2.700e+03, percent-clipped=13.0 2023-06-26 09:13:57,309 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2188686.0, ans=0.125 2023-06-26 09:14:04,023 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:14:12,351 INFO [train.py:996] (1/4) Epoch 12, batch 29350, loss[loss=0.2302, simple_loss=0.3, pruned_loss=0.08025, over 21441.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3013, pruned_loss=0.07661, over 4265333.68 frames. 
], batch size: 195, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:14:40,022 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2188806.0, ans=0.04949747468305833 2023-06-26 09:15:07,608 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2188806.0, ans=0.0 2023-06-26 09:15:11,778 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-26 09:15:32,091 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2188926.0, ans=0.2 2023-06-26 09:16:00,631 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2188986.0, ans=0.5 2023-06-26 09:16:14,809 INFO [train.py:996] (1/4) Epoch 12, batch 29400, loss[loss=0.1914, simple_loss=0.2911, pruned_loss=0.04584, over 21183.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3011, pruned_loss=0.0735, over 4272260.94 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:16:18,033 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 9.109e+02 1.324e+03 1.907e+03 3.871e+03, threshold=2.647e+03, percent-clipped=5.0 2023-06-26 09:17:10,078 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-26 09:17:16,315 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2189166.0, ans=0.125 2023-06-26 09:17:36,910 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189286.0, ans=0.1 2023-06-26 09:18:07,002 INFO [train.py:996] (1/4) Epoch 12, batch 29450, loss[loss=0.2443, simple_loss=0.3237, pruned_loss=0.08244, over 21724.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2995, pruned_loss=0.07341, over 4272473.11 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:18:18,659 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2189346.0, ans=0.0 2023-06-26 09:18:19,229 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-26 09:18:39,954 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. 
limit=15.0 2023-06-26 09:18:53,340 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2189466.0, ans=0.2 2023-06-26 09:19:13,098 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2189526.0, ans=0.0 2023-06-26 09:19:20,285 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2189526.0, ans=0.125 2023-06-26 09:19:48,593 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2189586.0, ans=0.0 2023-06-26 09:19:53,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189646.0, ans=0.1 2023-06-26 09:19:55,245 INFO [train.py:996] (1/4) Epoch 12, batch 29500, loss[loss=0.2279, simple_loss=0.3004, pruned_loss=0.0777, over 21865.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3013, pruned_loss=0.07583, over 4273311.13 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:20:06,886 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 8.502e+02 1.287e+03 1.918e+03 4.991e+03, threshold=2.573e+03, percent-clipped=9.0 2023-06-26 09:20:26,742 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-26 09:20:58,848 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2189826.0, ans=0.125 2023-06-26 09:21:03,919 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-26 09:21:04,964 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2189826.0, ans=0.125 2023-06-26 09:21:11,813 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2189826.0, ans=0.0 2023-06-26 09:21:45,276 INFO [train.py:996] (1/4) Epoch 12, batch 29550, loss[loss=0.2402, simple_loss=0.3071, pruned_loss=0.08668, over 21743.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3022, pruned_loss=0.07819, over 4284574.58 frames. ], batch size: 473, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:22:24,266 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2190006.0, ans=0.125 2023-06-26 09:23:18,387 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2190126.0, ans=0.0 2023-06-26 09:23:42,798 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2190246.0, ans=0.0 2023-06-26 09:23:49,843 INFO [train.py:996] (1/4) Epoch 12, batch 29600, loss[loss=0.269, simple_loss=0.3502, pruned_loss=0.09384, over 21708.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3075, pruned_loss=0.07991, over 4284759.06 frames. 
], batch size: 247, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:23:55,075 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 9.952e+02 1.397e+03 2.179e+03 6.553e+03, threshold=2.795e+03, percent-clipped=14.0
2023-06-26 09:24:05,534 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2190306.0, ans=0.125
2023-06-26 09:24:05,555 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2190306.0, ans=0.125
2023-06-26 09:24:46,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2190426.0, ans=0.0
2023-06-26 09:24:55,733 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0
2023-06-26 09:25:35,958 INFO [train.py:996] (1/4) Epoch 12, batch 29650, loss[loss=0.2053, simple_loss=0.276, pruned_loss=0.06733, over 21535.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3035, pruned_loss=0.0763, over 4282122.45 frames. ], batch size: 211, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:25:51,842 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2190606.0, ans=0.125
2023-06-26 09:26:09,029 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2190666.0, ans=0.0
2023-06-26 09:26:20,172 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0
2023-06-26 09:26:22,750 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2190666.0, ans=0.0
2023-06-26 09:26:26,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2190726.0, ans=0.125
2023-06-26 09:27:04,122 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=2190786.0, ans=0.02
2023-06-26 09:27:22,330 INFO [train.py:996] (1/4) Epoch 12, batch 29700, loss[loss=0.2711, simple_loss=0.3724, pruned_loss=0.0849, over 21654.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.308, pruned_loss=0.07687, over 4280404.66 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:27:26,159 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2190846.0, ans=0.2
2023-06-26 09:27:27,214 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 1.083e+03 1.930e+03 2.745e+03 6.322e+03, threshold=3.861e+03, percent-clipped=21.0
2023-06-26 09:27:51,913 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2190906.0, ans=0.125
2023-06-26 09:28:22,783 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2191026.0, ans=0.125
2023-06-26 09:28:32,430 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2191026.0, ans=0.5
2023-06-26 09:28:46,542 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2191086.0, ans=0.125
2023-06-26 09:28:57,296 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2191086.0, ans=0.2
2023-06-26 09:29:06,640 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0
2023-06-26 09:29:06,925 INFO [train.py:996] (1/4) Epoch 12, batch 29750, loss[loss=0.2409, simple_loss=0.3363, pruned_loss=0.07279, over 21792.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3114, pruned_loss=0.07607, over 4281464.30 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:29:26,375 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00
2023-06-26 09:30:19,471 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2191326.0, ans=0.125
2023-06-26 09:30:53,178 INFO [train.py:996] (1/4) Epoch 12, batch 29800, loss[loss=0.2249, simple_loss=0.298, pruned_loss=0.07593, over 21683.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3126, pruned_loss=0.07679, over 4280000.57 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:30:58,549 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 9.303e+02 1.235e+03 1.702e+03 3.129e+03, threshold=2.469e+03, percent-clipped=0.0
2023-06-26 09:31:56,895 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=22.5
2023-06-26 09:32:02,892 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2191626.0, ans=0.125
2023-06-26 09:32:33,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2191686.0, ans=0.125
2023-06-26 09:32:37,247 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0
2023-06-26 09:32:37,770 INFO [train.py:996] (1/4) Epoch 12, batch 29850, loss[loss=0.1934, simple_loss=0.2684, pruned_loss=0.05922, over 21179.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3087, pruned_loss=0.07489, over 4279289.84 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:33:05,297 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0
2023-06-26 09:33:27,381 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2191866.0, ans=0.1
2023-06-26 09:33:30,607 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2191866.0, ans=0.125
2023-06-26 09:34:29,928 INFO [train.py:996] (1/4) Epoch 12, batch 29900, loss[loss=0.323, simple_loss=0.3718, pruned_loss=0.137, over 21506.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3082, pruned_loss=0.07715, over 4281545.53 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:34:35,015 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 6.187e+02 8.891e+02 1.162e+03 1.804e+03 4.059e+03, threshold=2.324e+03, percent-clipped=12.0
2023-06-26 09:34:36,003 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0
2023-06-26 09:36:10,354 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5
2023-06-26 09:36:19,859 INFO [train.py:996] (1/4) Epoch 12, batch 29950, loss[loss=0.2501, simple_loss=0.3236, pruned_loss=0.08824, over 21468.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3126, pruned_loss=0.08113, over 4286135.26 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:36:44,193 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2192406.0, ans=0.125
2023-06-26 09:37:02,689 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0
2023-06-26 09:37:16,359 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0
2023-06-26 09:37:39,521 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2192526.0, ans=0.0
2023-06-26 09:37:41,143 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2192526.0, ans=0.1
2023-06-26 09:38:04,532 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0
2023-06-26 09:38:08,278 INFO [train.py:996] (1/4) Epoch 12, batch 30000, loss[loss=0.2073, simple_loss=0.3268, pruned_loss=0.04387, over 20756.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3139, pruned_loss=0.08126, over 4280493.24 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 32.0
2023-06-26 09:38:08,279 INFO [train.py:1019] (1/4) Computing validation loss
2023-06-26 09:38:23,950 INFO [zipformer.py:1728] (1/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.5630, 1.0995, 1.3658, 1.7384, 1.6012, 1.6433, 1.6403, 1.4877], device='cuda:1')
2023-06-26 09:38:26,468 INFO [train.py:1028] (1/4) Epoch 12, validation: loss=0.2465, simple_loss=0.3441, pruned_loss=0.07444, over 1796401.00 frames.
2023-06-26 09:38:26,469 INFO [train.py:1029] (1/4) Maximum memory allocated so far is 24453MB
2023-06-26 09:38:27,065 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2192646.0, ans=0.05
2023-06-26 09:38:36,133 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2192646.0, ans=0.2
2023-06-26 09:38:37,055 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.580e+02 9.008e+02 1.239e+03 1.696e+03 3.151e+03, threshold=2.479e+03, percent-clipped=8.0
2023-06-26 09:39:41,586 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.66 vs. limit=15.0
2023-06-26 09:40:32,830 INFO [train.py:996] (1/4) Epoch 12, batch 30050, loss[loss=0.2222, simple_loss=0.359, pruned_loss=0.04274, over 19716.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3179, pruned_loss=0.0786, over 4275886.99 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:40:41,473 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2192946.0, ans=0.0
2023-06-26 09:42:10,388 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2193186.0, ans=0.07
2023-06-26 09:42:10,445 INFO [scaling.py:1052] (1/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.722e-03
2023-06-26 09:42:30,769 INFO [train.py:996] (1/4) Epoch 12, batch 30100, loss[loss=0.2343, simple_loss=0.2936, pruned_loss=0.08756, over 21537.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3163, pruned_loss=0.07813, over 4269024.87 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:42:42,351 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.763e+02 1.437e+03 2.326e+03 3.330e+03 7.267e+03, threshold=4.652e+03, percent-clipped=46.0
2023-06-26 09:42:52,885 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2193306.0, ans=0.0
2023-06-26 09:43:08,540 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2193366.0, ans=0.125
2023-06-26 09:44:04,561 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2193486.0, ans=0.0
2023-06-26 09:44:08,453 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2193486.0, ans=0.125
2023-06-26 09:44:19,986 INFO [train.py:996] (1/4) Epoch 12, batch 30150, loss[loss=0.244, simple_loss=0.3128, pruned_loss=0.08759, over 21595.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3125, pruned_loss=0.07983, over 4267551.71 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:45:15,224 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0
2023-06-26 09:46:06,660 INFO [train.py:996] (1/4) Epoch 12, batch 30200, loss[loss=0.2611, simple_loss=0.3525, pruned_loss=0.08485, over 21695.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3148, pruned_loss=0.07858, over 4269918.59 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:46:15,741 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.538e+02 9.559e+02 1.334e+03 1.956e+03 4.030e+03, threshold=2.668e+03, percent-clipped=0.0
2023-06-26 09:47:03,635 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2193966.0, ans=0.125
2023-06-26 09:47:19,981 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2194026.0, ans=0.125
2023-06-26 09:47:50,665 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2194146.0, ans=0.0
2023-06-26 09:47:51,641 INFO [train.py:996] (1/4) Epoch 12, batch 30250, loss[loss=0.2862, simple_loss=0.3994, pruned_loss=0.0865, over 21272.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3238, pruned_loss=0.08133, over 4272248.21 frames. ], batch size: 549, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:47:55,578 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2194146.0, ans=0.0
2023-06-26 09:49:03,937 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2194326.0, ans=0.125
2023-06-26 09:49:04,607 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5
2023-06-26 09:49:32,075 INFO [scaling.py:962] (1/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0
2023-06-26 09:49:39,830 INFO [train.py:996] (1/4) Epoch 12, batch 30300, loss[loss=0.1959, simple_loss=0.2608, pruned_loss=0.0655, over 21245.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.32, pruned_loss=0.0807, over 4269547.67 frames. ], batch size: 144, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:49:48,252 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.599e+02 1.166e+03 1.867e+03 3.678e+03, threshold=2.332e+03, percent-clipped=6.0
2023-06-26 09:50:35,512 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2194566.0, ans=0.04949747468305833
2023-06-26 09:50:39,204 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2194566.0, ans=0.0
2023-06-26 09:50:59,907 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2194626.0, ans=0.125
2023-06-26 09:51:03,598 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2194626.0, ans=0.0
2023-06-26 09:51:34,371 INFO [train.py:996] (1/4) Epoch 12, batch 30350, loss[loss=0.266, simple_loss=0.3506, pruned_loss=0.09076, over 21864.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.319, pruned_loss=0.08138, over 4264274.28 frames. ], batch size: 317, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:51:56,990 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2194806.0, ans=0.125
2023-06-26 09:52:52,559 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2194926.0, ans=0.0
2023-06-26 09:53:10,545 INFO [train.py:996] (1/4) Epoch 12, batch 30400, loss[loss=0.2024, simple_loss=0.2579, pruned_loss=0.07343, over 20265.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3128, pruned_loss=0.07953, over 4251247.45 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0
2023-06-26 09:53:18,336 INFO [optim.py:471] (1/4) Clipping_scale=2.0, grad-norm quartiles 5.851e+02 1.020e+03 1.381e+03 2.107e+03 4.576e+03, threshold=2.761e+03, percent-clipped=19.0
2023-06-26 09:54:13,901 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2195286.0, ans=0.0
2023-06-26 09:54:39,885 INFO [train.py:996] (1/4) Epoch 12, batch 30450, loss[loss=0.304, simple_loss=0.4252, pruned_loss=0.09144, over 19895.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3147, pruned_loss=0.07887, over 4193384.32 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0
2023-06-26 09:55:09,322 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2195406.0, ans=0.125
2023-06-26 09:55:25,492 INFO [scaling.py:182] (1/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2195466.0, ans=0.2
2023-06-26 09:55:53,104 INFO [train.py:1249] (1/4) Done!